
Human and Mobile Robot Tracking in Environments with Different Scales

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Qiang Zhai, B.S.

Graduate Program in Computer Science and Engineering

The Ohio State University

2017

Dissertation Committee:

Dr. Dong Xuan, Advisor
Dr. Chunyi Peng, Co-Advisor
Dr. Yuan F. Zheng
Dr. Teng Leng Ooi

© Copyright by

Qiang Zhai

2017

Abstract

In the near future, we envision many mobile robots working for or interacting with humans in various scenarios such as guided shopping, policing, and senior care. In these scenarios, tracking humans and mobile robots is a crucial enabling technology as it provides their continuous locations and identities. We study each kind of tracking individually as humans and mobile robots differ in appearance, mobility, capability of computation and sensing, and feasibility of cooperation with tracking systems.

In addition to these distinctions, the scale of the tracking environment is another key factor that determines the appropriate tracking algorithm. In small scale environments such as halls and rooms, tracking systems need to be lightweight with real-time performance. In large scale environments such as campuses and entire buildings, tracking systems face complex and noisy backgrounds, so they need to be robust and scalable. This dissertation examines problems for human and mobile robot tracking in both small and large scale environments in order to enable future human-robot applications.

First, we study human tracking. In small scale environments like halls, multiple cameras are typically deployed to monitor most areas. We need to track multiple humans across multiple cameras accurately in real-time. We present VM-Tracking, a tracking system that achieves these goals. VM-Tracking aggregates motion sensor information from humans' mobile devices, which accompany them everywhere, and integrates it with visual data from physical locations. In large scale human tracking, determining humans' identities among different visual surveillance scenes is a major challenge, especially with noisy and incorrect data. Since people's mobile devices connect to widespread cellular networks, we propose EV-Matching to address these practical challenges. EV-Matching is a large scale human tracking system that matches humans recorded in both cellular network and visual surveillance datasets. It achieves efficient and robust tracking.

Next, we study mobile robot tracking. Similar to human tracking, visual cameras are widely available in small scale environments. Mobile robots are capable of computation and cooperation, but their on-board sensing capacities have room for improvement. Thus, we present S-Mirror, a novel approach using infrastructural cameras to "reflect" sensing signals towards mobile robots, which greatly extends their sensing abilities. S-Mirror is a lightweight infrastructure that is easily deployed as it leverages existing visual camera networks, leaving most computation to robots. In large scale mobile robot tracking, scalability is a major challenge as infrastructure does not cover all areas and augmenting it is expensive. Since robots have built-in sensing capabilities, we propose BridgeLoc, a novel vision-based robot tracking system that integrates both robots' and infrastructural camera views. We achieve accurate view bridging via visual tags, images with special patterns. We develop a key technology, rotation symmetric visual tag design, that greatly extends BridgeLoc's scalability.

In this dissertation, we study human and mobile robot tracking in small and large scale environments. We design and implement all of the above systems. Our real-world experimental evaluations show the advantages of our work and demonstrate its potential for human-robot applications.

This is dedicated to my wife and my parents

Acknowledgments

Many people have helped me to reach this point after a long and fruitful journey. Here I would like to express my sincere gratitude to all the people who have helped me during my Ph.D. study.

First, I would like to express my sincere gratitude to my advisor, Dr. Dong Xuan, for his guidance and support throughout my Ph.D. study at The Ohio State University (OSU). In fact, I came to OSU without funding support or a master's degree. Normally, a professor would be unlikely to admit a student without a master's degree to his or her research group with funding support. However, Dr. Xuan gave me the opportunity to work with him for a quarter, and after that quarter he decided to give me continuous funding support to finish my Ph.D. study. I have always been grateful to him for the opportunity and trust he gave me. During my Ph.D. study, Dr. Xuan provided me with a lot of insightful advice and helped me out of many difficulties in my research. His diligence and enthusiasm towards research have deeply influenced me. Beyond research, Dr. Xuan is also a good friend in daily life. His suggestions, patience, and support helped me overcome many difficult and unexpected situations. The wonderful Thanksgiving, Christmas, and Chinese New Year parties at Dr. Xuan's home left wonderful memories in my mind. I feel so blessed to have Dr. Xuan as my advisor.

I also would like to greatly thank my co-advisor, Dr. Chunyi Peng. I met Dr. Peng at her faculty talk in the Computer Science and Engineering (CSE) Department at OSU. I was impressed by her excellent achievements and huge enthusiasm for academic research. After she became a faculty member in the CSE department, I had an opportunity to work with her on a research paper. During this work, she taught me many necessary research skills, including a systematic way of defining a problem and solid expression in writing a paper. After our first collaboration, she offered to co-advise me, which was a big affirmation for me. I gladly accepted this honor and kept learning from her. Beyond research, Dr. Peng is also a good friend to me. She cared about my daily life and gave me many good suggestions for my future career. I am fortunate to have her as my co-advisor.

Besides my advisors, I also would like to thank many other faculty members at OSU. Particularly, I would like to thank Dr. Yuan F. Zheng. I have collaborated with Dr. Zheng many times and he has given me a lot of invaluable advice during research. I have learned a great deal from Dr. Zheng and my research vision has been broadened. Besides, I need to thank Dr. Kannan Srinivasan for his incisive comments during my candidacy exam. I also thank Dr. Wei Zhao from the University of Macau for his help and insightful advice in my research. It is a great honor to work with Dr. Zhao.

In addition, I would like to thank my collaborators, especially Dr. Boying, Dr. Jin Teng, Dr. Xinfeng Li, Dr. Junda Zhu, Dr. Adam C. Champion, Dr. Gang Li, Dr. Fan Yang, Dr. Sihao Ding, Guoxing Chen, Zhang, Ying Li, Xingya Zhao, and Quanyi Hu. Without their selfless help, it would be impossible for me to finish all the research work in this dissertation. I also would like to thank Qijing Shen for his assistance during my Ph.D. defense.

Finally, I am very grateful for the unconditional love and support from my wife and my parents. I could not have finished my Ph.D. study without you. I love you.

Vita

April 9, 1988 ...... Born - Baotou, China

2011 ...... B.S. Information Security, Shanghai University, Shanghai, China
2011-present ...... Ph.D. Candidate, Computer Science and Engineering, The Ohio State University

Publications

Research Publications - Conference

Gang Li (co-primary author), Fan Yang (co-primary author), Guoxing Chen (co-primary author), Qiang Zhai (co-primary author), Xinfeng Li, Jin Teng, Junda Zhu, Dong Xuan, Biao Chen and Wei Zhao. “EV-Matching: Bridging Large Visual Data and Electronic Data for Efficient Surveillance”. in Proc. of IEEE International Conference on Distributed Computing Systems (ICDCS), June 2017.

Qiang Zhai, Fan Yang, Adam C. Champion, Chunyi Peng, Junda Zhu, Dong Xuan, Biao Chen and Wei Zhao. “S-Mirror: Mirroring Sensing Signals for Mobile Robots in Indoor Environments”. in Proc. of IEEE International Conference on Mobile Ad-hoc and Sensor Networks (MSN), December 2016.

Fan Yang, Qiang Zhai, Guoxing Chen, Adam C. Champion, Junda Zhu and Dong Xuan. “Flash-Loc: Flashing Mobile Phones for Accurate Indoor Localization”. in Proc. of IEEE International Conference on Computer Communications (INFOCOM), April 2016.

Qiang Zhai, Sihao Ding, Xinfeng Li, Fan Yang, Jin Teng, Junda Zhu, Dong Xuan, Yuan F. Zheng and Wei Zhao. “VM-Tracking: Visual-Motion Sensing Integration for Real-time Human Tracking”. in Proc. of IEEE International Conference on Computer Communications (INFOCOM), April 2015.

Ying Li, Sihao Ding, Qiang Zhai, Yuan F. Zheng and Dong Xuan. “Human Feet Tracking Guided by Locomotion Model”. in Proc. of IEEE International Conference on Robotics and Automation (ICRA), May 2015.

Sihao Ding, Qiang Zhai, Yuan F. Zheng and Dong Xuan. “Side-view Face Authentication Based on Wavelet and Random Forest with Subsets”. in Proc. of IEEE International Conference on Intelligence and Security Informatics (ISI), June 2013.

Xinfeng Li, Jin Teng, Qiang Zhai, Junda Zhu, Dong Xuan, Yuan F. Zheng and Wei Zhao. “EV-Human: Human Localization via Visual Estimation of Body Electronic Interference”. in Proc. of the mini-conference in conjunction with IEEE International Conference on Computer Communications (INFOCOM), April 2013.

Adam C. Champion, Xinfeng Li, Qiang Zhai, Jin Teng and Dong Xuan. “Enclave: Promoting Unobtrusive and Secure Mobile Communications with a Ubiquitous Electronic World”. in Proc. of the International Conference on Wireless Algorithms, Systems, and Applications (WASA), August 2012. (Best Paper Runner-up)

Research Publications - Journal

Sihao Ding, Gang Li, Ying Li, Xinfeng Li, Qiang Zhai, Adam C. Champion, Junda Zhu, Dong Xuan and Yuan F. Zheng. “Survsurf: Human Retrieval on Large Surveillance Video Data”. in Multimedia Tools and Applications (MTA) 76(5): 6521-6549, 2017.

Sihao Ding, Qiang Zhai, Ying Li, Junda Zhu, Yuan F. Zheng and Dong Xuan. “Simultaneous Body Part and Motion Identification for Human-following Robots”. in Pattern Recognition (PR), Elsevier, Vol. 50, Feb. 2016, pp. 118-130.

Fields of Study

Major Field: Computer Science and Engineering

Studies in: Computer Networking, Artificial Intelligence, Information Security

Contents

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita...... viii

List of Tables ...... xiii

List of Figures ...... xiv

1. Introduction ...... 1

1.1 Overview ...... 1
1.2 Human Tracking in Small Scale and Large Scale Environments ...... 3
1.3 Mobile Robot Tracking in Small Scale and Large Scale Environments ...... 5
1.4 Organization ...... 8

2. Related Work ...... 9

2.1 Human Tracking ...... 9
2.2 Mobile Robot Tracking ...... 11

3. Human Tracking in Small and Large Scale Environments ...... 14

3.1 VM-Tracking: Visual-Motion Sensing Integration for Real-time Human Tracking ...... 14
3.1.1 Overview ...... 14
3.1.2 System Design ...... 18
3.1.3 Implementation and Evaluation ...... 29

3.2 EV-Matching: Matching Human Identities on Large Electronic and Visual Data ...... 39
3.2.1 Overview ...... 39
3.2.2 Problem Definition ...... 43
3.2.3 EV-Matching Algorithm ...... 45
3.2.4 Parallelization with MapReduce ...... 52
3.2.5 Evaluation ...... 56
3.3 Summary ...... 60

4. Mobile Robot Tracking in Small and Large Scale Environments . . . . . 61

4.1 S-Mirror: Mirroring Sensing Signals for Mobile Robots in Indoor Environments ...... 61
4.1.1 Overview ...... 61
4.1.2 S-Mirror Design ...... 64
4.1.3 Implementation and Evaluation ...... 69
4.2 BridgeLoc: Bridging Vision-Based Localization for Robots ...... 76
4.2.1 Overview ...... 76
4.2.2 Background ...... 81
4.2.3 BridgeLoc Design ...... 83
4.2.4 Implementation and Evaluation ...... 95
4.3 Summary ...... 104

5. Final Remarks ...... 105

Bibliography ...... 110

List of Tables

Table Page

2.1 Comparison of our system with related techniques. (V: Vision, M: Motion, E: Electronics, A: Acoustics) ...... 11

3.1 Tracking error under different user distances ...... 33

3.2 Tracking error under different user speeds ...... 34

3.3 Tracking error with different user numbers ...... 34

3.4 Tracking error under different user appearances and illumination . . . 34

3.5 Processing time under different user numbers ...... 35

3.6 Processing time under different camera views ...... 36

3.7 Tracking errors of different tracking algorithms ...... 36

3.8 Processing delay of different tracking algorithms ...... 37

4.1 Transmitted data rate in single point localization ...... 72

4.2 Transmitted data rate in multiple requests ...... 73

4.3 Notation...... 103

List of Figures

Figure Page

3.1 V-M integration for human tracking ...... 20

3.2 VM-Tracking system workflow ...... 23

3.3 Motion sensors accelerated human detection ...... 23

3.4 Motion sensors solve occlusion problem ...... 27

3.5 Tracking scenario snapshot from three camera views (red boxes represent occluded persons) ...... 31

3.6 Our primary test field in a building lobby ...... 32

3.7 Tracking errors of VM and V-AF tracking ...... 36

3.8 Processing time of VM tracking (five user tracking) ...... 37

3.9 CDF of successful tracking time ...... 37

3.10 EV scenario ...... 46

3.11 Vague zone of scenario in spatial domain ...... 50

3.12 An example of algorithm for practical setting ...... 50

3.13 EID set splitting workflow ...... 53

3.14 Accuracy vs EID missing ...... 58

3.15 Accuracy vs VID missing ...... 59

4.1 S-Mirror Real-World Scenario ...... 63

4.2 S-Mirror Conceptual Scenario ...... 63

4.3 S-Mirror node module ...... 65

4.4 S-Mirror node workflow ...... 66

4.5 Localization workflow with motion sensors ...... 68

4.6 Localization performance at different distances ...... 71

4.7 Localization performance in multiple requests ...... 73

4.8 Localization error in continuous localization ...... 74

4.9 Localization delay in continuous localization ...... 75

4.10 Transmitted data rate in continuous localization ...... 76

4.11 Illustration of BridgeLoc...... 78

4.12 Vision-based localization using one camera...... 81

4.13 Localization in the basic out-of-FoV case...... 85

4.14 Visual tag structure and code block design...... 88

4.15 Two visual tag examples under circular shifts (0◦-, 90◦-, 180◦-, 270◦-clockwise rotations in turn). Example 1: (a)–(d); Example 2: (e)–(h). ...... 90

4.16 Maximum detection distance w.r.t. N ...... 91

4.17 View rotations in theory...... 92

4.18 System implementation ...... 96

4.19 Scenarios for experimental evaluation...... 98

4.20 Localization error under different robot distances and view angles . . 99

4.21 Ground truth (solid lines) and localized path (dashed lines) of two sample traces in field test ...... 100

4.22 Localization success rate under different numbers of tags in the environment ...... 101

4.23 Average localization error under different numbers of tags in the environment ...... 101

Chapter 1: Introduction

1.1 Overview

Robots are increasingly pervasive in daily life. In 2014, about 4.7 million service robots were sold for purposes such as healthcare, elder assistance, household cleaning, and public service (police patrols and museum tour guides) with sales projected to reach 35 million units between 2015 and 2018 [2, 3]. Mobile robots are expected to work for or interact with humans in various indoor and outdoor applications [6,52,81].

Tracking is a crucial supporting technology for enabling these applications [33,58,68].

Human and mobile robot tracking provides two pieces of key information: continuous locations and identities of tracked objects (i.e., humans and mobile robots). Continuous locations form the paths of tracked objects, which reveal the places of objects in the past, present, and future. Many applications demand such information. For instance, when a person summons a service robot, the robot needs to know the person's location as its destination. In addition, the robot needs to find its own location continuously for accurate navigation. Accurate identities are naturally required, as there are normally multiple humans and mobile robots in the tracking environment. The tracked objects' positions are meaningless if their identities are unknown.

We study the problems of human tracking and mobile robot tracking individually as each type of object (humans and robots) differs in appearance, mobility, computation and sensing capability, and feasibility of cooperation with tracking systems.

Each type of object has particular characteristics that impact the design of corresponding tracking systems. For instance, the shapes of human bodies are generally similar; hence, they can be easily detected and located with visual cameras. However, similar body shapes pose challenges for human identification. Detailed visual features such as facial appearances are required to distinguish different humans. But these visual features require special equipment and are sensitive to practical factors such as varying illumination. Thus, human identification needs assistance from other human characteristics. One viable solution is human identification via people's mobile devices, such as smartphones, that they carry everywhere. On the other hand, mobile robots have diverse shapes. Yet robots generally lack humans' objections to tracking systems (such as privacy concerns), and robots' on-board capabilities can easily cooperate with such systems. For example, robots can carry lights that flash in particular patterns to assist visual localization and identification [87].

Besides differences among tracked objects, the scale of tracking systems' environments is another key factor that determines the appropriate tracking algorithm. Different scales of environments host different applications, and each application has its own requirements that influence the design of the appropriate tracking system. Thus, different environmental scales entail different requirements for tracking systems.

Small scale environments such as offices are typically living spaces and workplaces that humans and mobile robots inhabit. Tracking systems for these environments need to achieve real-time and lightweight performance. Large scale environments such as campuses span large and complex areas. Tracking systems for these environments face nontrivial challenges regarding complex backgrounds, noisy data, and numerous tracked objects. Thus, robustness and scalability are crucial requirements for such tracking systems.

In this dissertation, we study human tracking as well as mobile robot tracking. For each tracking problem, we study its characteristics and develop tracking systems in small scale and large scale environments.

1.2 Human Tracking in Small Scale and Large Scale Environments

Visual (V) surveillance systems are extensively deployed in both small scale and large scale environments. They are an important tool for human localization and identification. Human tracking in visual surveillance systems has many practical applications such as visually guided navigation, assisted living, and so on [17, 57, 59]. Despite recent advances in visual tracking research, tracking systems that solely rely on visual information fail to meet small scale environments' requirements of accurate and real-time performance due to environmental uncertainty and intensive visual feature computation [86,91]. On the other hand, the proliferation of mobile devices (such as smartphones) in the past decade offers an opportunity to track humans based on these devices. Today, almost everyone carries a smartphone, each with unique electronic identifiers (IDs) such as the MAC address of its wireless network card. Using these IDs, we can identify each person based on his or her device. Furthermore, smartphones have various sensors and transmit electronic signals continuously. Thus, continuously localizing humans based on their smartphones' motion sensors and data transmissions is viable and has attracted considerable attention from industry and academia [1, 24, 46, 49]. In this dissertation, we study human tracking based on integrating visual surveillance systems and humans' smartphones.

Small scale environments such as offices are typically humans' living spaces and workplaces. Thus, tracking accuracy and real-time performance are two important performance metrics for small scale human tracking systems. Also, humans may actively participate in the tracking process if their cooperation is not bothersome, as tracking systems assist their lives and work. This cooperative nature, together with the proliferation of smartphones, enables opportunities for integrating multiple sensor modalities to improve tracking performance. In light of this, we present VM-Tracking, a human tracking system that achieves accurate and real-time performance [91]. The system aggregates motion (M) information from motion sensors on humans' smartphones and integrates it with visual (V) data from physical locations. The system has two key features, location-based VM fusion and appearance-free tracking, that distinguish it from other existing human tracking systems [76,77]. We have implemented the VM-Tracking system and conducted comprehensive real-world experiments in challenging scenarios. Our results show VM-Tracking's superior performance in terms of time efficiency and tracking accuracy.

Large scale environments entail tracking in huge physical areas where humans move over long periods of time. Thus, active human cooperation in large scale tracking systems is not viable as it raises the problems of extensive power consumption, privacy, and other practical issues. However, mobile devices such as smartphones remain an effective tool to identify humans. Widely deployed cellular network infrastructures afford an opportunity to track humans without their cooperation. Smartphones constantly transmit electronic (E) signals in the background to maintain connections with nearby cellular base stations. These signals carry the device owner's electronic identity (EID) such as the International Mobile Subscriber Identity (IMSI) for GSM, UMTS, and LTE cellular networks as well as the device's WiFi MAC address and Bluetooth ID. Surveillance systems use these EIDs in order to identify device owners. These signals' received signal strength indicators (RSSIs) also indicate their locations. Thus, we propose EV-Matching [45], a large scale human tracking system that matches E and V data based on their spatiotemporal correlation. In particular, we focus on practical challenges where E and V data do not match perfectly due to environmental noise and complexity. To address them, we design three key algorithms: set splitting, filtering, and match refining. We implement EV-Matching on Apache Spark [4] to accelerate the process further. We conduct extensive experiments on a large synthetic dataset under different settings. Results demonstrate the feasibility, efficiency, and robustness of our proposed algorithms.

1.3 Mobile Robot Tracking in Small Scale and Large Scale Environments

Mobile robots require more intensive tracking than humans to fulfill their jobs as robots are generally less intelligent than humans. Thus, assistive infrastructure for mobile robot tracking is necessary as robots' own sensing capabilities are limited. There is considerable mobile robot tracking infrastructure based on visual surveillance [19,31], massive deployments of magnetic strips [10,53] and RFID tags [20,56], and other types of infrastructure [34,50]. Unlike other types of infrastructure, visual surveillance based tracking systems rely on widely available visual infrastructure; hence, massive installations are not required. In addition, visual signals provide rich information about robots and environments. Thus, mobile robot tracking using visual signals offers enormous potential. Unlike humans, robots have various types of on-board sensors and dedicated computation capabilities. They can cooperate with tracking systems to a greater degree than humans. However, mobile robot tracking faces challenges arising from robots and environments. Mobile robots have various shapes and different sensing, computation, and locomotion capabilities. In addition, practical environments require tracking systems to minimize environmental changes yet achieve lightweight performance. In particular, large scale mobile robot tracking needs to be scalable as the robots are expected to work in broad areas. To address these challenges, this dissertation studies mobile robot tracking based on the cooperation of visual surveillance systems and robots.

One major challenge to visually track mobile robots in small scale environments is achieving both intensive tracking and lightweight assistance. As robots' own sensing capabilities are limited, intensive tracking normally requires heavy infrastructural support. On the other hand, lightweight performance of mobile robot tracking systems usually entails the adoption of simple system designs and full use of robots' characteristics. As mobile robots have dedicated computation ability and high cooperation feasibility, we propose S-Mirror [92] to achieve intensive and lightweight mobile robot tracking. S-Mirror is a novel infrastructure that "reflects" various (mainly visual) ambient signals towards mobile robots, greatly extending their sensing abilities. S-Mirror forms a network of S-Mirror nodes that mainly reflect visual signals (as well as electronic and acoustic signals) to assist robot tracking. We implement S-Mirror and a mobile robot prototype on commercial off-the-shelf (COTS) hardware. Our real-world experimental validation shows that S-Mirror achieves accurate and timely tracking with low network bandwidth consumption as well as robustness and scalability with many mobile robots.

In large scale environments, infrastructural cameras' views cannot fully cover entire areas due to the complexity of practical situations. Thus, purely infrastructural camera based mobile robot tracking systems cannot achieve intensive tracking in these environments. On the other hand, robot camera based tracking approaches exist as well. As visual sensing greatly eases mobile robots' operations and interactions with humans, most robots are equipped with cameras [38, 39, 82]. Thus, mobile robots could localize themselves by visually recognizing their surroundings. However, robot camera based tracking approaches have poor performance in visually similar areas. Using visual tags (i.e., visual images with particular patterns) limits the scalability of tracking systems due to the low availability of unique visual tags.

We propose BridgeLoc to enable effective and scalable mobile robot tracking in large scale environments. BridgeLoc is a novel vision-based indoor robot tracking system that integrates both robots' and infrastructural cameras. Our system develops three key technologies: robot and infrastructural camera view bridging, rotation symmetric visual tag design, and continuous tracking based on robots' visual and motion sensing. Our system bridges robots' and infrastructural cameras' views to accurately localize robots. We use visual tags with rotation symmetric patterns to greatly extend scalability. Our continuous tracking enables robot localization in areas without visual tags and infrastructural camera coverage. We implement our system and build a prototype robot using COTS hardware. Our real-world evaluation validates BridgeLoc's promise for indoor robot localization.

1.4 Organization

The rest of this dissertation is organized as follows. Chapter 2 describes related work. Chapter 3 presents our work on human tracking. This chapter discusses VM-Tracking, our system for real-time human tracking via visual and motion sensing integration, as well as EV-Matching, our large scale human tracking system based on electronic and visual data matching. Chapter 4 presents our work on mobile robot tracking. This chapter illustrates S-Mirror, our lightweight infrastructure for mobile robot tracking in small scale environments, as well as BridgeLoc, our system that bridges robots' and infrastructural cameras' views for tracking in large scale environments. Chapter 5 concludes the dissertation and describes directions for future work.

Chapter 2: Related Work

In this chapter, we review existing work related to this dissertation. In general, there are two categories of related work: human tracking and mobile robot tracking. We discuss human tracking and mobile robot tracking in Sections 2.1 and 2.2, respectively.

2.1 Human Tracking

Human tracking has been a hot research area in recent years. Multiple categories of techniques can be applied to human tracking. In this section, we briefly review the works closely related to our VM-Tracking and EV-Matching systems.

Vision based human tracking: Vision based human tracking can be categorized into two classes: sampling-based tracking, such as [40, 71], and tracking-by-detection, such as [11, 74]. Shen et al. [74] proposed a method for camera networks and realized a single-object tracking system. The work in [90] proposes a system for localizing and tracking multiple people by integrating several visual cues. Efforts have also been made to apply a priori constraints to tracking [40].

Non-vision based human tracking: Electronics based technologies localize wireless devices at separate time instants, and then put the locations together as tracking results [66]. Banerjee et al. [8] proposed combining Bluetooth and WiFi RSSI readings to localize cellphones. Yang et al. [88] proposed an indoor localization system based on WiFi fingerprints without heavy human intervention. Sen et al. [73] explored Channel Frequency Response as fingerprints for indoor localization. All these systems suffer from noise from the environment and human bodies [95] to some extent. Acoustic localization techniques [36,51] can achieve high accuracy under the condition that line-of-sight paths exist between sender devices and receiver devices. Meanwhile, three or more anchor devices are required to cover the same area to perform time-of-arrival (TOA)-based trilateration.

Sensing-fusion based human tracking: Methods utilizing data fusion of different types of sensors are also found in the literature. Teixeira et al. [76, 77] achieve tracking by theoretically solving the problem of trajectory association between vision and motion sensors, assuming, however, that the vision algorithm can perfectly detect all humans in the view and that the motion sensors are noiseless. EV-Loc [93] integrates electronic and visual signals for localization. It requires one-to-one matching between electronic devices and visual detections and accumulating E-V data to find the correct associations, which is less error-tolerant and not a real-time approach like ours. EV-Human [48] is an extension of EV-Loc and has the same problems. Fan et al. [43] proposed a particle-filtering based motion sensor fusion approach for self-tracking. A combination of inertial sensors and a camera for self-tracking is introduced in [27]. Roetenberg et al. [69] proposed a system that consists of magnetic sensors and inertial sensors for motion tracking. The NavShoe system [26] is an orientation-only tracking system designed to be embedded in shoes based on wireless inertial sensors, and it achieves meter-level accuracy.

Methodology | Insensitive to Environment | Insensitive to Human Body Interference | Commodity Hardware | Latency | Median Accuracy
Location based V-M fusion | ✓ | ✓ | ✓ | Low, Real-time | High (0.43m)
Trajectory based V-M fusion [76,77] | ✓ | ✓ | × | High | Medium (0.8m)
Doppler effect based M-A fusion [36] | × | × | × | Low, Real-time | High (0.4m) ∼ Medium (0.92m)
Appearance based V [71,90] | ✓ | × | ✓ | High | Medium (0.963m)
E Antenna array [85] | ✓ | ✓ | × | Low, Real-time | High (0.23m)
E RSSI Fingerprinting [88] | × | × | ✓ | Low | Medium (Room-level)

Table 2.1: Comparison of our system with related techniques. (V: Vision, M: Motion, E: Electronics, A: Acoustics)

We summarize the differences between VM-Tracking and existing works in Table 2.1. We can see that VM-Tracking strikes a good balance between performance and scenario compatibility. With a moderate infrastructure cost, we are able to achieve high accuracy and real-time performance for dense human tracking in small scale environments. We also compare EV-Matching to existing works on human re-identification [9, 12–14]. Details can be found in [44,45].

2.2 Mobile Robot Tracking

In this section, we mainly review existing vision-based mobile robot tracking approaches related to S-Mirror and BridgeLoc. They can be categorized into two classes, using either robots' cameras [21, 54, 60, 67, 72, 94] or infrastructural cameras [23,37,62]. We also review non-vision-based robot tracking [20,34,61].

Robot-camera-based localization: Robot-camera-based approaches rely on robots' own on-board cameras to localize themselves. In these approaches, robots detect visual features in the environment and localize themselves based on their locations and orientations. Most existing work detects visual features from natural environmental landmarks, as artificial visual tags are limited by the uniqueness and detection range trade-off. However, natural landmarks are sensitive to lighting conditions, camera quality, and other practical factors, and may even be absent in areas such as corridors. Moreover, existing work requires robots to know visual features' locations and orientations. Image-based localization [21, 72] assumes these locations and orientations are already stored in a database. Such approaches require cumbersome site surveys and do not scale to large areas. In Simultaneous Localization And Mapping (SLAM) [54, 60], robots learn environmental features (mapping) and localize themselves simultaneously. However, SLAM mapping is time-consuming for robots due to their sensing limitations and cannot achieve satisfactory accuracy, especially in the initial stage. Another approach, Visual Odometry (VO) [67, 94], tracks robot motion based on differences between two consecutive images following robot movement. This tracking-based approach faces challenges of accumulated drift error and environmental noise. Compared with them, our system enables artificial tags in large areas and achieves high localization accuracy based on the integration of widely available infrastructural cameras and robots' cameras. We also avoid tedious site surveys and environmental learning, as infrastructural cameras naturally provide environmental ground truth.

Infrastructural-camera-based robot localization: Most infrastructural-camera-based approaches require robots to carry visual tags and perform localization via visual tag detection. Several visual tag designs have been proposed, including ARTag [23], QR-Code [37], and AprilTag [62]. This work requires infrastructural cameras' full visual coverage, which is practically infeasible. Existing visual tags are also designed to be rotation asymmetric in support of complex visual patterns. Our work bridges infrastructural and robots' cameras together to avoid full infrastructural visual coverage. We also design our visual tags to be rotation symmetric, which significantly increases the number of unique tags without sacrificing detection range, based on our observation of physically constrained robot motion.

Non-vision-based robot localization: In addition, there are non-vision-based approaches using various sensing signals. Ocaña et al. [61] design a WiFi fingerprinting scheme for robot localization that achieves meter-level accuracy. DiGiampaolo et al. [20] install passive RFID tags on robots for localization, which only works in small areas due to RFID's limited signal range. Hess et al. [34] propose SLAM using laser detection and ranging (LiDAR) that can achieve high localization accuracy. However, LiDAR devices are very expensive, which makes their system costly.

Chapter 3: Human Tracking in Small and Large Scale Environments

In this chapter, we study the problem of human tracking in small and large scale environments. First, we propose VM-Tracking, a real-time human tracking system in small scale environments. Next, we study human tracking in large scale environments. Specifically, we design EV-Matching, a large scale human tracking system that matches E and V data based on their spatiotemporal correlation. We show that EV-Matching achieves efficient and robust tracking. In particular, this dissertation focuses on practical challenges arising from noisy and complex environments.

3.1 VM-Tracking: Visual-Motion Sensing Integration for Real-time Human Tracking

3.1.1 Overview

Human tracking in videos provides a direct and context-rich way to localize humans and analyze their behavior [74]. It is the enabling technology for a range of applications including smart surveillance, guided navigation, assisted living, etc. [17, 57, 59]. For certain applications such as video surveillance, human objects are often passively tracked; however, there are a plethora of scenarios where humans may actively participate in the tracking process. The cooperative nature of such applications, together with the proliferation of mobile devices, enables the possibility of integrating multiple sensor modalities to improve tracking performance.

Tracking accuracy and real-time performance are two important performance metrics for practical human tracking systems. Tracking accuracy involves two issues: identity and location. A person needs to be continuously tracked throughout videos despite possible distractions from other moving objects and/or changing environments, i.e., maintaining accurate identity association. Meanwhile, a person's physical location also needs to be obtained up to a certain desired accuracy. Real-time performance concerns the processing latency of a tracking system. For applications such as smart surveillance and assisted living, low processing latency is desirable for prompt responses to unexpected events.

Unfortunately, the state-of-the-art human tracking methods can hardly meet the accuracy and real-time requirements, especially when multiple cameras are employed to cover an extended area and track many objects:

- Tracking accuracy is undermined by environmental uncertainty. Many efforts can achieve good performance for a small area over a short period of time. However, in a larger area over a longer time span, environmental changes, e.g., in lighting or human appearance (especially across different cameras), can significantly lower the tracking accuracy.

- Real-time performance is hampered by intensive visual feature computation. Conventional visual tracking algorithms rely on visual features to associate persons across different frames to form continuous trajectories. Appearance features such as histograms, wavelet coefficients, and textures are high-dimensional and involve intensive computation. Moreover, optimally associating humans across two frames has cubic computational complexity (in terms of the number of humans), while associating across three or more frames is an NP-hard problem [5].

A. Novelty of Our Approach

In this work, we propose a novel human tracking system named VM-Tracking, which works by closely integrating visual data (V) with motion sensor information (M). The new system can effectively address the accuracy and real-time issues conventionally associated with visual tracking. There are some existing efforts [76, 77] on integrating visual and motion sensing for human tracking; however, our system differs significantly from them in the following two ways:

- Location-based VM fusion vs. trajectory-based VM fusion. Existing efforts integrate visual and motion sensing information based on the similarity of the V trajectory and the M trajectory. They need to record and track long enough trajectories to ensure distinguishability among different human objects, which leads to long and unpredictable latency. Existing work [77] needs 4.5 seconds to correctly associate V and M trajectories. We propose an efficient visual and motion sensing fusion method based on location proximity. Our VM association is performed every 0.3 seconds. At each time point, our system matches each visual human object with the motion sensor that shows up at the closest location, and updates the V-M locations in a timely manner. Furthermore, our VM fusion is performed individually, which avoids the inefficiency of global matching that compares all visual human objects and motion sensors. However, the accuracy of our location-based VM fusion is more sensitive to motion sensor noise due to accumulated error. To diminish it, we utilize the robust appearance-free V tracking mentioned below. It accurately tracks every human's location at every time point, which serves as a benchmark that significantly reduces the accumulated error.

- Human appearance-free V tracking vs. human appearance-based V tracking. Existing efforts rely on conventional visual tracking, which tracks humans based on appearance similarities. However, it is error-prone to varying camera views and environments. We propose a human shape based visual tracking algorithm. Instead of each human's specific visual appearance features, such as color ratios, our approach only utilizes general human shape information, which is much less affected by camera views and environments. We call this method appearance-free in this work. At each time point, i.e., every 0.3 seconds, our system detects all human objects in camera views based on their shapes, and calculates their physical locations with calibrated cameras. Since humans are likely to keep similar velocities within a short time period (around 0.3 seconds in our system), our system predicts each human's location based on his previous trajectory and then associates detected human objects with humans based on location proximity. Since we only utilize a general human shape model, our appearance-free tracking algorithm is robust to varying camera views and environments. However, our system needs to search over dense image locations and multiple scales to detect human objects every 0.3 seconds, which is potentially a huge computation burden. We avoid this exhaustive search based on the timely location-based VM fusion discussed in the above paragraph. The objects' locations and sizes in the image can be predicted by the associated motion sensors, which significantly reduces the search workload. A simplified sketch of this location-based association follows.
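To make the location-based association concrete, here is a minimal Python sketch of the per-interval matching step, under simplifying assumptions: the predicted motion-sensor (MID) locations and the visually detected (VID) locations are already available as ground-plane coordinates in meters, and the 1.0 m threshold and all names are illustrative rather than taken from the actual VM-Tracking implementation.

```python
import math

# Hypothetical sketch of location-based V-M association (not the actual
# VM-Tracking code). Each MID carries a location predicted from its last known
# position plus dead-reckoned displacement; each VID carries a location
# computed from a calibrated camera view.

ASSOCIATION_THRESHOLD_M = 1.0   # assumed maximum V-M distance for a valid match


def associate_vm(mid_predictions, vid_locations, threshold=ASSOCIATION_THRESHOLD_M):
    """Greedily match each MID to the closest unmatched VID within `threshold`.

    mid_predictions: dict {mid: (x, y)} predicted from motion sensing.
    vid_locations:   dict {vid: (x, y)} computed from visual detections.
    Returns a dict {mid: vid} of valid associations.
    """
    candidates = []
    for mid, (mx, my) in mid_predictions.items():
        for vid, (vx, vy) in vid_locations.items():
            dist = math.hypot(mx - vx, my - vy)
            if dist <= threshold:
                candidates.append((dist, mid, vid))

    candidates.sort()                     # closest pairs first
    matched_mids, matched_vids, result = set(), set(), {}
    for dist, mid, vid in candidates:
        if mid not in matched_mids and vid not in matched_vids:
            result[mid] = vid
            matched_mids.add(mid)
            matched_vids.add(vid)
    return result


if __name__ == "__main__":
    mids = {"phone-A": (1.0, 2.1), "phone-B": (4.2, 0.5)}
    vids = {"det-1": (1.2, 2.0), "det-2": (4.0, 0.8)}
    print(associate_vm(mids, vids))       # {'phone-A': 'det-1', 'phone-B': 'det-2'}
```

Because each person is matched by proximity alone, the step is linear in the number of nearby candidate pairs rather than requiring a global appearance comparison.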

B. Our Contributions

With the above two key features, i.e., location-based VM fusion and appearance-free V tracking, our VM-Tracking system can achieve real-time and accurate human tracking. This work presents a detailed discussion of the design, implementation, and evaluation of the VM-Tracking system. We claim the following main contributions of this work:

• We propose an appearance-free visual human tracking system that is the first to efficiently and accurately track multiple humans over a large area covered by multiple cameras.

• We propose an efficient visual-motion sensing integration approach, and design the VM-Tracking system with three key modules: location-based human detection, appearance-free object association, and tracking loss recovery.

• We implement the VM-Tracking system with further boosts and enhancements, including processing module pipelining, GPU-accelerated human detection, and electronic check-points.

• We evaluate system performance with various system settings in terms of accuracy and time, as well as performance under realistic tracking scenarios. Our tracking system achieves real-time operation, with less than 0.5 seconds of delay and 0.43 meters of error.

3.1.2 System Design

A. Design Rationale

The objective of our system is to accurately track multiple human objects in real-time within a large space. Despite recent advances in visual tracking research, the accuracy and real-time performance of state-of-the-art visual tracking algorithms are still far from perfect [84]. Visual tracking of multiple human objects faces even greater challenges when we have multiple cameras covering a large area, as required in our targeted application. For our application, we argue that camera-only tracking techniques cannot meet the accuracy and real-time requirements at the same time, based on the following observations: 1) Visual appearances of the same object across different camera views may differ significantly, which requires complicated and hence heavy computation to handle properly. 2) Visual occlusions can hardly be resolved with a single camera, which may cause erroneous tracking. Employing multiple cameras with overlapping fields of view to resolve occlusions introduces a large computation burden, and thus hampers real-time performance. 3) Appearance models are usually high-dimensional, resulting in high computation complexity. Existing VM fusion approaches [76, 77] that rely on conventional visual tracking also suffer from the same problems. Besides, their trajectory-based VM fusion is inefficient because recording and tracking unpredictably long trajectories is required to ensure distinguishability among different human objects.

In this work, we incorporate the motion sensor measurements of a mobile device as a new measurement dimension, in order to address the dilemma of accuracy and real-time performance inherent to existing tracking methods. Visual and motion sensor information is closely integrated in our novel tracking methodology named VM-Tracking. Fusing all motion sensor readings enables us to estimate the user's instantaneous velocity and location. The integration of visual and motion sensor information as proposed in this work overcomes the accuracy and real-time problems faced by conventional visual tracking and existing VM fusion tracking systems for the following three reasons.

Figure 3.1: V-M integration for human tracking

First, human detection from individual video frames can be greatly accelerated with the aid of motion sensor information. For conventional visual and existing VM fusion approaches, detecting human objects in a video frame requires searching over dense image locations and multiple scales, and therefore the computation is intensive. With the motion sensor information, exhaustive search is not necessary. Given the previous human detections in a video frame and their motion estimated from the motion sensor readings, the objects' locations as well as their image sizes in the frame can be predicted, as the dashed circles show in Figure 3.1. Only a local search around the predicted image regions and limited image scales is required to find the human objects in the current frame. The reduced search space improves real-time performance without loss of accuracy.

Secondly, location-based visual and motion sensor integration provides an effective solution to track human objects across different frames. Human objects are considered as multiple moving points based on their physical locations. We choose to use motion sensor information as the evidence for associating multiple human detections that belong to the same person across different frames. In light of the spatial limitations of human objects, our tracking system utilizes physical location proximity and velocity to measure the likelihood of associating human detections. Figure 3.1 illustrates two such cases. The physical location based association enables our system to track each person individually, which is more scalable than trajectory-based VM fusion algorithms.

Thirdly, tracking loss can be recovered with motion sensor information. In practical scenarios, it is common for tracked objects to undergo long visual occlusions. With motion sensor information, the locations of human objects can be continuously acquired. Similar locations of a visual object can be associated to form a continuous tracking result.

B. Workflow

Figure 3.2 shows the workflow of our VM-Tracking system. Different tracking procedures are illustrated using different colors and line types.

First, we have multiple video cameras covering a large area with very small overlapping areas to form a continuous tracking area. Our motion data come from the motion sensors of users' mobile phones and are denoised locally. All the V and M data are collected by a central server for further processing. We identify the motion data coming from the same mobile phone with an electronic identifier such as the WiFi MAC address, called the MID. Meanwhile, visual objects detected in the video frames are called VIDs. Our VM-integrated human tracking is based on the association of the MIDs and VIDs. The association uses physical proximity.

The MID location is estimated from its location in the last frame and its motion during the current frame. If no previous location is available, it means this MID has not been initialized with a VID or it has been lost. We will discuss the initialization/loss recovery module later (in green dashed lines). On the V side, human detection is performed on the current video frame with the acceleration of motion information. If there are no detected VIDs (for example, caused by occlusion), we simply use the estimated MID locations as the tracking results in the current frame, and then go to the next frame (in orange dotted lines). If there are detected VIDs, we compute their physical locations with calibrated cameras. Next, we integrate the MIDs and VIDs based on physical proximity. Once the similarity of a VID and a MID exceeds a certain threshold, we consider them a valid association and update their locations using the VID location. After the update, we continue to the next frame.

VID location. After the update, we continue to the next frame.

Besides the normal tracking module, we have another workflow for the initializa- tion/loss recovery module (denoted as the green paths in Figure 3.2). We determine a tracking loss under the condition of no valid VID-MID association. The initialization is same to the loss recovery, whose purpose is to find a correct VID for a “new” MID.

Our initialization/loss recovery is based on V-M association filtering. After assuming a large candidate VID set, we gradually filter away impossible associations and finally reach the unique correct VID with theoretical guarantee.

Next, we discuss detailed designs of three key modules: human detection, object association and loss recovery.

C. Motion Sensor Accelerated Human Detection

In the proposed VM-Tracking system, the instantaneous locations of tracked ob- jects are obtained using visual human detection accelerated by motion sensor infor- mation. The human objects are delineated by bounding boxes in a video frame, and their physical locations can be easily calculated using projective geometry model given calibrated cameras. Note that such visual detections are yet to be associated

22 Figure 3.2: VM-Tracking system workflow

with individual object identities in order to form continuous trajectory, which will be described in detail in the following subsection.

Figure 3.3: Motion sensors accelerated human detection

Detection performance of state-of-the-art visual human detection algorithms has now reached some level of maturity, in that human objects can be reliably detected from nature scenes with high detection and low false alarm rates [18, 22]. However, due to the nature of the underlying computation model, visual detection often in- curs significant computational cost. The visual detection process involves exhaustive

23 search, i.e. sliding a window across an image at multiple scales and classifying each lo- cal window as the target or background. For practical applications, video data from multiple cameras need to be processed simultaneously in real-time, which imposes heavy computational demand on the video system.

We propose to utilize motion sensor information to avoid exhaustive search, and thus provide a computationally efficient solution to visual human detection under cooperative settings. Motion sensor information from a target is used to predict its motion between two consecutive frames. Given a target’s previous location, the possible area within where it is likely to appear can be determined based on the error model of motion sensors. The corresponding region and size in the image domain are determined given the camera calibration model. Sliding window search can now be confined within a local neighborhood of the spatial-scale space, as illustrated in

Figure 3.3.

The image pyramids are images at a sequence of different scales for multi-scale object detection. Usually the detection goes through all the pyramid levels. However, the largest scale and the smallest scale cost very different times. So it becomes necessary to save time on large scales if they do not probably contain the detection results. With the motion information, we reduce the number of pyramid levels for detection by predicting the possible pyramid levels the target appears in detectable scales. To estimate the pyramid level at which the detection performs, we assume

H is the homogeneous transform matrix obtained from camera calibration. At time t 1, the left bottom corner and right bottom corner of the bounding box is denoted − as x0 and x1. The width of the bounding box is w. Then the physical locations of them in real world is computed as X0 = x0 H and X1 = x1 H, respectively. · ·

24 0 0 Their next location X0 and X1 at time t is predicted using the motion information provided by motion sensor. The predicted real world locations are mapped back to image plane in pixel coordinates, from x0 = X0 H−1 and x0 = X0 H−1. Then the 0 0 · 1 1 · width w0 of the new predicted bounding box is estimated as the Euclidean distance

0 0 between x0 and x1. Based on the error model of motion sensors, this value may vary in [w0 n, w0 + n], in which n is the noise introduced by the error. The ratio of w and − w0 n is the estimated range of ratio of current pyramid scale and predicted pyramid ± scale.

D. Motion Sensor Assisted Appearance-free Object Association

We utilize the location proximity to associate human detections at consecutive time steps based on motion sensor information. For consecutive frames t 1 and t, − multiple persons are visually detected and their physical locations can be calculated

given the camera’s calibration data. We represent the set of visually detected persons

at time t 1 as A = i=1,...,M , and at t as B = bj j=1,...,N , respectively. Further − { } { } denote the physical location of the i-th person seen from the visual camera at time

t 1 as v(ai), and similarly the j-th at time t as v(bj). For ai at time t 1, we can − − know his speed from the visual cameras, as well as from the motion sensor. On the

visual camera side, we can estimate his moving speed, and calculate his estimated

location at time t, which is denoted asv ˜t(ai). Similarly, on the v sensor side, we

can use acceleration information to estimate his average moving speed and calculate

his estimated location at t, denoted asm ˜ t(ai). Assume we know the actual speed distribution, then we can calculate qij (the matching probability of ai and bj based on visual estimations) and rij (the matching probability of ai and bj based on motion

estimations) as

25 qij = (ai = bj v˜t(ai), vt(bj)) |

rij = (ai = bj m˜ t(ai), vt(bj)) |

Suppose the visual estimation has a standard deviation of σv, and the motion

estimation has a standard deviation of σm, we calculate the fused probability pij as

2 2 2 2 2 2 σv/(σv+σm) σm/(σv+σm) pij = q r ij · ij With the above probability, we can gradually filter away impossible matching based on a threshold learnt offline. We perform the filtering over rounds until a one-to-one matching result is found.
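As an illustration only, the following sketch computes such a fused probability under the additional assumption that both location errors are isotropic Gaussians, so each side's match probability can be scored from a squared distance; none of these names or values come from the actual implementation.

```python
import math

# Hypothetical sketch of the fused matching probability described above
# (not the actual VM-Tracking code). sigma_v and sigma_m are the standard
# deviations of the visual and motion location estimates.


def gaussian_match_prob(predicted, observed, sigma):
    """Unnormalized probability that `observed` matches `predicted`."""
    d2 = (predicted[0] - observed[0]) ** 2 + (predicted[1] - observed[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))


def fused_probability(v_pred, m_pred, v_obs, sigma_v, sigma_m):
    """Fuse the visual- and motion-based match probabilities as in the text."""
    q = gaussian_match_prob(v_pred, v_obs, sigma_v)   # visual-side estimate
    r = gaussian_match_prob(m_pred, v_obs, sigma_m)   # motion-side estimate
    wv = sigma_v ** 2 / (sigma_v ** 2 + sigma_m ** 2)
    wm = sigma_m ** 2 / (sigma_v ** 2 + sigma_m ** 2)
    return (q ** wv) * (r ** wm)


if __name__ == "__main__":
    # Predicted locations for person a_i from vision and motion, plus one detection b_j.
    p = fused_probability(v_pred=(2.0, 3.1), m_pred=(2.2, 2.9),
                          v_obs=(2.1, 3.0), sigma_v=0.3, sigma_m=0.5)
    print(round(p, 3))
```

Pairs whose fused probability falls below the offline-learnt threshold are discarded, and the filtering repeats until a one-to-one matching remains.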

When multiple objects are in close proximity or mutually occluded, we resort to the motion sensor information to resolve the ambiguity arising from the visual domain. When two persons approach each other and undergo visual occlusion within a camera's view, the visual detection algorithm will fail to generate two separate detections. Conventional visual tracking methods will very likely fail, and may produce switched target identities after they separate if the two objects are visually similar. With the aid of motion sensors, our method can handle the occlusion case. Once the objects separate and produce multiple visual detections, the motion direction as estimated from the motion sensors serves as the hint for associating a visual detection with the one in the previous time step, as illustrated in Figure 3.4.
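A minimal sketch of this direction-based re-association is shown below; it assumes each MID's heading from its motion sensors and each reappearing detection's displacement from the occlusion point are already available as 2-D vectors, and greedily pairs them by cosine similarity. This is an illustration, not the system's actual logic.

```python
# Hypothetical sketch of direction-based re-association after an occlusion
# (not the actual VM-Tracking code).


def cosine(u, v):
    dot = u[0] * v[0] + u[1] * v[1]
    nu = (u[0] ** 2 + u[1] ** 2) ** 0.5
    nv = (v[0] ** 2 + v[1] ** 2) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


def reassociate_after_occlusion(mid_headings, vid_displacements):
    """Greedily assign each MID the detection with the most similar direction."""
    assignment, used = {}, set()
    pairs = sorted(((cosine(h, d), mid, vid)
                    for mid, h in mid_headings.items()
                    for vid, d in vid_displacements.items()), reverse=True)
    for score, mid, vid in pairs:
        if mid not in assignment and vid not in used:
            assignment[mid] = vid
            used.add(vid)
    return assignment


if __name__ == "__main__":
    headings = {"phone-A": (1.0, 0.1), "phone-B": (-0.9, 0.2)}
    displacements = {"det-left": (-1.1, 0.3), "det-right": (0.8, 0.0)}
    print(reassociate_after_occlusion(headings, displacements))
    # {'phone-B': 'det-left', 'phone-A': 'det-right'}
```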

E. Motion Sensor Guided Loss Recovery

Ideally, our system keeps tracking the human objects all the time. However, tracking losses can happen for many reasons, and it is not easy for existing visual

Figure 3.4: Motion sensors solve the occlusion problem

tracking methods to handle this issue. Trajectory-based VM tracking algorithms have

to start over, which significantly hurts their real-time performance. In this light, we

propose a location-based loss recovery mechanism. When a MID loses its associated

VID, our tracking system starts the tracking loss recovery module by assigning this

MID to all possible VIDs in the view. The idea is to filter out impossible

associations over time with continuous V-M tracking until only the correct VID is left.

On a more technical level, we have several sequences of visual locations over time

(from different persons) from the cameras, and a sequence of acceleration readings (on

one person) from the motion sensors. We want to tell if one of those visual sequences

corresponds to the motion sequence. For one person, we get a sequence of velocities

from the visual side. Denote it as {P_i}, where P_i is calculated between time points T_{i−1}

and Ti. On the other hand, we integrate accelerations on the M side to get the velocity

sequence for a specific MID, and denote it as {Q_i}, where Q_i is integrated from T_{i−1} to T_i.

It is worth noting that {P_i} and {Q_i} are noisy data. The noise in {P_i} and {Q_i} can be considered i.i.d. and is typically modeled as Gaussian. If {P_i}

and {Q_i} are generated by different persons, their average difference will be strictly greater than zero after enough observations. In this situation, we have $E[P_i - Q_i] > 0$. We denote $\overline{P_i - Q_i} = \frac{1}{n}\sum_{i=1}^{n}(P_i - Q_i)$. We have the following theorem.

Theorem 1. For two sequences $\{P_i\}$ and $\{Q_i\}$ with the $P_i$s and $Q_i$s bounded and $E[P_i] > E[Q_i]$: $\forall c \in (0, 1)$, $\forall \varepsilon \in (0, E[P_i] - E[Q_i])$, $\exists N$ such that

$P\left(\overline{P_i - Q_i} > \varepsilon\right) > c$ for all $n \geq N$.

Theorem 1 states that if the V and M observations are from different people, we will eventually be able to tell them apart given enough observations. If we model the noise of the V and

M observations as Gaussian, we will be able to accurately tell how much confidence we have of telling them apart (ε and c) after how many rounds (N). Due to space limitation, we skip the proof of Theorem 1 and detailed calculations.
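For illustration, here is a rough sketch of that calculation (it is not the omitted proof): assume the per-round differences D_i = P_i − Q_i are i.i.d. Gaussian with mean d = E[P_i] − E[Q_i] and variance σ² = σ_P² + σ_Q². Then

    P\!\left(\overline{P_i - Q_i} > \varepsilon\right)
      = \Phi\!\left(\frac{(d - \varepsilon)\sqrt{n}}{\sigma}\right) > c
    \quad\Longleftrightarrow\quad
    n > N = \left(\frac{\sigma\,\Phi^{-1}(c)}{d - \varepsilon}\right)^{2},

where Φ is the standard normal CDF; this shows how a chosen confidence c and margin ε translate into a required number of rounds N under the Gaussian assumption.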

Theorem 1 gives the theoretical foundation that we are able to associate lost

VIDs with the correct MIDs. However, there is still another problem: state explosion. We match MIDs with every VID in every frame. In an ideal scenario where nobody enters or leaves, suppose we have D frames and N MIDs. The total number of combinations is then $N^D$, which significantly hurts our real-time performance as D grows. Our location-based object association method actually solves this problem efficiently. Motion sensors also give a displacement constraint for each human object over a fixed time period. Assume the association threshold is

τ. We can guarantee that no VID whose probability exceeds τ is filtered away in the process, because sub-trajectories carry larger probabilities than the entire trajectory. The following lemma and theorem are self-evident.

Lemma 1. $\prod_{i=1}^{n} P_i \leq \prod_{i=a}^{b} P_i$ for $1 \leq a \leq b \leq n$, where the $P_i$s are all probabilities, i.e., $0 \leq P_i \leq 1$. Theorem 2 follows.

Theorem 2. If a complete trajectory’s probability exceeds τ, any sub-trajectory of this trajectory will have a probability exceeding τ.
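A minimal Python sketch of the resulting loss-recovery filtering (hypothetical names and data layout): each candidate VID keeps a running product of its per-frame V-M matching probabilities, candidates falling below the learned threshold τ are pruned, and by Theorem 2 a candidate whose full trajectory would exceed τ is never pruned prematurely.

    def recover_vid(frames, tau):
        # frames: iterable of dicts {vid: per-frame V-M matching probability for the lost MID}
        running = {}                                  # vid -> product of probabilities so far
        for probs in frames:
            for vid, prob in probs.items():
                running[vid] = running.get(vid, 1.0) * prob
            running = {v: p for v, p in running.items() if p >= tau}   # prune below tau
            if len(running) == 1:                     # only one candidate left: recovered
                return next(iter(running))
        return None                                   # not yet resolved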

3.1.3 Implementation and Evaluation

In this section, we present our system implementation, report our experimental results for the VM-Tracking system, and show the system's performance in terms of accuracy and time.

A. Implementation

The implementation of our system has two main components: a front end that collects and transmits video images (V) and motion sensor data (M); a back end that receives V and M data and performs human tracking.

We use commodity cameras (D-LINK DCS-930L) to monitor the area of interest.

We set the V data rate to 3 frames per second, a deliberately low video rate chosen to demonstrate the efficiency and robustness of our system. We implement an Android application to collect and transmit motion sensor data from users' mobile phones

(Nexus S). The M data rate is set to 5 readings per second. Raw motion sensor data contains substantial noise that causes error accumulation and tracking derailment. In our system, we apply Google's Rotation Vector Sensor denoising technique, based on the rotation vector sensor on mobile phones, to improve the accuracy of the M data.

Our back end server is equipped with an Intel Xeon E5 CPU, two NVIDIA GTX

760 GPUs, and 8 GB of memory. The received M data are inserted into an OurSQL database. For the V data, we perform human detection on the frames and compute

physical locations of the detected VIDs with a state-of-the-art algorithm called DPM [22].

Then, VM-Tracking is executed based on the V and M data in the database, following the workflow discussed above. Finally, we implement a Java GUI that shows video frames live with the human objects marked with their associated EIDs.

Next, we highlight three key components we implemented for system efficiency:

- Processing Modules Pipelining. We pipeline the whole tracking process where every module greedily starts working once its inputs are ready. The real-time performance depends on the slowest module (which is usually the human detection).

If the slowest module completes within the V and M data input interval (1/3 second), we say our system is real-time. In addition, the tracking delay of our system is determined by the critical path of the tracking process, which in practice lies along the V data processing modules.

- GPU Accelerated Human Detection. We utilize GPU to accelerate human detection. Specifically, we accelerate three steps of the human detection algorithm

(i.e., DPM in our system): pyramid building, feature generation and filtering. These three steps take 99% of the overall detection time. We notice that these steps are parallelizable, so we can take advantage of GPUs on parallel computing.

- Electronic Check-point. We employ electronic check-points to associate an

MID with a VID to initialize tracking, based on the observation that when a person approaches a wireless sensor (e.g., an AP), say less than 1 meter away, the RSSI readings of his mobile phone are strong and robust to environmental noise. We therefore directly associate that phone with the person our cameras observe standing near the AP. The small area around the AP becomes a check-point.
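A small Python sketch of the check-point association follows; the RSSI threshold, radius and function names are illustrative assumptions, not values used by the system.

    def checkpoint_init(rssi_readings, detections, ap_xy, rssi_min=-45.0, radius_m=1.0):
        # rssi_readings: {mid: rssi in dBm} heard at the AP; detections: {vid: (x, y)} from cameras.
        strong = [mid for mid, rssi in rssi_readings.items() if rssi >= rssi_min]
        near = [vid for vid, (x, y) in detections.items()
                if ((x - ap_xy[0]) ** 2 + (y - ap_xy[1]) ** 2) ** 0.5 <= radius_m]
        if len(strong) == 1 and len(near) == 1:       # pair only when the situation is unambiguous
            return {strong[0]: near[0]}
        return {}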

B. Experiment Setup

Figure 3.5: Tracking scenario snapshot from three camera views (red boxes represent occluded persons)

We have set up two different test fields to evaluate our system: a large lobby

(Figure 3.6) and a small office. The large one has a complex background and non-uniform illumination; six cameras were deployed to cover the whole area. The small one has static lighting, with two cameras deployed. We set up electronic check-points at the entrances of these two test fields. Five participants took part in our tracking experiments. Each participant carried an Android phone in hand or in a pocket and walked around freely in the experimental area. We also had several other persons emulate non-users, walking around without mobile phones.

We evaluated our system using two metrics: tracking error and processing time.

The tracking error is defined as the average distance between a user’s trajectory determined by our system and his ground truth trajectory. Due to human detection box drift, there is a small error of the visual localization, which is around 0.35 meter according to our measurement on data samples. The processing time has two aspects: processing delay and update interval. The processing delay is the time that our system

Figure 3.6: Our primary test field in a building lobby

spends to update a user’s location after the user moves. The update interval refers to the time between two consecutive location updates.

We compared three tracking algorithms: our system (VM), appearance-free tracking (V-AF) and Incremental Visual Tracking (V-IVT) [71]. The VM algorithm is our integrated visual-motion sensing tracking algorithm. The V-AF algorithm is the physical-location-based, appearance-free visual tracking; it differs from the VM algorithm in that it does not use the motion sensor data. IVT is a representative conventional tracking algorithm in computer vision.

C. System Performance under Controlled Settings

In this subsection, we show the influence of scenario parameters on our system’s performance in terms of accuracy and time.

Accuracy: We measure the following factors that could affect our tracking accuracy: user distance, user speed, user number, user appearance and environmental illumination.

User distance      0.6 m     1.8 m     2.4 m
Tracking error     1.08 m    0.38 m    0.35 m

Table 3.1: Tracking error under different user distances

– User distance. User distance affects the occlusion degree of the users. We let two users move along two parallel lines in opposite directions back and forth under one camera’s view. As the distance between the two lines increases, there is less occlusion. The tracking errors are shown in Table 3.1. When two users are very close to each other (0.6 meter distance) with severe occlusion, our tracking system has a tracking error at around 1 meter. This demonstrates the effectiveness of the motion sensor data on helping distinguish two users with their different moving directions.

– User speed. User speed affects the movement of the users between two consecutive frames. We let three users move freely in the large test field with different speeds. The average tracking errors are shown in Table 3.2. We can see that our tracking system never loses track of users moving at slow or normal speeds. The tracking error increases slightly as the user speed increases. This shows that our appearance-free tracking based on physical location proximity works well at normal human walking speeds.

– User number. User number affects the density of the users in the tracking area. We let different numbers of users move freely in both the small and the large test fields. The results are shown in Table 3.3. Our system performs well with a dense human crowd, e.g., 3 users in the small field and 5 users in the large field.

User speed (m/s)   0.5       0.7 (slow)   1.0       1.5 (fast)
Tracking error     0.35 m    0.35 m       0.35 m    0.57 m

Table 3.2: Tracking error under different user speeds

User number (small field)   1         2         3
Tracking error              0.35 m    0.35 m    0.40 m
User number (large field)   1         3         5
Tracking error              0.35 m    0.35 m    0.44 m

Table 3.3: Tracking error with different user numbers

– User appearance and illumination. We performed the experiments in both the small and the large test fields (with different illumination). We let the users change their clothes to make their appearances very similar or very different. The results are shown in Table 3.4. Our system never loses track of the users in any of the four test situations. Note that the 0.35 meter error comes from the visual localization noise. In other words, our system is appearance-free and robust to changes in environmental illumination.

                           Same Appearances   Different Appearances
Uniform Illumination       0.35 m             0.35 m
Non-uniform Illumination   0.35 m             0.35 m

Table 3.4: Tracking error under different user appearances and illumination

User number             1         2         3
Processing delay        374 ms    437 ms    512 ms
Human detection time    159 ms    232 ms    273 ms

Table 3.5: Processing time under different user numbers

Time: We measure the following factors that could affect real-time performance: user number and the coverage of the camera's view.

– User number. User number affects the processing time of human detection, which is the slowest module in our VM tracking procedure. Table 3.5 shows the overall processing delay and the human detection time under different user numbers. As the user number increases, both the processing delay and the human detection time increase.

But their values remain small. As discussed before, as long as the slowest module completes within the frame interval (1/3 second at 3 fps), our tracking system is real-time.

– Coverage of cameras. The depth of the area covered by the camera directly determines how much a person's size varies in the image, which affects the range of pyramid levels selected in motion-sensor-accelerated human detection. The time performance under different camera views is shown in Table 3.6.

The smaller the covered area, the less the human scale varies in the image, and thus the lower the time cost.

D. System Performance under Realistic Settings

Next, we show the performance of our system with realistic experiment settings.

All the experiments were conducted in the large test field.

Single User Tracking: For the single user tracking experiment, we let one user walk around in the large test field for 5 minutes. We purposely let another two non-users

Camera view             Narrow    Wide
Processing delay        371 ms    480 ms
Human detection time    157 ms    276 ms

Table 3.6: Processing time under different camera views

         1-user tracking   5-user tracking
VM       0.42 m            0.43 m
V-AF     2.33 m            3.50 m
V-IVT    4.44 m            3.96 m

Table 3.7: Tracking errors of different tracking algorithms

(Plots of tracking error in meters versus time in seconds, comparing the instant and average errors of VM and V-AF.)
(a) Single user tracking    (b) Five user tracking

Figure 3.7: Tracking errors of VM and V-AF tracking

to walk through the test field. We repeated this experiment 5 times. The tracking errors are shown in Figure 3.7(a).

        1-user tracking   5-user tracking
VM      475 ms            472 ms
V-AF    579 ms            654 ms
V-IVT   870 ms            928 ms

Table 3.8: Processing delay of different tracking algorithms

(Left: processing time in ms versus elapsed time in seconds for data collection, human detection, V-M association and the total. Right: cumulative percentage of successful tracking time for VM, V-AF and V-IVT.)

Figure 3.8: Processing time of VM tracking (five user tracking)

Figure 3.9: CDF of successful tracking time

Our system is able to continuously track the user. There is only one tracking loss in the trace shown in Figure 3.7(a), which happened at around the 245th second. This means we achieve almost 100% correct tracking during the 5 minutes. Furthermore, the tracking loss was recovered within 5 seconds. The overall average tracking error is only 0.42 meter. In contrast, we observe more losses in the V-AF tracking, and its recovery usually takes more time. The V-AF tracking may lose track entirely, for example at the 240th second. Our VM tracking, by comparison, can always recover from tracking losses thanks to the diversity of movement of different persons. The V-IVT performs the worst: it usually got lost

within about 20∼30 frames (about 10 seconds) and could not recover from the tracking losses.

For the time performance, we summarize the processing delay in Table 3.8. The improvement by utilizing motion sensor data is about 100 ms. Our VM tracking is real-time. The detailed processing times of individual tracking components are similar to the multiple user tracking case (as shown in Figure 3.8). We will discuss it in detail later.

Multiple User Tracking: For the multiple user tracking experiment, we let 5 users walk around in the large test field for 5 minutes. The area was dense with 5 users, and some occlusions are shown in Figure 3.5. We repeated this experiment 5 times.

The tracking errors are shown in Figure 3.7(b).

We can see that more tracking losses happened in the 5-user tracking case. However, our VM tracking still performs very well: the overall average tracking error is similar to that in the single user tracking case. In contrast, the performance of the V-AF tracking degrades significantly as the number of tracked users increases from 1 to 5. We conclude that our motion-assisted tracking system is much more robust with a dense human crowd. We also show the cumulative distribution function of the successfully tracked time (i.e., the time of the first loss) of V-IVT,

V-AF and our VM tracking in Figure 3.9. The VM algorithm has a much longer successful tracking duration. Also note that even after a tracking loss takes place, VM has the ability to recover.

The time performance of our VM tracking is shown in Figure 3.8. First, the processing delay is relatively stable during the 5-minute period, with an average value of 472 ms. Second, the slowest component (human detection) stays below 300

ms, which is smaller than the frame interval of 333 ms. With our pipelined processing mechanism, our VM tracking is real-time. In contrast, we observe that the V-AF tracking suffers from the dense crowd and that V-IVT has the highest processing delay.

3.2 EV-Matching: Matching Human Identities on Large Electronic and Visual Data

In this section, we present our large scale human tracking system called EV-

Matching. This is joint work with Dr. Gang Li: we defined the problem of EV-

Matching together and differ in algorithm design. Dr. Li designed the algorithm for the ideal setting, and I extended and improved it for the practical setting. As my work is based on

Dr. Li's work, I will present the same overview and problem definition as Dr. Li's dissertation [44], briefly introduce his algorithm, and then present my own work. Finally,

I will demonstrate the parallelization and evaluation of the work.

3.2.1 Overview

Visual surveillance is a major way to monitor people's locations, behaviors and activities. Surveillance cameras are widely deployed in public places, especially in safety-sensitive areas, and the number of cameras is huge. According to reports, there is one surveillance camera for every 14 people in Britain, and at least 24,000 cameras have been installed in Chicago [28]. Surveillance videos have become one of the biggest sources of big data we need to deal with [35].

Electronic data also plays an important role in surveillance. Currently, there are roughly 6 billion active cell phones in the world [65]. Almost every person has one or multiple electronic devices, including mobile phones, tablets and laptops. These devices constantly emit electronic signals to communicate with network infrastructure.

39 These signals carry the holder’s electronic identity (EID), such as IMSI for GSM,

UMTS and LTE, WiFi MAC and Bluetooth ID which can be utilized to locate and

track the device holder in surveillance systems.

For surveillance applications, one of the major problems is how to precisely de-

termine human objects’ identities among different surveillance scenes. Traditionally,

visual (V ) data and electronic (E) data are processed separately for such purpose.

However, V data and E data are imperfect on their own for information gathering and retrieval. Yet they are highly complementary to each other for surveillance purposes.

V data is intuitive and accurate, but searching through massive videos, either by human operators or with computer vision techniques, is inefficient. Compared with V data, E data is relatively lightweight, and electronic surveillance based on localization and tracking is easier to carry out. However, the propagation of E signals

is highly dependent on a good transmission environment, and the range error of E localization is relatively large. Given the characteristics of E and V data, we can combine the best of both worlds for efficient large-scale surveillance.

In this dissertation, we study the combination of E data and V data for better surveillance. Informally put, we want to find the visual images (visual identity or

VID) of a person carrying a specific mobile device in massive surveillance videos.

For example, a crime happened and the police have the EIDs appearing around the crime scene when it occurred. They want to figure out the activities of these EIDs’

holders in surveillance videos over previous months in order to find the suspects. A

straightforward way is to ask the service provider such as the telecommunications

company to provide a photo linked to an EID and then search through the surveil-

lance videos. However, not all EIDs, such as WiFi MACs, are registered. Technically

40 we intend to match the EIDs in electronic location logs with VIDs in surveillance

videos. With this matching, we are further able to fuse these two big and hetero-

geneous datasets, and retrieve the E and V information for a person at the same

time with one single query. We call this EV-Matching. Spatiotemporal information

is what these two datasets share in common and what we utilize to perform the fusion.

However, electronic locations collected with wireless positioning are easy to

process but inaccurate. Camera positioning is accurate but computationally intensive.

It is impractical to find a person by processing all the videos. To deal with these

two EV datasets, which differ greatly in quality and processing cost, a two-stage

E-filtering and V-identification strategy was proposed in [78] to reduce the visual processing burden and obtain accurate matching results. However, this method only works on single EID-VID matching and does not extend to fusing big EV datasets.

Naive parallelization of this algorithm to make multiple machines work on different

EID-VID pairs brings no further benefit because of duplicated computation during

the matching.

We design distributed algorithms to streamline efficient and accurate EV-Matching.

We develop the EID set splitting algorithm to reduce the processing burden of V

data. The E location dataset is firstly processed and a fraction of video frames where

each EID may appear are selected. Then the common VIDs appearing in the video

frames filtered in the previous stage will be figured out as the matched VIDs of the

corresponding EIDs. We call this stage VID filtering. With EV-Matching, multiple

EID-VID pairs can be matched at the same time. In fact, a large portion of the

video frames selected with the EID set splitting algorithm can be reused for other

41 EIDs. This will further reduce the processing burden of V data. We extend the

whole procedure to a distributed version and implement it with MapReduce.

Moreover, our EV-Matching algorithm also supports elastic matching sizes. This

is very important for surveillance because the EID-VID range of interest may be

different, from a single person to the entirety of people in a city. With this algorithm,

we can find the VID corresponding to one specific EID without matching other EIDs and VIDs. We can also choose to match multiple EID and VID pairs simultaneously,

or even achieve universal dataset matching. Universal matching is the extreme case,

which labels each VID in the entire video collection with its corresponding EID.

After universal labeling, it will be more efficient to do future queries because all the

EV raw data has been processed and indexed. Note that the larger the matching size

is, the less time it costs per EID-VID pair.

Though our goal is to explore the problem of fusing large surveillance visual data

and electronic data, we start with a simplified and clean model. Our EV-Matching

algorithm assumes that the EID and VID are both readily available for the person of interest. However, such an assumption does not always hold in real-world environments. For example, some people do not carry any electronic device and thus have no EIDs. Also,

due to occlusion or imperfect vision algorithms, we may fail to extract the VIDs for

some people. In such situations, we adapt our algorithm and try to get the best

matching results. Although human intervention may be involved, our EV-Matching

can still shoulder much of the human workload.

In short, we have the following key contributions:

• We design the EV-Matching algorithm to fuse big EV datasets for efficient large-scale surveillance. EID set splitting is developed to reduce the visual processing burden.

• The whole matching procedure is extended to the MapReduce framework. We utilize the mechanism of (key, value) shuffle in MapReduce to implement the key operation in our algorithms. This design not only improves time efficiency in a parallel way, but also reduces the visual processing burden because of the overlap in filtered video frames.

• Our algorithm supports single, multiple and universal EID-VID matching. Multiple EIDs can be matched to their corresponding VIDs simultaneously. Even universal labeling can be performed in an efficient way.

• We implement our algorithm on Apache Spark [4]. Large-scale evaluations are conducted on a 14-node cluster and the results demonstrate the matching accuracy and efficiency of our algorithm.

3.2.2 Problem Definition

Electronic data and visual data are pervasive and both of them can be leveraged to capture people’s positions in public places. For example, base stations can sense and estimate a mobile phone user’s location, which can be called E-Location. Meanwhile,

public cameras can capture human figures and estimate their real-world locations,

which can be called V-Location. Within a period of time, such as one day, one EID’s

E-Locations accumulate and an entire E-Trajectory is generated. V-Trajectory is a

linkage of the V-Locations of a single person with human re-identification or visual

tracking methods. One person thus has a single E-Trajectory, provided the person uses only

one phone during this period, but typically multiple V-Trajectory segments, because of

occlusions and appearance variations. If we want to match some EIDs with their

corresponding VIDs, we need to associate the E-Trajectories and V-Trajectories of a group of people at large temporal and spatial scales.

Next, we describe the data we consider in this dissertation, and the problem we are trying to deal with.

A. Data Description

• Raw E-Data: It contains EIDs (e.g., IMSI, WiFi MAC) and timestamps when these EIDs were captured. E-Locations can be estimated using the positions of the devices or base stations that capture these EIDs, or using other localization methods if more information is available, such as electronic signal strength.

• Raw V-Data: Basically, it contains timed video data. VIDs and their locations can be extracted from videos. Such extraction can be very time intensive.

B. Problems

We have a large amount of E-Data and V-Data generated by the same group of people across different time periods within a large area. We are trying to deal with the following problems:

• Matching: How to match the same person's EID and VID. To begin with a clean matching model, we have the following assumptions:

1. VID consistency: In a period of time, we can successfully extract the

same VID with some methods (such as appearance similarity) with a high

probability.

44 2. VID completeness: When the EID of a person is recorded, we assume the

VID of the same person is also in the V dataset.

From the first assumption, we can connect the V-Locations of the same person

to generate the V-Trajectory. Meanwhile, we have each EID’s location records

at different timestamps. This kind of spatialtemporal information is able to dis-

tinguish one EID’s large-scale trajectory from the others’ (two people are rarely

at the same position all the time). Based on this and the second assumption,

we can find the VID which is most likely to have the same trajectory with this

EID, and matching is accomplished.

• Parallel Processing: How to parallelize the matching procedure to match multiple people at one time on a large-scale EV dataset. This is also a big data

problem. We aim to implement our algorithm in the MapReduce framework, which

is one of the most popular frameworks for big data processing. When using

MapReduce, we need to design proper map and reduce functions to fit our match-

ing algorithms into the distributed framework.

The problems of matching and parallel processing are addressed in Sections 3.2.3 and 3.2.4, respectively.

3.2.3 EV-Matching Algorithm

In this section, we first explain how we try to tackle this Matching problem. Next, we propose algorithms for practical settings, followed by necessary analysis.

A. Preliminaries

Directly extracting all information such as E-Locations and V-Locations for matching is too time consuming. To mitigate this problem, we propose

to use "rough" E-Locations and V-Locations to select a small portion of the whole data, and to perform detailed extraction only on that portion. To this end, we introduce the concept of EV-Scenario.

Figure 3.10: EV scenario

Definition 1. EV-Scenario is a snapshot of the EID and VID sets appearing in a

specific spatial region at a single time point. It is comprised of E-Scenario (with EID

set only) and V-Scenario (with VID set only).

We divide the whole spatial region into a bunch of smaller regions that we call

scenarios. A scenario can be the region covered by one camera, the region of one room

covered by several cameras or even a hexagonal cell if we generate the view of the

whole region by combining the views of all cameras and divide it uniformly as shown

in Figure 3.10. As described earlier, all E-Data and V-Data are embedded into

EV-Scenarios. At a single time point, EIDs appearing in one scenario make up an

E-Scenario. An E-Scenario’s corresponding V-Scenario is the set of VIDs appearing

simultaneously in the same scenario.

An E-Scenario together with its corresponding V-Scenario comprise an EV-Scenario.

After introducing the concept of EV-Scenario, matching EID and VID by comparing the whole E-Trajectory and V-Trajectory becomes matching the "sub-trajectories" of EID and VID. That is, we look for a list of EV-Scenarios such that only one EID and

one VID appear in all these EV-Scenarios, and we can match them safely.

Algorithm 1 Algorithm for the ideal setting

MAIN()
    ES = SetSplit({Ueid}, Ssce)
    return VFilter(ES)

SetSplit(Peid, Ssce)
    while ‖Peid‖ < n and Ssce ≠ ∅ do
        select C from Ssce
        Ssce = Ssce \ {C}
        Peid = SplitBy(Peid, C)
        if Peid changes then
            record C
        end if
    end while
    return the recorded Cs

SplitBy(Peid, C)
    P′ = ∅
    for A ∈ Peid do
        A′ = A ∩ C
        P′ = P′ ∪ {A′, A \ A′}
    end for
    return P′
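A direct Python transcription of Algorithm 1 may help; it assumes the E-Scenarios are given as (set ID, EID collection) pairs, and the function names mirror the pseudocode rather than Dr. Li's implementation.

    def split_by(partition, scenario):
        # Split every block of the partition into its intersection with, and difference from, the scenario.
        result = []
        for block in partition:
            for part in (block & scenario, block - scenario):
                if part:
                    result.append(part)
        return result

    def set_split(all_eids, scenarios):
        # Select E-Scenarios until every EID forms its own block (ideal setting).
        partition, used = [set(all_eids)], []
        for sce_id, eids in scenarios:
            if len(partition) >= len(all_eids):
                break
            refined = split_by(partition, set(eids))
            if len(refined) > len(partition):         # this scenario actually refined the partition
                partition = refined
                used.append(sce_id)
        return used, partition

The recorded scenario IDs in used then determine which V-Scenarios VFilter has to process.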

However, searching for such a list of EV-Scenarios for each person separately may

not be efficient enough because extracting information, especially VIDs, is very time

consuming. If we can reuse EV-Scenarios for different people, the efficiency will be

improved. The basic idea is that one EV-Scenario can help distinguish all people

inside the EV-Scenario from all those outside. So one V-Scenario has the potential

47 to be reused by all EIDs contained in the corresponding E-Scenario. And we only

need to process this V-Scenario once.

Another idea to further reduce the number of EV-Scenarios we need: VIDs that

have already been matched may help distinguish those that remain unmatched. For

example, consider two such E-Scenarios: one contains EID1 and EID2, and the other contains only EID1. Matching EID1 and VID1 is simple because the only VID, VID1, in

the second V-Scenario can be safely matched to EID1. To match EID2, we can get its corresponding VID2 from the first V-Scenario by ruling out VID1 that appears

in the second V-Scenario.

Ideally we assume the consistency between EIDs and VIDs. That is, if a person

appears in one scenario, his or her EID and VID are both contained in exactly the same EV-Scenario at that time point.

Algorithm 1 is the algorithm for the ideal setting; details can be found in Dr. Li's dissertation [44]. Next, we present the algorithm for the practical setting.

B. Algorithm for Practical Setting

Practical Settings: In real-world situations, E-Scenarios and V-Scenarios are not necessarily consistent, i.e., the EID and VID of the same person may not be in the same EV-Scenario. The major practical issues are as follows:

• Drifting EID: Some EIDs may appear in wrong E-Scenarios (neighboring cells) because of electronic noise. This is especially likely for people who are actually

located near the boundary of a scenario.

• Missing EID: People who do not carry any electronic device have no EID. This results in additional VIDs appearing in V-Scenario lists

when matching other EIDs with their corresponding VIDs.

• Missing VID: Due to occlusion and missed detection, we may fail to extract the VIDs corresponding to an EID from some V-Scenarios.

In this setting, the ideal-setting algorithm cannot be applied directly; modifications are needed. We propose a vague zone in EID set splitting to deal with drifting EIDs, and a matching refining process to handle missing EIDs and missing VIDs.

EID Set Splitting: Due to electronic noise, some EIDs may be slightly out of the

scenarios which they should be in. To tackle this problem, we introduce vague zones

in the scenarios. As shown in Figure 3.11, the area of a scenario is divided into two

parts: inclusive zone (the region far from the border) whose EIDs are considered

confidently included in this scenario, vague zone (the region near the border) whose

EIDs are also included and are marked as vague (which means that it is not sure if the

EIDs should appear in this scenario). The area outside the vague zone (outside of the

scenario) is denoted as exclusive zone whose EIDs are excluded from this scenario.

Accordingly, we slightly modify the definition of EV-Scenario by extending one

single time point to a certain period of time. Specifically, we count the occurrence

of EIDs within this time period. The EIDs that appear most frequently are considered to be in

the inclusive zone, those that appear moderately often are considered to be in the vague zone,

and those that appear only occasionally are considered to be in the exclusive zone. Each EID

within an EV-Scenario is associated with an attribute value which is either inclusive

or vague indicating whether the EID is in the inclusive zone or vague zone.
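A small Python sketch of this attribute assignment follows; the occurrence-ratio thresholds are purely illustrative assumptions, since the dissertation does not specify concrete values.

    def classify_eids(occurrences, period_len, hi=0.8, lo=0.2):
        # occurrences: {eid: number of time points the EID was heard in this scenario}
        attrs = {}
        for eid, count in occurrences.items():
            ratio = count / period_len
            if ratio >= hi:
                attrs[eid] = "inclusive"              # appears most of the time
            elif ratio >= lo:
                attrs[eid] = "vague"                  # appears only part of the time
            # below lo: treated as exclusive and left out of this EV-Scenario
        return attrs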

Next, we describe our algorithm. Intuitively, we should try to avoid using EV-

Scenarios in which the target EID lies in the vague zone to distinguish that EID. So when performing set splitting, we should not simply split one set into two sets as previously

Figure 3.11: Vague zone of a scenario in the spatial domain

mentioned. In fact, we focus on splitting the EIDs that are inclusive both in the original set and the given E-Scenario. One example is shown in Figure 3.12.

Figure 3.12: An example of algorithm for practical setting

VID Filtering: After EID set splitting, we get a list of effective E-Scenarios for each EID. Due to the introduction of the vague zone, the corresponding VID may or may not show up in the corresponding V-Scenarios. Notice that for the list of effective

E-Scenarios of a given EID, this EID only appears in the inclusive zone or the exclusive zone. We can use the same similarity measurement as in the ideal setting.

However, we may not get an acceptable result in one run because of the practical issues mentioned above. We need a refining mechanism.

Matching Refining: VID missing is the most challenging problem for VID

filtering. The performance depends on the missing rate. If the missing rate is 0, the problem reduces to the ideal setting; but if the missing rate is relatively high, VID filtering may fail to find the right VID. To handle this, we collect the corresponding EIDs of these undistinguished VIDs and go through EID set splitting and VID filtering again to refine the result until it is acceptable. However, if the missing rate is too high, human intervention may be required to manually find the right VID.

Algorithm 2 Algorithm for matching refining

MAIN()
    Peid = {Ueid}
    while true do
        ES = SetSplit(Peid, Ssce)
        MM = VFilter(ES)
        if MM is acceptable then
            break
        end if
        update Peid
    end while
    return MM

C. Algorithm Analysis

Here we analyze the correctness and complexity of the EID set splitting algorithm for the practical setting.

Since there may be multiple vague EIDs in multiple sets, the whole EID distinguishing process slows down. More E-Scenarios are needed to distinguish EIDs, depending on the percentage of vague EIDs.

Correctness of the algorithm is proved as follows.

Theorem 3. In the practical setting, with the E-Scenarios recorded by the end of Algorithm 2, all EIDs can be distinguished in order if there are adequate EV-Scenarios generated from the EV datasets.

The proof is similar to that for the ideal setting, except for how one effective E-Scenario splits one node: the left child contains the EIDs appearing in the E-Scenario, with the inclusive attribute (if they are inclusive in both the E-Scenario and the original node) or the vague attribute (otherwise), while the right child contains the remaining EIDs with their original attributes, together with the EIDs appearing in both the E-Scenario and the original node marked with the vague attribute.

Now we analyze the efficiency of the algorithm.

Theorem 4. In the practical setting, between $\log(n)$ and $n^2$ effective E-Scenarios are adequate to distinguish $n$ EIDs.

• Lower bound: Similar to the ideal setting.

• Upper bound: In the worst case, we need n effective E-Scenarios to distinguish each EID. So $n^2$ effective E-Scenarios are sufficient to distinguish all n EIDs.

3.2.4 Parallelization with MapReduce

In the large-scale EV-Matching problem, the dataset usually has a large spatiotemporal span and a large volume, especially for the visual data. We propose to

Figure 3.13: EID set splitting workflow

use MapReduce, one of the most popular frameworks for big data processing, to parallelize the EV-Matching algorithm. Both the EID set splitting and VID filtering algorithms are implemented in the MapReduce framework.

A. MapReduce Basics

MapReduce is a programming model for processing large-scale datasets in a distributed way across multiple machines. Its execution process has four stages: split, map, shuffle and reduce. The input data is first split into smaller chunks and sent to different machines (mappers). During the map stage, each chunk is transformed into a (key, value) pair based on the logic defined in the map function. Then all the

(key, value) pairs from all mappers are shuffled, sorted and grouped.

In the reduce stage, a few machines (reducers) aggregate the shuffled pairs with the same key and output the final results. During the entire process, all data are stored in an underlying distributed file system. Data assignment, Map/Reduce task scheduling, and task failure recovery are managed by a master machine.

B. Parallelization of EID Set Splitting

The set splitting described in Algorithm 1 is a single-thread procedure and inefficient, because in each iteration only one scenario is selected and set splitting affects

merely the EIDs contained in that scenario. We consider fetching a number of scenarios in each iteration and splitting the set more efficiently in a parallel way.

To parallelize the set splitting, we exploit the shuffle of (key, value) pairs in

the MapReduce framework to implement the set intersection operation between EID partitions and E-Scenarios. We break the procedure into four steps: preprocess, map, reduce and merge. Algorithm 3 defines the functions for the four steps of one iteration. The workflow of the whole procedure is shown in Figure 3.13.

Algorithm 3 Algorithm for one iteration

Preprocess
    Input: List<EID> eidlist, EIDPartition eidpart, List<E-Scenario> escelist
    eidsetlist1 ← filter escelist by a random timestamp
    eidsetlist2 ← filter eidsetlist1 by eidlist
    eidsetlist3 ← integrate eidsetlist2 with eidpart
    return eidsetlist3

Map
    Input: EIDSet eidset
    for each eid in eidset do
        emit(eid, EIDSetID)
    end for

Reduce
    Input: EID eid, List<EIDSetID> eidsetidlist
    emit(eidsetidlist, eid)

Merge
    Input: List<EIDSetID> eidsetidlist, List<EID> eidlist
    emit(eidsetidlist, eidlist)

• Preprocess: The inputs for preprocess are a list of EIDs which need to be matched

with their VIDs, an EID partition which could be the set {Ueid} or the result

from the previous iteration, and a list of E-Scenarios which contains untouched E-

Scenarios in the database. Note that each E-Scenario is actually an EID set

and it has a unique set ID and a timestamp. We randomly choose a timestamp

and select all the E-Scenarios with this timestamp.

Then we further filter out the E-Scenarios which do not contain any of the EIDs

which need to be matched. The remaining E-Scenarios are integrated with the

latest EID partition into a complete input for the next step.

• Map: The input of the map function is an EID set, which is an element of the output from preprocess. We use each element of the EID set as the key and

the set ID as the value. Such (key, value) pairs are emitted as the output of

mappers. After this stage, all the (key, value) pairs with the same EID as the

key will be shuffled to the same reducer.

• Reduce: The reducer receives multiple EID set IDs and a single EID which appears in the intersection of these sets. It directly emits a (key, value) pair

with the EID set IDs as the key and the received EID as the value. All these

pairs from many reducers will be aggregated by the key and sent to the merge step.

• Merge: The merge function is similar to the reduce function. It receives a list of EID set IDs and a list of EIDs which is the intersection of those sets. In fact,

this intersection is just one element of the new partition. The merger emits the

set IDs and their intersection as the output. Then all the intersections from all

mergers are collected to form a new EID partition. The set IDs of the scenarios

which contributed to the new partition are recorded.

So far, one iteration completes and a finer partition of the EID universal set can be fed into the next iteration. After all the EIDs are distinguished, which means each of these EIDs makes up one set by itself in the partition, the EID splitting stage completes and VID filtering is triggered.
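A compact PySpark sketch of one such iteration follows (variable names and the toy input are illustrative; it assumes the preprocess step has already produced eid_sets as a list of (set ID, EID collection) pairs):

    from pyspark import SparkContext

    def one_iteration(sc, eid_sets):
        rdd = sc.parallelize(eid_sets)
        # Map: emit (eid, set_id) so that pairs sharing an EID meet at the same reducer.
        by_eid = rdd.flatMap(lambda kv: [(eid, kv[0]) for eid in kv[1]])
        # Reduce: each EID now carries the IDs of every set containing it.
        eid_to_sets = by_eid.groupByKey().mapValues(lambda ids: tuple(sorted(ids)))
        # Merge: EIDs sharing the same combination of set IDs form one block of the new partition.
        blocks = eid_to_sets.map(lambda kv: (kv[1], kv[0])).groupByKey().mapValues(list)
        return blocks.collect()

    if __name__ == "__main__":
        sc = SparkContext(appName="eid-set-splitting-sketch")
        demo = [("S1", ["e1", "e2"]), ("S2", ["e2", "e3"])]
        print(one_iteration(sc, demo))    # e1 -> ('S1',), e2 -> ('S1', 'S2'), e3 -> ('S2',)
        sc.stop()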

C. Parallelization of VID Filtering

After EID set splitting, a list of E-Scenarios is selected for each EID, and the corresponding V-Scenarios are then selected for each EID. For VID filtering, we need to detect human objects in the visual data, extract object features and compare the features to identify the same VID appearing across the V-Scenarios. Such processing can be time consuming. To speed it up, we use MapReduce to parallelize human detection and feature extraction by processing different V-Scenarios on different mappers;

this is possible because these visual operations have no data dependencies. VID features are computed and stored in a distributed storage system. Then we reorganize the processed

V-Scenarios as the input for another MapReduce procedure. The V-Scenarios in the selected list of one EID are conveyed to the same mapper for feature comparison. In this way, VID feature comparisons for multiple EIDs are performed in parallel.

3.2.5 Evaluation

In this section, we evaluate our proposed algorithm on a synthetic dataset in real-world clusters. We will present the results below.

A. Experiment Setup

We set up a cluster with 14 machines. Each machine has a four-core (2.4 GHz) processor, 8 GB RAM and 2 TB storage. All machines are connected via gigabit

Ethernet over a local switch. Apache Spark 1.3.1 and Hadoop 2.7.0 are deployed on

each machine.

In our experiments, we have a database with 1000 human objects each associated

with an EID and a VID. All the VIDs are images of humans from the CUHK02 [47]

dataset. The images are extracted snapshots of humans captured by cameras from

different views. We assign WiFi MAC addresses to the human objects as their cap-

tured EIDs. For EV-Scenario generation, we randomly choose a number of human objects from the database and distribute them across a 1000 m × 1000 m spatial region which consists of several cells. For mobility, we employ the random waypoint

model [15] to control each human object’s movement in terms of location, velocity

and acceleration change.

B. Evaluation Results

We compare our algorithm with EDP, a baseline method proposed in [78]. However, EDP can only handle one EID at a time. For a fair comparison with our parallelized method, we adapt EDP to the MapReduce framework by assigning each mapper one EID matching task. We use two metrics to evaluate the performance of our algorithm, time efficiency and accuracy, both of which are key concerns when dealing with large-scale surveillance data. We have two experiment settings. In one setting, we change the number of EIDs that need to be matched. In the other, we vary the EID density in the region, i.e., the average number of human objects in each cell.

In the evaluation results, we use SS to denote our algorithm since it is based on set splitting.

Matching accuracy is defined as the percentage of the correctly matched EIDs.

Evaluations on ideal settings can be found in [44]. Under practical settings, we

(Plots of matching accuracy (%) versus the number of matched EIDs for E miss rates of 1%, 10%, 30% and 50%.)
(a) SS    (b) EDP

Figure 3.14: Accuracy vs. EID missing

evaluate our matching algorithm and compare it with EDP. We consider two practical settings: EID missing (for example, some people do not carry their phones) and VID

missing (for example, human objects are missed by the detector). In Figure 3.14, we measure

the matching accuracy under different EID missing rates. Generally the accuracy

drops when the EID missing rate rises. However, even when the missing rate goes up to 50%, the matching accuracy is still good at around 85%. Figure 3.15 shows the

(Plots of matching accuracy (%) versus the number of matched EIDs for V miss rates of 2%, 5%, 8% and 10%.)
(a) SS    (b) EDP

Figure 3.15: Accuracy vs. VID missing

matching accuracy under different VID missing rates. We can see that VID missing has a more negative impact on the matching accuracy. However, with matching refining, our accuracy is still above 80% even when the missing rate is as high as 10%, a missing rate far worse than that of state-of-the-art human detection. Our algorithm also yields better accuracy than EDP.

3.3 Summary

In this chapter, we have presented the VM-Tracking system and the EV-Matching system. VM-Tracking is an accurate, real-time human tracking system that integrates visual camera and motion sensor data in small scale environments. We proposed an appearance-free tracking algorithm and a physical-location-based VM fusion algorithm to track visual human objects with the assistance of their motion sensors. They significantly mitigate the dependency on long trajectories, the high computational overhead of video processing, and the occlusion problem, and thereby provide continuous and accurate tracking results. We have conducted comprehensive experiments based on a large scale real-world implementation. The results show the superior performance of our system in terms of time efficiency and tracking accuracy.

We have also studied human tracking in large scale environments. We propose EV-Matching to match heterogeneous EV data according to spatiotemporal information to achieve efficient large-scale surveillance. In the matching procedure,

the EID set splitting algorithm is designed to reduce the amount of visual data to be processed. To cope with the big data problem, MapReduce is utilized to fuse EV data in parallel. Elastic matching sizes are supported by this algorithm, and multiple

EID-VID matchings can be performed simultaneously. We implement the distributed algorithms on Spark and test them on a large synthetic EV dataset. The evaluation results demonstrate the feasibility and efficiency of our algorithms.

Chapter 4: Mobile Robot Tracking in Small and Large Scale Environments

In this chapter, we study the problem of mobile robot tracking in small and large scale environments. First, we present S-Mirror, a novel approach that "reflects" ambient visual signals towards mobile robots and greatly extends their sensing abilities for tracking purposes in small scale environments. Next, we study vision-based large scale tracking for mobile robots. We propose BridgeLoc, a new technique that integrates robots' on-board cameras with infrastructural cameras to track mobile robots in large scale environments.

4.1 S-Mirror: Mirroring Sensing Signals for Mobile Robots in Indoor Environments

4.1.1 Overview

In the near future, we envision that many mobile robots work for or interact with humans in various indoor scenarios such as guided shopping [30], policing [55] and senior care [89]. In these scenarios, mobile robots need fundamental functionalities such as localization and communication [42]. They have on-board sensors such as motion sensors and ultrasound; mobile robots can communicate via Bluetooth, WiFi, and cellular data protocols. However, mobile robots’ sensors are insufficient to realize

these scenarios. Mobile robots' ultrasound sensors are noisy due to environmental complexity. Motion sensors suffer from accumulated errors over time [91]. Thus, infrastructural support is necessary.

There are several possible infrastructures for supporting mobile robots. One possibility is to use heavy infrastructure support, in which sensors such as pan-tilt-zoom cameras [19], passive RFID tags [20] and radio frequency antennas [50] are deployed to cover the entire environment and a powerful central server directly issues commands to mobile agents in indoor areas. However, this places costly demands on the environment and may be infeasible in many cases, such as in legacy buildings [7]. Another possibility is to use a very lightweight solution in which lines, magnetic strips [10], ceiling lights [70, 79], or acoustic anchors [63] are deployed to guide robot motion. Such obtrusive deployment significantly changes the environment's appearance; hence, it is disturbing.

This dissertation presents S-Mirror, our solution that provides infrastructural support for multiple mobile robots in indoor environments. S-Mirror forms a "signal mirror" network of S-Mirror nodes. S-Mirror nodes "reflect" ambient signals toward mobile robots to greatly extend their sensing capabilities. S-Mirror nodes mainly re-

flect visual signals, as they contain rich contextual information, as well as other types of signals such as electronic and acoustic signals, which provides considerable flexibility. Figure 4.1 illustrates S-Mirror in a real-world indoor scenario and Figure 4.2 illustrates S-Mirror's concept of signal reflection in this scenario.

S-Mirror extends mobile robots' sensing capabilities and provides reference positioning information. The infrastructure is lightweight, flexible, and scalable. When mobile robots need to follow persons, S-Mirror nodes reflect views of persons to assist

(Figure content: indoor scenarios labeled Healthcare, Shopping, and Police.)

Figure 4.1: S-Mirror Real-World Scenario

(Figure content: an S-Mirror node reflecting signals toward robots in Healthcare, Shopping, and Police scenarios.)

Figure 4.2: S-Mirror Conceptual Scenario

robot estimation of their movement. If mobile robots lose track of them, S-Mirror's reflected views can help robots relocate them. In addition, S-Mirror scales to support many mobile robots. When reflecting intensive data like visual images, S-Mirror nodes only send changed areas in their views to mobile robots. This reduces communication overhead as mobile robots can maintain their most recent views of areas.

Furthermore, when S-Mirror nodes need to reflect voluminous data to many robots, they broadcast data at once instead of sending duplicate data to each robot individually. Our experimental results show this approach can save up to 90% of network bandwidth consumption. Since S-Mirror leaves decisions on the use of existing infrastructure entirely to individual mobile robots, such flexibility also improves S-Mirror's scalability on the robot side. Mobile robots choose to use their own sensors like motion sensors to localize themselves; they do not depend solely on S-Mirror nodes'

sensing data. Our experiments show that mobile robots avoid high communication

delay while achieving similar localization accuracy.

In summary, our contributions in this work are as follows:

• We propose the S-Mirror system that "mirrors" ambient signals to assist mobile robots with high flexibility and scalability via dynamically switching between

communication modes (i.e., unicast and broadcast).

• To illustrate the advantages of our S-Mirror system, we propose a mobile robot localization approach that integrates the S-Mirror system with mobile robots'

own motion sensors for enabling accurate and timely self-localization.

• We implement the S-Mirror system and our own robot in the real world, and evaluate our system in real-world scenarios using our own robot.

4.1.2 S-Mirror Design

In this section, we present the design of our S-Mirror system. First, we introduce each component of the system and explain components’ cooperation with each other, which forms a lightweight and flexible system. Next, we detail our system’s workflow as well as its advantages. Specifically, it has low computation and communication overhead and high scalability for supporting many mobile robots. At last, we design a mobile robot localization approach that integrates the S-Mirror system with mobile robots’ own motion sensors for enabling accurate timely self-localization.

A. System Components

The S-Mirror system consists of at least one S-Mirror node. No central server is required for coordination and synchronization of its components. As visual images provide direct contextual information about the indoor environment in which S-Mirror

64 is deployed, we design S-Mirror nodes to sense and reflect such images to assist mobile robots’ various needs. As shown in Figure 4.3, each S-Mirror node has a video camera as well as other potential sensors such as electronic, acoustic, and infrared sensors. It performs lightweight processing on sensing data and communicates the data to mobile robots. We explain processing and communication in the next subsection. Based on

S-Mirror nodes, mobile robots obtain visual images from a very broad view, which greatly assists their applications and activities in the environment.

(Block diagram: a sensing layer with visual, electronic, and other sensors; a processing layer with lightweight vision processing, electronic measurement, and other sensor processing; and a communication module.)

Figure 4.3: S-Mirror node module

B. System Workflow

Each component of the S-Mirror system runs independently, which avoids a single point of failure and achieves cost-effectiveness. We design S-Mirror nodes to communicate with mobile robots in an "on-demand" manner: mobile robots request sensing data as needed, and only then do S-Mirror nodes send data to them. Since visual data is very intensive, S-Mirror nodes only send foreground images (i.e., images containing mobile robots, excluding backgrounds). Furthermore, too many mobile robot requests could entail unnecessary duplication of transmitted visual data, which may cause delays

and network congestion. Thus, S-Mirror nodes can broadcast visual data to all mobile

robots in order to reduce communication delay and minimize network bandwidth consumption.

(Flowchart: V sensing feeds a V-DB and visual motion detection; E sensing feeds E processing and an E-DB; when a request is received and a big motion change is detected, a cast mode decision selects unicast of videos or broadcast of E-beacons and videos.)

Figure 4.4: S-Mirror node workflow

Figure 4.4 shows an S-Mirror node’s system workflow. Electronic (E) signals are continuously sensed, processed, and stored in an electronic database (E-DB).

The S-Mirror node periodically broadcasts these signals as well as S-Mirror node configuration information to all mobile robots. In general, electronic signals are much less data-intensive than visual signals. The left side of Figure 4.4 shows the system workflow for visual signals. In the background, the S-Mirror node visually senses the area; the visual data are stored in a database (V-DB). Due to visual data’s intensity,

the S-Mirror node will not send these data to mobile robots until the robots request them.

Once the S-Mirror node receives a request, it performs visual motion detection. If no significant visual motion is sensed, the S-Mirror node sends the "old" image to mobile robots; otherwise, it sends the "new" image. Next, the S-Mirror node decides whether visual data are unicast or broadcast based on the number of requesting mobile robots.

For large numbers of mobile robot requests, unicast entails unnecessary duplication of transmitted visual data, which may cause delays and network congestion. In this case, broadcast is applied for all mobile robots.
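The following Python sketch summarizes this decision logic; the broadcast threshold and function names are illustrative assumptions rather than the system's actual parameters.

    def reply_to_requests(pending_robots, new_frame, old_frame, motion_detected,
                          broadcast_threshold=3):
        # pending_robots: IDs of robots that requested visual data since the last reply.
        frame = new_frame if motion_detected else old_frame
        if not pending_robots:
            return "none", None, []
        if len(pending_robots) >= broadcast_threshold:
            return "broadcast", frame, ["all"]          # one transmission serves every robot
        return "unicast", frame, list(pending_robots)   # answer each requester individually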

C. Mobile Robot Localization with S-Mirror

Localization is a fundamental enabling technology for many applications such as navigation and patrol. To illustrate the advantages of our S-Mirror system, we propose a mobile robot localization approach that integrates the S-Mirror system with mobile robots' own motion sensors to enable accurate and timely self-localization. Our proposed approach effectively addresses the occlusion problem by closely integrating localization via S-Mirror's sensing images and dead reckoning via mobile robots' motion sensors.

When there is a line of sight between the S-Mirror node and mobile robots (i.e., the S-Mirror node can see the mobile robots), mobile robots can easily detect themselves in images capturing them. However, even in the area that the S-Mirror node visually covers, mobile robots may not detect themselves due to occlusion, detection failure, etc. Mobile robots need a strategy to continuously localize themselves between two successful S-Mirror localizations. Motion sensors such as odometers on mobile robots can help address this problem. Specifically, we use dead reckoning with motion sensors combined with basic S-Mirror localization.


Figure 4.5: Localization workflow with motion sensors

The continuous localization workflow with motion sensors is shown in Figure 4.5.

Generally, the basic S-Mirror localization result is reported. However, when basic localization fails, the mobile robot localizes itself using its on-board motion sensors. A common issue is motion sensors' accumulated error drift over time. S-Mirror localization is well suited for periodically calibrating the motion sensors. Whenever S-Mirror localization is available, we calculate the distance and angular offsets between the S-Mirror result and dead reckoning. Until the next successful S-Mirror localization, we deduct these offsets from motion sensor measurements to avoid accumulating error.
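The calibration loop of Figure 4.5 can be sketched as follows. This is an illustrative Python helper under the simplifying assumption of 2D (x, y) poses, not the actual S-Mirror implementation.

```python
def fuse_localization(s_mirror_fix, dead_reckoning_pose, offset):
    """Sketch of the workflow in Figure 4.5 (illustrative helper).

    Poses are (x, y) tuples in meters. `offset` is the drift between dead
    reckoning and the most recent successful S-Mirror fix; it is subtracted
    from subsequent motion sensor estimates to suppress accumulated error.
    Returns (reported_pose, updated_offset).
    """
    if s_mirror_fix is not None:
        # Basic S-Mirror localization succeeded: report it and recalibrate
        # the motion sensors by recording the current drift.
        new_offset = (dead_reckoning_pose[0] - s_mirror_fix[0],
                      dead_reckoning_pose[1] - s_mirror_fix[1])
        return s_mirror_fix, new_offset
    # Otherwise fall back to dead reckoning, corrected by the stored offset.
    corrected = (dead_reckoning_pose[0] - offset[0],
                 dead_reckoning_pose[1] - offset[1])
    return corrected, offset
```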

4.1.3 Implementation and Evaluation

In this section, we present our system implementation, report our experimental results for the S-Mirror system, and show the system's performance in terms of localization accuracy, time, and network bandwidth consumption.

A. Implementation

The implementation of our system has two main components: an S-Mirror system that collects and "mirrors" sensing signals, and a mobile robot system that communicates with S-Mirror and localizes itself based on the sensing data from S-Mirror and its own motion sensors.

As described in Section 4.1.2, the S-Mirror system consists of one or more S-Mirror nodes. In each S-Mirror node, we use commodity cameras to monitor the area of interest and inexpensive laptops to mirror the visual sensing signals.

We design and produce a mobile robot to evaluate our S-Mirror system. Our robot is a three-wheel robot with two back wheels controlling its movement and one front wheel adjusting its direction. Its two back wheels have encoders that count their rotation rates and together serve as a motion sensor. The robot has two LED lights as visual markers that S-Mirror nodes can see and detect clearly. In the future, we will replace the LED lights with unobtrusive markers such as infrared lights in order to minimize disturbance. The robot receives video at 3 frames per second from the S-Mirror node. We choose this low frame rate to demonstrate our system's lightweight nature and robustness.

B. Experiment Setup

We evaluate our system in a real-world environment. It is a three-room indoor area with several corners. We deploy three S-Mirror nodes to visually cover most of the area. Our robot automatically patrols the area following several predefined paths. It localizes itself continuously and adjusts its movement accordingly. Six people walk in the environment and may block the S-Mirror nodes' views.

We use three metrics to evaluate our system: localization error, localization delay, and the S-Mirror system's transmitted data rate. Localization error is defined as the average distance between the path the robot estimates on its own and its ground truth path. Localization delay is the average time for our system (S-Mirror and robot) to localize the robot at each position along its path. Delay has three parts: sensing delay (from the sensors' sensing period), communication delay (from S-Mirror to the robot), and processing delay (on the robot's computer). The sensing delay is set to 40 ms; data are sampled at 25 Hz. We measure the other delays during our experiments. Transmitted data rate is the amount of data sent by all three S-Mirror nodes per unit time; it measures our system's network bandwidth consumption.

We compare the localization approach proposed in Section 4.1.2 (denoted S-Mirror + Motion Sensor) with a virtually centralized S-Mirror system (denoted Centralized S-Mirror) that relies only on S-Mirror sensing data to perform localization. Since Centralized S-Mirror does not need to transmit sensing data to robots or broadcast data, we only compare its localization error and delay to those of our proposed approach.

C. System Performance for Single Point Localization

Distance to S-Mirror Node: First, we evaluate our system via single point localization. We place the robot at different distances from the S-Mirror node. Figure 4.6(a) shows the localization error. The experimental result shows that both approaches have the same error within the visual coverage of the S-Mirror node as they use imagery from it. In this area (2–11 m), the error increases slowly from 0.01 m to 0.5 m.

Figure 4.6: Localization performance at different distances. (a) Localization error; (b) localization delay.

In areas outside visual coverage, Centralized S-Mirror fails as there are no viable sensing data; its error increases greatly. S-Mirror+Motion Sensor maintains reasonable localization accuracy; its average error is 0.33 m.

Figure 4.6(b) shows the localization delay. Within the visual coverage area of the S-Mirror node, the delay of S-Mirror+Motion Sensor is 103 ms. This is 51 ms higher than Centralized S-Mirror due to network communication delay. In areas outside visual coverage, S-Mirror+Motion Sensor's delay drops significantly to ∼40 ms as network communication is unnecessary and motion sensor based localization has <1 ms processing delay. Centralized S-Mirror's delay is similar to S-Mirror+Motion Sensor's as the former still processes images, which incurs ∼12 ms processing delay.

Table 4.1 shows the transmitted data rate. When the mobile robot is under S-Mirror visual coverage, the average rate is ∼174 kBps, which is low as the S-Mirror node only transmits foreground images. When the mobile robot is in areas out of visual coverage, S-Mirror+Motion Sensor does not need images from the S-Mirror nodes. The average transmitted data rate is 1.34 kBps due to S-Mirror nodes' periodic broadcasting.

                          Within Visual Coverage    Out of Visual Coverage
S-Mirror+Motion Sensor    174 kBps                  1.34 kBps

Table 4.1: Transmitted data rate in single point localization

Request Amount to S-Mirror Node: We also evaluate system performance with multiple requests to the S-Mirror node. As we only have one robot prototype, we request images from an S-Mirror node on several laptops. This scenario simulates multiple robots' requests and tests our system's scalability with multiple robots. The localization performance is measured on our robot prototype.

Figure 4.7(a) shows the localization error. The S-Mirror node switches to data broadcast with at least 15 mobile robot requests. S-Mirror+Motion Sensor is very robust to heavy request traffic as motion sensors are always available to use. Figure 4.7(b) shows the delay and confirms this. After an S-Mirror node switches to broadcasting data, its communication delay remains ∼250 ms. S-Mirror+Motion Sensor uses mobile robot sensors for delay reduction. Centralized S-Mirror's delay increases with the number of mobile robots as its computation workload increases.

Table 4.2 shows the transmitted data rate for multiple mobile robots. The S-Mirror node is robust to heavy request traffic. When there are at least 15 mobile robot requests, the node switches to broadcasting data to save network bandwidth and avoid congestion. This mechanism bounds our system's network bandwidth consumption.

Figure 4.7: Localization performance in multiple requests. (a) Localization error; (b) localization delay.

Number of mobile robot requests     1     5     10      15     20
S-Mirror+Motion Sensor (kBps)       174   874   1551    113    113

Table 4.2: Transmitted data rate in multiple requests

Although the data loss rate rises in broadcast mode, mobile robots can use their own sensors to reduce communication overhead and ensure high localization performance.

D. System Performance in Continuous Localization

Continuous Localization without Occlusion: In this experiment, the mobile robot patrols the test area along four different paths. Each path is 40 m long with segments inside and outside visual coverage of the S-Mirror node. No one walks in the area or blocks the view of the S-Mirror node. We measure the average localization error, delay, and transmitted data rate in each 1 m segment along the path.

Figure 4.8: Localization error in continuous localization. (a) No occlusion; (b) six-person occlusion.

Figure 4.8(a) shows localization error. S-Mirror+Motion Sensor’s localization er- ror increases in areas outside visual coverage, especially when turning (e.g., at 9 m).

For Pure or Centralized S-Mirror approach, its localization error drastically increases to above 0.5 meters when our robot enters non-visually covered area.

Figure 4.9(a) illustrates our system's time performance. Our proposed approach has a delay of ∼106 ms, which is efficient for most application scenarios. Time performance without occlusion is similar to the six-person occlusion case. Likewise, the transmitted data rate shown in Figure 4.10(a) is similar to that in the six-person occlusion case. We discuss this in detail later.

Continuous Localization under Occlusion: For experiments under occlusion, six people walk freely in the test area, which creates random occlusions. The mobile robot patrols the area along four different paths. We measure the same metrics as in the occlusion-free experiments.

Figure 4.9: Localization delay in continuous localization. (a) No occlusion; (b) six-person occlusion.

Figure 4.8(b) shows the localization error. Similar to the occlusion-free experiments, S-Mirror+Motion Sensor's error increases in areas outside visual coverage, and its drift error accumulates due to occlusion. However, as the mobile robot and the people are moving most of the time, such occlusions disappear quickly. Thus, the motion sensors' drift error can mostly be calibrated in time, which leads to satisfactory accuracy. The localization accuracy of Centralized S-Mirror is significantly affected by areas outside visual coverage and human occlusion; thus, its localization fails frequently.

Similar to the occlusion-free case, our approach has ∼106 ms localization delay; its delay is robust to occlusions. Figure 4.9(b) illustrates the delay along one sample path. It shows that frequent occlusions actually reduce the delay somewhat. However, the robot still needs to periodically check S-Mirror for occlusion resolution. Thus, the overall average delay remains similar to the occlusion-free case. For the same reason, our approach is also robust to occlusions in terms of transmitted data rate, as shown in Figure 4.10(b). Our approach's average transmitted data rates are 153–170 kBps. They consume limited bandwidth and do not affect normal network communication.

Figure 4.10: Transmitted data rate in continuous localization. (a) No occlusion; (b) six-person occlusion.

4.2 BridgeLoc: Bridging Vision-Based Localization for Robots

4.2.1 Overview

Robots are increasingly pervasive in daily life. In 2014, about 4.7 million service robots were sold for purposes such as healthcare, elder assistance, household cleaning, and public service (police patrols and museum tour guides) with sales projected to reach 35 million units between 2015 and 2018 [2]. In order to serve humans well in a variety of indoor scenarios, mobile robots need to localize themselves accurately, scalably, and cost-effectively.

Recent years have witnessed extensive studies on indoor localization for robots, especially vision-based approaches [21,37,54,60,62,72]. This is partly driven by the wide adoption of robot cameras and/or infrastructural cameras deployed indoors. Moreover, visual sensing provides rich information about the surroundings that greatly eases robots' operations and interactions with humans. Most existing vision-based robot localization approaches use robots' cameras to detect visual features in the environment before localization. However, such solutions suffer from limited scalability and accuracy as they require a large number of unique visual features covering an entire area, which is infeasible for large areas. Some representative approaches deploy artificial visual tags such as QRCode [37] and AprilTag [62] as visual features in the environment due to these tags' high accuracy and reliability. But these tags need to adopt complex visual patterns in order to accommodate a large "pool" of unique tags. Such complex visual patterns shorten effective ranges as it is difficult to detect tags at far distances, which entails more unique tags. Eventually, this "negative feedback loop" limits the applicability of artificial-tag-based approaches to small areas only; it is hard to design enough tags to cover large areas. Other work relies on natural environmental patterns as unique landmarks [21, 54, 60, 72]. Yet such approaches can hardly achieve satisfactory accuracy as natural landmark detection is sensitive to lighting conditions, camera quality, and other practical factors [29]. Worse, natural landmarks are opportunistic and guaranteeing their effectiveness indoors is tricky. For example, there are few natural landmarks in areas such as corridors. We discuss details of these challenges and the limitations of existing localization technologies in Chapter 2.

To tackle these problems, we propose BridgeLoc, a novel vision-based localization system that closely integrates both robots' and infrastructural cameras. Since infrastructural cameras have been widely deployed in the real world, it is natural to incorporate them in robot localization without additional cost. Although infrastructural cameras' views cannot fully cover entire areas due to the complexity of practical indoor environments, they afford us a major opportunity for accurate, scalable, vision-based robot localization that existing work does not achieve. As infrastructural cameras can partially localize robots in their views' coverage areas, this reduces the number of unique artificial visual tags required, which lets us use simpler patterns with longer detection ranges than the patterns in existing work. Longer detection ranges entail fewer tags. Eventually, this "positive feedback loop" enables the application of artificial visual tags in large areas. Thanks to the high localization accuracy of infrastructural cameras and artificial visual tags, BridgeLoc achieves both high accuracy and scalability for indoor robot localization.

Figure 4.11: Illustration of BridgeLoc.

In particular, we design three key technologies: 1) robot and infrastructure camera view bridging, 2) rotation symmetric visual tag design, and 3) continuous localization using robot visual and motion sensing in order to address the fundamental limitations of existing vision-based approaches. We first design an approach that "bridges" robots' and infrastructural camera views to accurately localize robots at far distances. Figure 4.11 shows how BridgeLoc works. We target real-world scenarios where some infrastructural cameras have been deployed but they fail to cover the whole area for indoor localization. Thus, we deploy certain visual tags that can be uniquely identified in infrastructural cameras' views. If the robot lies outside any infrastructural camera's view but a visual tag appears in both views (e.g., at spots A and B), we perform vision-based localization twice and independently derive the relative position between the robot and the visual tag as well as that between the tag and the infrastructural camera. Since the tag is static, it serves as a "bridge" to associate these two relative positions. As artificial tags can be accurately localized by robots' and infrastructural cameras, we accurately obtain the robot's absolute position by bridging both views. Moreover, our view bridging technology does not require full visual coverage of infrastructural cameras as they can partially localize robots via indirect bridges (visual tags). It also reduces the number of unique artificial visual tags without covering the whole area. This lets visual tags use simpler patterns that support longer detection ranges than patterns in existing work. Later, we show that BridgeLoc develops a new tag design that greatly relaxes the constraints of visual tags and further scales our system to large areas using a small number of tags.

We also design a series of rotation symmetric visual tags, which is the key difference between our tags and existing ones [37, 62]. In existing work, visual tags' patterns need to be unique regardless of robots' views, which requires patterns to be asymmetric with respect to rotation. We conduct a careful study of possible tag designs and find that this requirement significantly reduces the number of unique tags when visual patterns are complex. As our visual tags have very simple patterns, we can relax this requirement and design rotation symmetric tags when robot motion is physically constrained (i.e., robot motion must be continuous). This simplifies visual tag patterns without sacrificing the number of unique tags, which eventually enables our system to scale to large areas.

We design and implement a probabilistic approach based on robots' visual and motion sensing to enable continuous localization when robots lie outside infrastructural cameras' coverage areas and cannot see any visual tag. We implement our system in practical settings and build a prototype robot using COTS hardware. We conduct real-world experiments to evaluate our work. Experimental results show that our system achieves 32 cm localization error (on average) in an area over 100 m².

In summary, our contributions in this work are as follows:

• We propose a novel vision-based localization system that integrates robots' and infrastructural cameras closely to accurately localize robots in large areas. We deploy visual tags as bridges to link robots' and infrastructural camera views to each other.

• We propose a novel visual tag design method using rotation symmetric visual patterns, and we extend the number of unique tags. This method is based only on practical constraints of robot motion and visual tag placement in the environment.

• We design a probabilistic approach to address the practical challenge where robots lie outside infrastructural cameras' coverage areas and cannot see any visual tags.

• We implement our system, build a prototype robot using COTS hardware, and conduct real-world experiments to evaluate our work.

Figure 4.12: Vision-based localization using one camera. (a) Localization using a single camera; (b) single camera projection.

4.2.2 Background

In this section, we introduce the background of vision-based localization using a single camera, which is the basis of BridgeLoc.

A single camera captures three-dimensional real-world scenes and converts them into two-dimensional images. This process is called projection. Figure 4.12 illustrates how the real-world coordinates of a three-dimensional point are projected onto the two-dimensional image plane. There are three coordinate systems. The first is the ground coordinate system (G-XYZ), which offers absolute localization positions in universal coordinates. The second is the relative camera coordinate system with the camera center as the origin point (C-XYZ). The $Y_c$ and $Z_c$ axes form the surface plane of the camera and the $X_c$ axis is orthogonal to this plane. The last is the 2D coordinate system for the image captured by the camera (I-UV), where the origin point is set at the top left corner.

Figure 4.12(b) shows a single camera projection in detail following the ideal pinhole camera model [64, 75]. A three-dimensional object (on the left) is projected through the camera lens (in the center) onto the two-dimensional camera sensor (on the right) via a transformation. The lens has focal length $f$; the x-axis indicates depth. The resulting image appears on the image plane, which is formed by the axes $(U, V)$. The camera has center $(c_y, c_z)$ relative to the origin point at the top left corner of the image.

The point $(x_c, y_c, z_c)$ is shown on the three-dimensional object. This point is projected onto the two-dimensional image plane at the corresponding image coordinates $(u, v)$. Similarly, any three-dimensional point is projected onto the two-dimensional image plane.

There are two pairs of similar triangles in Figure 4.12(b). In the upper pair, the left triangle has legs with lengths $x_c$ and $z_c$; the right triangle has legs with lengths $f$ and $u - c_z$. In the lower pair, the left triangle has legs with lengths $x_c$ and $y_c$; the right triangle has legs with lengths $f$ and $v - c_y$. By similarity, it follows that
\[
\begin{cases}
\dfrac{y_c}{x_c} = \dfrac{v - c_y}{f} \\[6pt]
\dfrac{z_c}{x_c} = \dfrac{u - c_z}{f}.
\end{cases}
\tag{4.1}
\]
The first equation comes from the upper pair of similar triangles; the second one comes from the lower pair. In these two equations, the length $f$ is known, and $c_y$ and $c_z$ are the coordinates of the camera center in the image plane.

In practice, $z_c$ also can be determined from the object's height, which is known a priori. The object's height is often available via calibration or sensing, especially in this work where the robot's height is fixed when it "stands" or the height is known when its posture changes. In the simplest case where both the camera and the object (here, a robot) stand vertical to the ground, the height difference $\Delta h$ is the object's z-axis coordinate. We can find $z_c = \Delta h$ from the ideal pinhole camera model shown in Figure 4.12(b). Then it is straightforward to solve Eq. (4.1). As a result, we can localize the object using a single camera view:
\[
\begin{cases}
x = \dfrac{f \cdot \Delta h}{u - c_z} \\[6pt]
y = \dfrac{(v - c_y) \cdot \Delta h}{u - c_z} \\[6pt]
z = \Delta h.
\end{cases}
\tag{4.2}
\]
In reality, the camera may not be installed exactly vertical to the ground. The height information can be acquired via a transformation between the object's real-world coordinate system and the camera coordinate system. This involves rotation and translation, which we elaborate in Section 4.2.3.
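As a minimal sketch of Eq. (4.2), assuming the focal length f and the camera center (c_y, c_z) are expressed in pixels and the height difference Δh in meters (so the pixel units cancel), the localization step can be written as follows; this is illustrative, not the exact implementation.

```python
def localize_from_pixel(u, v, f, c_y, c_z, delta_h):
    """Eq. (4.2): recover the point's 3D position (x, y, z) in the camera
    coordinate system from its pixel coordinates (u, v), the focal length f
    and camera center (c_y, c_z) in pixels, and the known height difference
    delta_h in meters between the camera and the object."""
    x = f * delta_h / (u - c_z)
    y = (v - c_y) * delta_h / (u - c_z)
    z = delta_h
    return x, y, z
```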

Based on the above discussion, a single camera can theoretically localize any object as long as it is captured in the image. The area seen by a camera is called its field of view (FoV). It has two key parameters: depth of field (DoF) and angle of view (AoV). Ideally, a camera can see anything infinitely far away when there are no obstacles in between. However, in reality, when the object is far away from the camera, it becomes too small to recognize. This is constrained by the image resolution, which depends on the lens (focal length) and the sensor (sensitivity). AoV, the angular extent of a given view, is also limited by the focal length and the sensor size. Roughly, AoV = 2 arctan(d/2f), where f is the focal length and d is the size of the sensor in the direction measured [83].
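As a quick numerical illustration (the sensor width and focal length below are hypothetical values, not the specifications of any camera used in this work):
\[
\mathrm{AoV} = 2\arctan\!\left(\frac{d}{2f}\right) = 2\arctan\!\left(\frac{4.8\,\mathrm{mm}}{2 \times 4\,\mathrm{mm}}\right) \approx 61.9^\circ .
\]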

4.2.3 BridgeLoc Design

In this section, we present BridgeLoc’s system design. We first describe the approach of bridging robots’ and infrastructural camera views to localize the robot.

Next, we introduce our novel tag design method, which uses rotation symmetric visual patterns and extends the number of unique tags. Finally, we present our probabilistic approach to enable continuous localization when the robot lies outside infrastructural cameras' coverage and cannot see any visual tag.

83 A. Bridging Two Views

Now we present how BridgeLoc leverages visual tags to bridge infrastructural and robots' camera views together and accurately localize robots at far distances. We focus on the typical case where the robot lies outside any infrastructural camera's FoV but a visual tag appears in both an infrastructural camera's FoV and the robot's FoV, as Figure 4.13(a) illustrates. Table 4.3 describes the notation used in this work. Both the infrastructural camera and the robot's camera see the same visual tag.

By applying the vision-based localization described in Section 4.2.2, we can localize the visual tag individually in the camera coordinate system (denoted C-XYZ) and the robot coordinate system (denoted R-XYZ). Clearly, these two relative positions in two different coordinate systems correspond to the same absolute position in the ground coordinate system G-XYZ. The mapping between each relative position and the absolute position depends on the transformation between the two coordinate systems. Each transformation is determined by the position (and posture) of the robot (unknown) and the camera (known). As a result, we can find the robot's position by mapping the object's coordinates in R-XYZ and G-XYZ (step 3 in Figure 4.13(b)), where the visual tag coordinates in G-XYZ are obtained via another mapping between the C-XYZ and the G-XYZ coordinate systems (step 1 in Figure 4.13(b)). This way, the visual tag serves as a bridge that associates the FoVs of the infrastructural and robot camera views and helps localize the robot from the infrastructural camera view.

Mapping between $V_o^{(C)}$ and $V_o^{(R)}$ through $V_o^{(G)}$: This process involves two transformations: $V_o^{(C)} \to V_o^{(G)}$ and $V_o^{(R)} \to V_o^{(G)}$, where $V_o^{(C)}$, $V_o^{(R)}$, and $V_o^{(G)}$ are the coordinates of the object $o$ in coordinate systems C-XYZ, R-XYZ, and G-XYZ, respectively.

84 " translation C-XYZ In C-XYZ (G) (G) (G) In G-XYZ x y z (G) (G) (G) (C) (C) (C) c c c x y z xo yo zo o o o & rotation R-XYZ = translation ZG (G) (G) (G) xr yr zr YG In R-XYZ (G) (G) (G) (R) (R) (R) xo yo zo G-XYZ xo yo zo & rotation

XG # (a) Illustration of the basic case (b) Localization via tag-assisted transforma- tion

Figure 4.13: Localization in the basic out-of-FoV case.

Each transformation consists of two operations: translation and rotation. For example, consider the transformation from C-XYZ to G-XYZ. Translation displaces each point by a fixed distance in a given direction. When there is no rotation and the X-, Y-, and Z-axes are parallel in the two coordinate systems, we have
\[
V_o^{(G)} = V_c^{(G)} + V_o^{(C)}, \tag{4.3}
\]
where $V_c^{(G)}$ is the absolute position of the camera in the ground coordinate system. Rotation depends on an axis of rotation and the angle of rotation. Let $\theta_x$ be the angle of the Y-Z plane rotated counterclockwise about the X-axis. If we have rotation only about the X-axis, any point $(x, y, z)$ is updated to a new position

x0 1 0 0  x 0 0 V = y  = 0 cos θx sin θx y = Υx(θx) V, (4.4) 0 − · z 0 sin θx cos θx z where Υx(θx) is a rotation matrix [32]. A generic rotation between two coordinate systems is parameterized by three angles θx, θy, and θz. Rotation matrices Υy(θy) and Υz(θz) about the Y- and Z-axes, respectively, are in a similar form as Υx(θx) in

85 Eq. (4.4), except the matrices’ respective axes of rotation.     cos θy 0 sin θy cos θz sin θz 0 − Υy(θy) =  0 1 0 , Υz(θz) = sin θz cos θz 0 . (4.5) sin θy 0 cos θy 0 0 1 − In reality, infrastructural cameras may not be placed perfectly vertical to the ground; thus, both translation and rotation operations occur. We use Θ(C→G) =

(C→G) (C→G) (C→G) (G) (θx , θy , θz ) to represent the coordinate rotation angles and Vc to rep-

resent the movement offset. Therefore, we obtain the mapping

(G) h (C→G) (C→G) (C→G) i (C) (G) Vo = Υx(θx ) Υy(θy ) Υz(θz ) Vo + Vc × × ·

= Υ(Θ(C→G)) V (C) + V (G), (4.6) · o c

(C→G) (G) where Υ(Θ ) and Vc are constants that can be calibrated and known a priori.

Similarly, for the relative position in the robot’s coordinate system, we have

V (G) = Υ(Θ(R→G)) V (R) + V (G), (4.7) o · o r

(G) where Vr is the robot’s position to be calculated. It follows that

V (G) = V (G) + Υ(Θ(C→G)) V (C) Υ(Θ(R→G)) V (R). (4.8) r c · o − · o
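As a minimal numerical sketch of Eqs. (4.4)–(4.8), assuming the camera pose is calibrated and the robot's orientation Θ^(R→G) has already been estimated, the robot's ground position can be computed as follows (the function names are illustrative, not part of the BridgeLoc implementation):

```python
import numpy as np

def rot(theta_x, theta_y, theta_z):
    """Combined rotation matrix Upsilon(Theta) = R_x R_y R_z per Eqs. (4.4)-(4.6).
    Angles are in radians."""
    cx, sx = np.cos(theta_x), np.sin(theta_x)
    cy, sy = np.cos(theta_y), np.sin(theta_y)
    cz, sz = np.cos(theta_z), np.sin(theta_z)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def robot_position(theta_cg, v_c_g, v_o_c, theta_rg, v_o_r):
    """Eq. (4.8): the robot's ground position V_r^(G) from the tag's position
    in the camera frame (v_o_c) and in the robot frame (v_o_r), given the
    known camera pose (theta_cg, v_c_g) and the robot's orientation theta_rg.
    All position vectors are length-3 numpy arrays; angle arguments are
    3-tuples of radians."""
    return v_c_g + rot(*theta_cg) @ v_o_c - rot(*theta_rg) @ v_o_r
```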

Localization via mappings of multiple points: Next, we discuss how to localize the robot in the ground coordinate system (i.e., how to derive $V_r^{(G)}$). With FoV bridging, the visual tag captured by infrastructural cameras and the robot can act as a reference for robot localization. Due to the nature of camera sensing, we can find the visual tag's relative position (i.e., the object's bearing and distance) with respect to the robot simultaneously. As our visual tag is a square image (described in Section 4.2.3), one such tag has four corners as reference points. If we know the bearing and distance of at least 3 reference points, we can calculate the robot's position via triangulation and trilateration. In practice, the robot usually stands upright, moving on the ground with consistent camera height. If the robot's camera height is known, we only need at least 2 reference points to find the robot's position. To address imperfect object bearing and image detection noise and error, we use more reference objects via least squares regression, a standard approach to approximation with overdetermined observations [41]. This improves robot localization accuracy and robustness.

Suppose $m$ objects are detected and their ground positions are $V_{o_i}^{(G)}$, $1 \le i \le m$, which are obtained from Eq. (4.7). Let $M_o^{(G)} = [V_{o_1}^{(G)} \cdots V_{o_m}^{(G)}]$, $M_o^{(R)} = [V_{o_1}^{(R)} \cdots V_{o_m}^{(R)}]$, and $\mathbf{1}_m = [1 \cdots 1]$, a vector of all 1s with $m$ scalars. From Eq. (4.7), we have
\[
[V_r^{(G)} \cdots V_r^{(G)}] = V_r^{(G)} \cdot \mathbf{1}_m = M_o^{(G)} - \Upsilon(\Theta^{(R\to G)}) \cdot M_o^{(R)}. \tag{4.9}
\]
Thus, each column of Eq. (4.9) should be identical, namely $V_r^{(G)}$. We find
\[
\left[\, \Upsilon(\Theta^{(R\to G)}) \;\middle|\; V_r^{(G)} \,\right] = M_o^{(G)} \cdot \left[ M_o^{(R)} ; \mathbf{1}_m \right]^{\dagger},
\]
where $(\cdot)^{\dagger}$ is the pseudo-inverse matrix. The robot's position $V_r^{(G)}$ is the last column of this matrix. This is actually the least-squares approach, and we need at least 3 objects to find the robot's position. In practice, if the height $z_r^{(G)}$ is known, we only need 2 objects to find this position.
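The least-squares solution above can be sketched in a few lines of Python using NumPy's pseudo-inverse. This is an illustrative sketch under the assumption that at least three reference points are available, not the actual BridgeLoc code.

```python
import numpy as np

def robot_position_lsq(tag_pts_ground, tag_pts_robot):
    """Solve [Upsilon(Theta^(R->G)) | V_r^(G)] = M_o^(G) [M_o^(R); 1_m]^+ as in
    Eq. (4.9).  tag_pts_ground and tag_pts_robot are m x 3 arrays holding the
    same reference points' coordinates in G-XYZ and R-XYZ (m >= 3).  Returns
    the robot's ground position V_r^(G), the last column of the solution."""
    Mg = np.asarray(tag_pts_ground, dtype=float).T   # 3 x m
    Mr = np.asarray(tag_pts_robot, dtype=float).T    # 3 x m
    A = np.vstack([Mr, np.ones((1, Mr.shape[1]))])   # 4 x m: [M_o^(R); 1_m]
    T = Mg @ np.linalg.pinv(A)                       # 3 x 4: [Upsilon | V_r^(G)]
    return T[:, 3]
```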

B. Rotation Symmetric Tag Design

Each visual tag is simply a printed image with a particular visual pattern. In order to bridge robots’ and infrastructural cameras for robot localization, the tag patterns need to satisfy the following three requirements:

[R1] Tag patterns should provide at least two reference points for localization. We use a square tag and all four corners (also called vertices) serve as reference points. Tag patterns should enable the camera to automatically distinguish all reference points.

Figure 4.14: Visual tag structure and code block design. (a) Tag structure; (b) code block design.

[R2] Tag patterns should maximize the number of unique visual tags (the size of the "tag pool"). Each visual pattern should be unique for identification purposes such that the camera infrastructure identifies the camera that sees the exact same tag as the robot.

[R3] Each visual pattern should maximize the detection range. This implies that patterns should be detected easily and accurately even at far distances. However, due to the nature of camera projection, farther away visual tag patterns appear smaller in captured images, which hinders recognition. This raises a pattern design tradeoff between maximizing the number of unique patterns and maximizing the range of pattern detection.

In this section, we first present a basic tag design that proposes a set of unique visual tags to satisfy the above requirements. We elaborate the contradiction between R2 and R3. Finally, we describe our novel tag design to address this contradiction based on practical constraints of robot motion.

Tag structure: Figure 4.14 depicts a basic structure for visual tag design. Each tag is displayed in a square. To improve localization accuracy and robustness, maximizing the distance among the reference points is desirable. As a result, we choose the four corners of this square tag as the default reference points. Note that more reference points within the visual tag (e.g., the centroid and the top-middle point) can be selected in order to localize the robot. Each visual tag consists of two parts: a surrounding black border and the central area for the code blocks. The surrounding border is pure black and eases detection of the tag border as well as the four corner points.

The code block area consists of unique patterns and serves as the tag identifier. It comprises N × N squares (Figure 4.14(a) shows a 2 × 2 example). Each square is either black or white. We choose these colors as they maximize contrast, and pattern recognition with them is resilient to varying illumination and other environmental conditions. The ratio of the border and the pattern code is determined by the parameter b. Intuitively, given a fixed tag physical size, smaller values of b offer larger border areas that make corner detection easier and more accurate. On the other hand, these values of b hinder code pattern detection (especially at a distance) and even limit the number of unique tags that can be supported. We use b as a "control knob" to trade off between corner (reference point) detection and the size of the tag set. In this work, we set b = 5 by default based on empirical results.
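For concreteness, the following is a minimal Python sketch (illustrative only, not the tag generator used in our implementation) that lays out a tag bitmap following Figure 4.14: an N × N code block surrounded by a black border whose width is the code-area width divided by b.

```python
def render_tag(code_bits, N, b=5, cell_px=20):
    """Sketch of the tag layout in Figure 4.14.  code_bits is a row-major list
    of N*N values in {0, 1} (0 = black, 1 = white).  Per Table 4.3, b is the
    ratio of the code-area width to the border width (b = 5 by default), so
    border_px = (N * cell_px) // b.  Returns a 2D list of 0/1 pixels."""
    code_px = N * cell_px
    border_px = code_px // b
    size = code_px + 2 * border_px
    img = [[0] * size for _ in range(size)]          # start all black (border)
    for r in range(code_px):
        for c in range(code_px):
            bit = code_bits[(r // cell_px) * N + (c // cell_px)]
            img[border_px + r][border_px + c] = bit  # fill code blocks
    return img
```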

Uniqueness in code blocks: Theoretically, the robot and the infrastructural camera may see a visual tag from any direction; hence, the tag in the view plane may be rotated arbitrarily, as Figure 4.15 illustrates. Thus, we derive two constraints to construct the set of unique tags ([62] provides a detailed explanation of the derived constraints):


Figure 4.15: Two visual tag examples under circular shifts (0°-, 90°-, 180°-, and 270°-clockwise rotations in turn). Example 1: (a)–(d); Example 2: (e)–(h).

1. No tag should have the same pattern as itself after a rotation of 90◦, 180◦, or 270◦.

2. No tag should have the same pattern as another tag after a rotation of 90◦, 180◦, or 270◦.

Given N, we obtain the cardinality of unique tags by enumerating all possible tags that satisfy the previous two conditions. We find that the overall number of unique tags satisfies the following bound.

Theorem 5. Given $N$, the overall number of unique tags is $2^{N^2-2} - 2^{\lfloor (N^2+1)/2 \rfloor - 2}$.

The proof is as follows. Intuitively, the number of possible tags is $2^{N^2}$. By constraint 1, at most one of the four rotations remains, reducing the number to $2^{N^2-2}$. By constraint 2, we remove tags with rotational symmetry (e.g., Figure 4.15(a)); there are $2^{\lfloor (N^2+1)/2 \rfloor}$ such patterns but 4× fewer unique ones. For $N = 2$, there are 3 unique tags; for $N = 3$, there are 120.
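These counts can be checked by brute force. The following minimal Python sketch (not part of the dissertation's implementation) enumerates all N × N black/white patterns, groups them into rotation orbits, and counts the orbits that satisfy constraints 1 and 2; it reports 3 for N = 2 and 120 for N = 3.

```python
from itertools import product

def rotate90(tag, n):
    # Rotate an n x n tag (row-major tuple of 0/1 values) by 90 degrees clockwise.
    return tuple(tag[(n - 1 - c) * n + r] for r in range(n) for c in range(n))

def count_unique_tags(n):
    """Count one representative per rotation orbit, excluding tags that map to
    themselves under some rotation (constraints 1 and 2)."""
    seen, count = set(), 0
    for tag in product((0, 1), repeat=n * n):
        if tag in seen:
            continue
        orbit, t = {tag}, tag
        for _ in range(3):
            t = rotate90(t, n)
            orbit.add(t)
        seen |= orbit
        if len(orbit) == 4:   # no rotational self-symmetry
            count += 1        # keep exactly one representative of the orbit
    return count

print(count_unique_tags(2))   # 3
print(count_unique_tags(3))   # 120
```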

It thus seems feasible to satisfy requirement R2 easily when $N \ge 3$. However, this raises another issue: it makes tags harder to detect at a distance, which reduces the maximum range of tag recognition. We perform experiments to measure the impact of $N$ as follows. We print tags at three sizes (20 cm, 30 cm, and 40 cm, on A4, A3, and A2 pieces of paper, respectively) and place them at increasing distances from the test camera (view angle = 0°). We test two cameras: the Logitech C210 and the JD-510. Figure 4.16 shows that the maximum detection distance greatly shrinks as $N$ increases. Even when $N$ increases from 2 to 3, the image size of each square shrinks from 1/2 to 1/3 of the code area width; thus, the maximum detection distance decreases by 33.3% ($= \frac{1/6}{1/2}$). Therefore, setting $N = 2$ is preferable to $N = 3$ in order to satisfy requirement R3. However, the number of unique tags for $N = 2$ is insufficient to distinguish a large number of visual tags to cover a large area. Hence, we next leverage physical constraints and propose a much larger pool of "practically unique tags."

Figure 4.16: Maximum detection distance with respect to $N$. (a) Logitech C210 camera; (b) JD-510 camera.

Rotation symmetric tag: We find that, in practice, tag patterns do not need to be unique subject to any rotation, as some rotations are restricted and direction information may be known a priori. Although robots' and infrastructural cameras see tags at different angles, physical constraints exist, particularly for robots. We leverage three heuristics to increase the number of "unique" tags in practice.

First, the robot moves on the ground and its view rotation is constrained. In theory, a visual tag's view angle can be split into three parts: the angles $\alpha$, $\beta$, and $\gamma$, as Figure 4.17 shows. They are the horizontal and vertical view directions in front of the tag surface as well as the rotated view of the tag in the clockwise direction, respectively. Clearly $\gamma \in [0, 360°]$. As visual tags only have one surface, $\alpha, \beta \in [0, 180°]$. Thus, two views with reverse left-right order of a visual tag must be mirror symmetries of the tag (Figures 4.15(e) and 4.15(f)). However, since the visual tag is placed on the wall with only one side visible, the mirror cannot exist. Hence, the visual tag always holds the same left-right order regardless of whether it is seen from the left-hand or right-hand side. Moreover, the robot cannot roll over itself (i.e., its camera view cannot be upside down). By this property, any visual tags with 180° rotational symmetry can be treated as unique (see Figure 4.15(a)). Namely, $\beta \le 90°$ and $\gamma \le 180°$ in practice.

Figure 4.17: View rotations in theory.

This is because the robot moves on the ground and this motion constraint partitions the field of view of the robot’s camera into two areas: that above the camera and that below it. (This dissertation assumes that mobile robots move on level ground indoors.) Moreover, robots move continuously and their motion is much less flexible

92 than human motion. Thus, we can deploy two identical tags at two faraway places and a robot can still distinguish them based on its previous location and motion.

Third, due to their rotational asymmetry, mobile robots can distinguish visual tags based on tags’ discrete angles of rotation. The underlying reason is mobile robots’ physically constrained motion as explained previously. We rotate visual tags at 45◦ increments, which lets one visual tag pattern serve as eight different symbols based on its angle of rotation.

Visual tag manipulation and rotation constraints let us greatly increase the num- ber of unique tags in practical environments. Figure 4.18(b) shows ten tags used in our prototype when N = 2. This is much larger than the number of tags under uni- versally unique constraints. Moreover, we find the upper bound of practically unique tags satisfies the following theorem.

Theorem 6. Given N, the overall number of unique tags is 2N 2+2 8. − We prove Theorem 6 by accumulating the gain of all three approaches. Given that tag image captured by cameras can be neither vertically nor horizontally flipped, the two conditions of unique tags are not necessary any more. Thus, the number of practically unique tags is 2N 2 . However, since all white and all black tags may introduce many false positives in tag detection, we eliminate them and the number of practically unique tags is 2N 2 2. Moreover, we can place any visual tag higher − or lower than the robot’s camera height, which doubles the number of practically unique tags. Similarly, manually rotating visual tags by 45◦ also doubles the number.

In summary, the number of practically unique tags is 2 2 (2N 2 2) = 2N 2+2 8. × × − − When N = 2, the number of visual tags increases to 56. This is much larger than 3 and provides enough room and freedom to deploy visual tags for BridgeLoc.

93 C. Continuous Localization

In some real-world situations such as occlusion and poor infrastructure deploy-

ment, the robot lies outside infrastructural cameras’ views and cannot see any visual

tag. To continue localization in these cases, we incorporate the robot’s on-board

motion sensors to maintain its position. Kalman-filter-based [16] and Monte-Carlo-

based [25] approaches have been proposed to localize robots with motion sensors.

However, there are two problems with motion sensor localization: 1) its accuracy is

insufficient to rule out all candidate tags with the same visual pattern, especially when

they are very close to each other; and 2) motion sensor localization errors accumulate

over time and they cannot converge without calibration.

In the following, we develop a scheme that associates BridgeLoc’s view bridging

approach with motion sensor localization in order to estimate the robot’s position

when the bridging approach fails.

We mathematically formulate this problem as follows. Let $r_t$ denote the robot's pose in the ground coordinate system at time $t$, where $r_t = (x_t, y_t, \theta_t)^T$, $(x_t, y_t)$ is the position, $\theta_t$ is the orientation, and $t$ is a positive integer. $o_{1:t}$ denotes the vector of the robot's observations with respect to the tag from time 1 to $t$. $c_{1:t}$ is the control sequence given to the robot from time 1 to $t$. We aim to calculate the conditional probability
\[
p(r_t \mid o_{1:t}, c_{1:t}). \tag{4.10}
\]
We calculate the probability distribution of the robot's position given the observation of visual tags and the robot's movement history. We solve this as a recurrence problem using the Extended Kalman Filter (EKF) [80].

Next, we address practical cases: 1) If no tag is detected, which occurs at blind spots or occlusions, we cannot perform visual tag bridging and the predicted position is returned; and 2) If a tag is detected, but its pattern cannot uniquely determine the tag's position, we apply the range-bearing observation model $o_t = (l_t, \phi_t)^T$ to the detected tag, where $l_t$ and $\phi_t$ are the tag's respective distance and angle with respect to the robot. We compute the tag's observed position $(x_o, y_o)^T$ as
\[
\begin{bmatrix} x_o \\ y_o \end{bmatrix}
= \begin{bmatrix} x_t \\ y_t \end{bmatrix}
+ \begin{bmatrix} l_t \cos(\phi_t + \theta_t) \\ l_t \sin(\phi_t + \theta_t) \end{bmatrix}. \tag{4.11}
\]
We consider the candidate tag with the smallest distance to position $(x_o, y_o)^T$ as the observed tag and obtain its real position $(x_m, y_m)^T$. $h(\hat{r}_t)$ is the observation function, defined as
\[
h(\hat{r}_t) = \begin{bmatrix} \sqrt{(x_m - x_t)^2 + (y_m - y_t)^2} \\ \operatorname{atan2}(y_m - y_t,\; x_m - x_t) - \theta_t \end{bmatrix}. \tag{4.12}
\]
This is actually visual tag bridging that uses the tags' ground positions to correct the robot's position.
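As a minimal sketch of Eqs. (4.11) and (4.12), the following Python helpers are illustrative only, not the actual BridgeLoc EKF implementation.

```python
import math

def observed_tag_position(x_t, y_t, theta_t, l_t, phi_t):
    """Eq. (4.11): map a range-bearing observation (l_t, phi_t) taken from the
    robot pose (x_t, y_t, theta_t) to the tag's position in ground coordinates."""
    return (x_t + l_t * math.cos(phi_t + theta_t),
            y_t + l_t * math.sin(phi_t + theta_t))

def resolve_ambiguous_tag(observed_xy, candidate_positions):
    """Pick the candidate tag (same visual pattern) closest to the observed
    position; candidate_positions is a list of known (x_m, y_m) tuples."""
    return min(candidate_positions,
               key=lambda m: (m[0] - observed_xy[0]) ** 2 + (m[1] - observed_xy[1]) ** 2)

def expected_observation(x_t, y_t, theta_t, x_m, y_m):
    """Eq. (4.12): h(r_t), the expected range and bearing of tag (x_m, y_m)
    from robot pose (x_t, y_t, theta_t), used in the EKF update."""
    dx, dy = x_m - x_t, y_m - y_t
    return (math.hypot(dx, dy), math.atan2(dy, dx) - theta_t)
```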

4.2.4 Implementation and Evaluation

In this section, we present our system implementation and report real-world experimental results of BridgeLoc.

A. Implementation

System components: Figure 4.18(a) illustrates our prototype system. The implementation of our system has two main subsystems: a server subsystem for bridging robots' and infrastructural camera views and a mobile robot subsystem that can move autonomously in indoor environments as well as collect motion sensing data and send them to the server.

Figure 4.18: System implementation. (a) System prototype; (b) visual tag; (c) robot prototype.

The server is a laptop with an Intel Core i7 CPU and 8 GB RAM. We implement our bridging algorithm using C++ and OpenCV on the server. We also use Logitech C210s as infrastructural cameras.

In addition, we design and build a robot prototype to evaluate BridgeLoc, as Figure 4.18(c) shows. Our prototype has a laptop as the central controller that issues robot motion commands and collects and sends motion sensing data. The robot's motion system consists of an STM32 microcontroller, two DC motors with encoders, two wheels, and a caster. The STM32 microcontroller connects to the central controller via a serial port. This is a differential-drive wheeled robot that achieves various linear and rotational velocities by applying different voltages to the two DC motors. The encoders on the motors count the motors' revolutions, which enables odometer functionality. As the robot's view is limited by its height, we install a commodity wide-view camera (JD-510) on it. The camera's AoV is ∼120°; it captures imagery at 640 × 480 resolution at 30 frames per second. We install an infrared light atop the robot to ease infrastructural camera detection of it in the camera's FoV. In addition, we modify infrastructural cameras to capture infrared light.

System workflow: When the system is running, the server subsystem constantly detects the robot's infrared light. If detection succeeds, the server tells the robot its position. Otherwise, the server subsystem receives images sent from the robot and tries to localize it via FoV bridging. The robot tracks its position with the odometer and sends its images to the server, as the robot's hardware may not be capable of rapid image processing. The robot corrects its position after receiving the localization results from the server side.

B. Evaluation

We evaluate the effectiveness and accuracy of BridgeLoc in this section. We compare BridgeLoc with an approach using only infrastructural cameras (I-Cam).

We test our system with two variants: BridgeLoc and BridgeLoc Basic. BridgeLoc is the complete approach proposed in this dissertation. BridgeLoc Basic only applies our bridging technique, without motion sensing and the other advanced techniques described in Section 4.2.3. We use two metrics: localization error and success rate. The localization error is the distance of the shift away from the ground truth position. It only considers positions that are localized (i.e., if an approach fails to localize the robot, that position does not affect the approach's localization error). The success rate reflects the percentage of localized positions over all the test spots. For instance, if only half of a robot's path lies in infrastructural cameras' FoVs, its success rate is only 50%.

Figure 4.19: Scenarios for experimental evaluation. (a) Scenario 1; (b) Scenario 2.

Experimental setup: We evaluate BridgeLoc in two real-world scenarios, as shown in Figure 4.19. The first is a small office where we mainly perform benchmark tests of BridgeLoc's basic bridging technique. We place one visual tag in the coverage area, where its distance to the infrastructural camera is ∼1.5 m. We place our robot at different positions and view angles to test the localization error. We deploy one infrastructural camera, a Logitech C210 with a 50° angle of view and ∼6 m depth of field. The second scenario is a large lobby of a building with a complex background and varying illumination. We deploy three infrastructural cameras (Logitech C210s) in the lobby. The gray triangle shown in the figure indicates an area occluded from the top camera due to a corner in its FoV. We deploy 2–10 of the tags shown in Figure 4.18(b) to test the system performance. We perform the field test in this lobby, where our robot moves along specific and random paths around the whole environment under our control.

Figure 4.20: Localization error under different robot distances and view angles.

Localization accuracy through view bridging: We place the robot at various distances and view angles to the visual tag in the area shown in Figure 4.19(a). The distance varies from 1 m to 4 m, where 4 m is the robot's maximum tag detection distance. The robot's view angle ranges over [−60°, 60°]. Figure 4.20 shows the median error with regard to varying distances and angles. At 1 m distance and ±60° view angles, the tag is partially out of view, so those data are not reported. From the results, when the distance is less than 3 m and the view angle is within ±45°, the localization error is within 0.25 m, which is very low. However, when the distance exceeds 3 m, the error increases quickly. The reason is that the visual tag becomes very small in the robot's view, and small visual tag detection error in the image plane causes large view mapping and localization error. When the view angle exceeds ±45°, the tag appears at the edge of the robot's view. The distortion effect deforms the tag, hence increasing the error. Although we calibrate the camera to counter distortion, some error still occurs. This validates the effectiveness of BridgeLoc in expanding the localization range with reasonable submeter accuracy (comparable with vision-based localization).

Figure 4.21: Ground truth (solid lines) and localized path (dashed lines) of two sample traces in the field test. (a) Using 4 tags; (b) using 10 tags.

Number of tags: We further examine the impact of tag deployment. Figure 4.21 gives two sample robot paths in the field test area using 4 tags and 10 tags, respectively. The ground truth is marked by solid lines and the path estimated by BridgeLoc is shown in red dashed lines. When the robot is outside the infrastructural cameras' FoVs, I-Cam cannot localize the robot. For both examples, we can see that BridgeLoc is fairly accurate and effective in enlarging the coverage areas. The accuracy and success rate using 10 tags are slightly better than those using 4 (fewer) tags. This is also observed in the following test.

Field test: We perform more runs to test the effectiveness of BridgeLoc. The robot moves randomly in the area, where the number of deployed tags varies from 2 to 10. They are deployed higher or lower than the robot's camera. Figures 4.22 and 4.23 show the localization success rate and average localization error under different numbers of visual tags deployed in the environment. Experimental results show that our system achieves submeter accuracy (localization error ≤ 0.5 m) when there are 4 or more tags deployed. When the number of tags increases from 2 to 10, the average localization error drops by 54%. Visual tags also significantly improve the localization coverage area. Without visual tags, infrastructural cameras can only successfully localize 49% of the robot's path. With the help of 8 visual tags, the success rate exceeds 90%, and 10 visual tags essentially achieve full coverage.

Figure 4.22: Localization success rate under different numbers of tags in the environment.

Figure 4.23: Average localization error under different numbers of tags in the environment.

Our system is also resilient to low numbers of visual tags to a certain degree. In coverage blind spots, the robot can rely on motion sensing to track itself until it is localized by infrastructural cameras or visual tags. However, due to the accumulated error of motion sensing, localization error increases when the coverage of cameras and visual tags is very limited. This supports our idea that localization accuracy can be improved significantly by deploying lightweight visual tags in the environment.

Symbol            Description
Ξ-XYZ             A Cartesian (X, Y, Z) coordinate system with Ξ as the origin point. In this work, Ξ denotes ground (G), camera (C), and robot (R).
x_o^(Ξ)           The object o's coordinate on the X-axis (X_Ξ). o can be any point in the visual tag, the camera (its centroid), or the robot to be located.
y_o^(Ξ)           The object o's coordinate on the Y-axis (Y_Ξ).
z_o^(Ξ)           The object o's coordinate on the Z-axis (Z_Ξ).
V_o^(Ξ)           V_o^(Ξ) = [x_o^(Ξ), y_o^(Ξ), z_o^(Ξ)]^T, o's position vector in the Ξ coordinate system.
θ_x^(Ξ1→Ξ2)       The rotation angle from the Ξ1 coordinate system to the Ξ2 coordinate system projected onto the X-axis.
θ_y^(Ξ1→Ξ2)       The rotation angle from the Ξ1 coordinate system to the Ξ2 coordinate system projected onto the Y-axis.
θ_z^(Ξ1→Ξ2)       The rotation angle from the Ξ1 coordinate system to the Ξ2 coordinate system projected onto the Z-axis.
Θ^(Ξ1→Ξ2)         Θ^(Ξ1→Ξ2) = (θ_x^(Ξ1→Ξ2), θ_y^(Ξ1→Ξ2), θ_z^(Ξ1→Ξ2)), the rotation from the Ξ1 coordinate system to the Ξ2 coordinate system.
(u, v)^(Ξ)        The 2D coordinates in Ξ's image plane, where Ξ can be the infrastructural camera or the robot's camera.
TAG(k)            Code of visual tag k.
N                 The number of code blocks in one dimension of a visual tag.
b                 The ratio of the visual tag code area width to the border width.
r_t               Robot's pose vector in the ground coordinate system at time t.
o_t               Robot's observation vector at time t.
c_t               Robot's control vector (using motion sensors) at time t.
g(·)              Robot's control function.
h(·)              Robot's observation function.

Table 4.3: Notation.

4.3 Summary

In this chapter, we have presented the S-Mirror and BridgeLoc systems. In small scale environments, S-Mirror is a novel mobile robot tracking system that "reflects" various ambient signals toward mobile robots, expanding their sensing abilities. The S-Mirror infrastructure consists of S-Mirror nodes that mainly reflect visual signals as well as electronic and auditory ones. We developed a localization approach for accurate and timely mobile robot localization that integrates S-Mirror nodes and mobile robots' motion sensors in an end-to-end manner. We implemented the S-Mirror system and a mobile robot prototype on COTS hardware. Our experimental evaluation of S-Mirror showed that it achieved accurate localization with low network bandwidth as well as robustness and scalability to many mobile robots.

Then, we presented BridgeLoc, a large scale mobile robot tracking system that bridges mobile robots' cameras and infrastructural cameras via low-cost visual tags in order to enable accurate, scalable robot localization. We presented techniques for increasing the number of possible visual tags using physically constrained robot motion. We described a probabilistic technique that assists robot localization when robots are in blind spots of infrastructural camera coverage. We implemented BridgeLoc and a mobile robot prototype using COTS hardware and software. Experimental evaluation of our system showed its promise for accurate, scalable, and cost-effective robot localization.

Chapter 5: Final Remarks

In this dissertation, we studied human and mobile robot tracking in small and

large scale environments. As humans and mobile robots differ in appearance, mobil-

ity, computation and sensing capability, and feasibility of cooperation with tracking

systems, we first explored human tracking in environments with different scales. In

small scale environments, accuracy and real-time performance are two important

performance metrics for human tracking systems. We proposed VM-Tracking, one

such system that achieved accurate, real-time tracking by integrating visual camera

data (V ) with motion sensor data (M ) from humans’ mobile devices that they carry everywhere. We proposed an appearance-free tracking algorithm and a physical lo- cation based VM fusion algorithm to track visual human objects with the assistance of motion sensors in humans’ devices. Our algorithms significantly mitigated the de- pendency of long trajectories, the high computational overhead of video processing, and the occlusion problem, thereby providing continuous, accurate tracking results.

We conducted comprehensive experiments based on a large scale real-world implementation. The results showed our system's superior performance in terms of time efficiency and tracking accuracy. In large scale human tracking systems, accurately determining humans' identities and locations amidst different scenes is a key challenge. We presented EV-Matching, a large scale human tracking system that matches heterogeneous E and V data according to spatiotemporal information at large scale. In particular, we focused on addressing the practical challenges of noisy data and complex environments via three key algorithms: set splitting, filtering, and match refining. We implemented the distributed algorithms on Spark and tested them on a large synthetic EV dataset. The evaluation results demonstrated our algorithms' feasibility and efficiency.
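
For intuition about this style of spatiotemporal matching, the plain-Python sketch below scores device-person pairs by how often an electronic record and a visual record fall into the same zone and time window. The record formats and the co-occurrence scoring are assumptions for this example; the actual system relies on the set splitting, filtering, and match refining algorithms running on Spark.

    from collections import defaultdict

    def cooccurrence_scores(e_records, v_records, window=60):
        """Count how often each (device, person) pair shares a (zone, time-window) bucket.
        Assumed formats: e_records of (device_id, zone, timestamp) and v_records of
        (person_id, zone, timestamp), with timestamps in seconds."""
        def bucket(zone, ts):
            return (zone, int(ts // window))

        e_buckets = defaultdict(set)
        for device, zone, ts in e_records:
            e_buckets[bucket(zone, ts)].add(device)

        scores = defaultdict(int)
        for person, zone, ts in v_records:
            for device in e_buckets.get(bucket(zone, ts), ()):
                scores[(device, person)] += 1
        return dict(scores)

    E = [("imsi1", "gateA", 10), ("imsi1", "gateB", 130), ("imsi2", "gateA", 15)]
    V = [("p1", "gateA", 20), ("p1", "gateB", 140), ("p2", "gateA", 25)]
    print(cooccurrence_scores(E, V))                          # ("imsi1", "p1") co-occurs twice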

Next, we studied mobile robot tracking in small and large scale environments. We presented S-Mirror, a small scale mobile robot tracking system that "reflects" various (mainly visual) ambient signals towards mobile robots and greatly extends their sensing abilities. With sensing signals from S-Mirror, mobile robots accurately tracked themselves and offloaded computational burden from the infrastructure in a distributed manner. We implemented the S-Mirror system and a mobile robot prototype on commercial off-the-shelf (COTS) hardware. Our experimental evaluation of S-Mirror showed that it achieved accurate localization with low network bandwidth as well as robustness and scalability for many mobile robots. To track mobile robots in large scale environments, we proposed BridgeLoc, which bridges mobile robots' cameras and infrastructural cameras via low-cost visual tags in order to enable accurate, scalable robot tracking. We presented techniques that increase the number of usable visual tags by exploiting physically constrained robot motion. We also described a probabilistic technique that assists robot localization when robots are in blind spots of infrastructural camera coverage. We implemented our system and a mobile robot prototype using COTS hardware and software. Experimental evaluation of our system showed its promise for accurate, scalable, and cost-effective robot tracking in large scale environments.
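
One common way to realize such a probabilistic assist is a particle filter that keeps predicting from the control input while no camera or tag observation is available and reweights particles when one arrives. The sketch below is a generic illustration under an assumed unicycle-style motion model and Gaussian observation noise, not BridgeLoc's exact estimator.

    import numpy as np

    rng = np.random.default_rng(0)

    def predict(particles, control, noise=0.05):
        """Propagate pose particles [x, y, heading] with a (distance, turn) control.
        The motion model and noise level are illustrative assumptions."""
        dist, turn = control
        particles = particles.copy()
        particles[:, 2] += turn + rng.normal(0.0, noise, len(particles))
        step = dist + rng.normal(0.0, noise, len(particles))
        particles[:, 0] += step * np.cos(particles[:, 2])
        particles[:, 1] += step * np.sin(particles[:, 2])
        return particles

    def update(particles, observed_xy, sigma=0.2):
        """Reweight and resample when a camera/tag observation of (x, y) arrives."""
        d = np.linalg.norm(particles[:, :2] - observed_xy, axis=1)
        weights = np.exp(-0.5 * (d / sigma) ** 2) + 1e-12
        weights /= weights.sum()
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        return particles[idx]

    particles = np.zeros((500, 3))                            # start at the origin, heading 0
    for _ in range(5):                                        # blind spot: prediction only
        particles = predict(particles, control=(0.2, 0.0))
    particles = update(particles, observed_xy=np.array([1.0, 0.05]))
    print(particles[:, :2].mean(axis=0))                      # estimate pulled toward the fix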

Nevertheless, several directions remain for future work in human and mobile robot tracking. We plan to extend our existing work as follows:

Accurate and real-time human tracking without human cooperation: VM-Tracking requires humans' active, lightweight cooperation to collect motion sensing data from their mobile devices. However, as in EV-Matching, their devices constantly transmit electronic signals in a passive manner. Thus, we can also develop a real-time human tracking system using both visual and electronic signals. However, in small scale environments, electronic signals' received signal strength indicators (RSSIs) can only indicate humans' rough locations. How to accurately match E and V data to track humans is a key problem that we will address in future work.
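
For concreteness, the widely used log-distance path-loss model shows why RSSI yields only coarse range estimates; the parameter values here are typical indoor assumptions rather than measurements from our systems:

    RSSI(d) = RSSI(d_0) - 10 n log10(d / d_0) + X_σ,        d_hat = d_0 · 10^((RSSI(d_0) - RSSI(d)) / (10 n)),

where d_0 is a reference distance, n is the path-loss exponent (roughly 2 to 4 indoors), and X_σ is shadowing noise of several dB, so a few dB of fluctuation already translates into meters of range error.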

Testing and improvement of EV-Matching: Due to time and resource limitations, we conducted EV-Matching experiments on a large synthetic EV dataset whose V data come from a real-world dataset [47]. Although evaluations with the synthetic dataset illustrate our algorithms' feasibility and efficiency, evaluations based on complete real-world data may lead to further system improvements. In future work, we plan to set up a large scale real-world experiment that collects E and V data simultaneously and tests our EV-Matching system using these data.

Sensing and service extension of S-Mirror: In the future, we envision that S-Mirror will serve as a public service similar to public WiFi infrastructure. Such a service may offer other functionality in addition to mobile robot tracking. Naturally, this entails S-Mirror's "reflection" of heterogeneous sensing data such as acoustic and electronic signals. Accordingly, we plan to extend S-Mirror's functionality while incorporating other types of sensors.

Visual tag deployment optimization: For BridgeLoc, an optimal visual tag deployment should maximize the localization coverage area while minimizing the number of required visual tags. In the current work, we heuristically propose deploying tags along the borders of cameras' FoVs in order to maximize the view-bridging area. However, there are more facets of visual tag deployment. Some visual tags may be deployed outside of infrastructural cameras' views and serve as "anchors" for robot tracking and navigation. Although their ground locations are unknown, they can still inform the robot's movement via changes in their relative locations. The design of optimal deployment schemes is part of our future work.
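
One natural way to formalize this trade-off is as a coverage problem solved greedily. The sketch below, with hypothetical candidate positions and coverage sets, only illustrates the flavor of such a deployment optimizer and is not a method from the dissertation.

    def greedy_tag_deployment(candidate_coverage, target_cells, budget):
        """Greedy set-cover heuristic: repeatedly pick the candidate tag position that
        covers the most still-uncovered cells until the budget is spent or nothing helps.
        candidate_coverage maps a position to the set of cells a tag there would cover."""
        uncovered = set(target_cells)
        chosen = []
        while uncovered and len(chosen) < budget:
            best = max(candidate_coverage, key=lambda p: len(candidate_coverage[p] & uncovered))
            gained = candidate_coverage[best] & uncovered
            if not gained:
                break                                         # no remaining candidate helps
            chosen.append(best)
            uncovered -= gained
        return chosen, uncovered

    coverage = {"door": {1, 2, 3}, "corridor": {3, 4, 5, 6}, "corner": {6, 7}}
    tags, missed = greedy_tag_deployment(coverage, target_cells=range(1, 8), budget=2)
    print(tags, missed)                                       # ['corridor', 'door'] with cell 7 uncovered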

Besides improvements and extensions of existing work, we plan to explore the following new directions in order to realize the potential of the human-robot world:

General object tracking: Although humans and mobile robots play two major roles in this world, many other objects, such as animals and cars, coexist alongside them. These objects need to be tracked as well due to their strong interactions with humans and mobile robots. Thus, we plan to study general object tracking that is applicable to various entities in the human-robot world. Some ideas and techniques discussed in this dissertation are potentially useful for future work on general object tracking.

Human-robot interaction: Besides tracking, many other technologies are required to enable natural robotic interaction with humans. In particular, human behavior detection and facial expression recognition are two important approaches for robots to "understand" humans' states of mind. In addition, speech recognition and natural language processing are two crucial technologies for natural communication between humans and mobile robots. Recent breakthroughs in deep learning, a subfield of artificial intelligence, offer promise for fully realizing the human-robot world in future work.

Bibliography

[1] Indooratlas. http://www.indooratlas.com.

[2] World robotics 2015 service robots, 2016. http://www.ifr.org/service-robots/statistics/.

[3] World robotics 2016 service robots, 2017. https://ifr.org/ifr-press-releases/news/world-robotics-report-2016.

[4] Apache Spark. http://spark.apache.org.

[5] C. Arora and A. Globerson. Higher order matching for consistent multiple target tracking. In ICCV, 2013.

[6] S.-H. Baeg, J.-H. Park, J. Koh, K.-W. Park, and M.-H. Baeg. Building a smart home environment for service robots based on rfid and sensor networks. In Control, Automation and Systems, 2007. ICCAS'07. International Conference on, pages 1078–1082. IEEE, 2007.

[7] X. Bai, D. Xuan, Z. Yun, T. H. Lai, and W. Jia. Complete optimal deployment patterns for full-coverage and k-connectivity (k ≤ 6) wireless sensor networks. In Proceedings of the 9th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pages 401–410. ACM, 2008.

[8] N. Banerjee, S. Agarwal, P. Bahl, R. Chandra, A. Wolman, and M. D. Corner. Virtual compass: Relative positioning to sense mobile social interactions. In Pervasive '10, pages 1–21, 2010.

[9] A. Bedagkar-Gala and S. K. Shah. A survey of approaches and trends in person re-identification. Image and Vision Computing, 32(4):270–286, 2014.

[10] M. Betke and L. Gurvits. Mobile Robot Localization Using Landmarks. IEEE Trans. Robot. Autom., 13:251–263, Apr. 1997.

[11] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Online multiperson tracking-by-detection from a single, uncalibrated camera. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(9):1820–1833, 2011.

[12] W. Cai, C. Li, and S. Luan. Soi rf switch for wireless sensor network. arXiv preprint arXiv:1701.01763, 2017.

[13] W. Cai, J. , and S. Wang. Low power si class e power amplifier for healthcare application. International Journal of Electronics Communication and Computer Engineering, 7(6):290, 2016.

[14] W. Cai, X. Zhou, and X. Cui. Optimization of a gpu implementation of multi-dimensional rf pulse design algorithm. In Bioinformatics and Biomedical Engineering (iCBBE), 2011 5th International Conference on, pages 1–4. IEEE, 2011.

[15] T. Camp, J. Boleng, and V. Davies. A survey of mobility models for ad hoc network research. Wireless communications and mobile computing, 2(5):483– 502, 2002.

[16] H. Chung, L. Ojeda, and J. Borenstein. Accurate mobile robot dead-reckoning with a precision-calibrated fiber-optic . Robotics and Automation, IEEE Transactions on, 17(1):80–84, 2001.

[17] J. Dai, X. Bai, Z. Yang, Z. Shen, and D. Xuan. Mobile phone-based pervasive fall detection. Personal and , 14(7):633–643, 2010.

[18] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, volume 1, pages 886–893. Ieee, 2005.

[19] A. Del Bimbo, F. Dini, G. Lisanti, and F. Pernici. Exploiting distinctive visual landmark maps in pan–tilt–zoom camera networks. Computer Vision and Image Understanding, 114(6):611–623, 2010.

[20] E. DiGiampaolo and F. Martinelli. Mobile robot localization using the phase of passive uhf rfid signals. IEEE Transactions on Industrial Electronics, 61(1):365– 376, 2014.

[21] M. Donoser and D. Schmalstieg. Discriminative feature-to-point matching in image-based localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 516–523, 2014.

[22] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.

[23] M. Fiala. Artag, a fiducial marker system using digital techniques. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 590–596. IEEE, 2005.

[24] A. Filippeschi, N. Schmitz, M. Miezal, G. Bleser, E. Ruffaldi, and D. Stricker. Survey of motion tracking methods based on inertial sensors: A focus on upper limb human motion. Sensors, 17(6):1257, 2017.

[25] D. Fox, W. Burgard, F. Dellaert, and S. Thrun. Monte carlo localization: Efficient position estimation for mobile robots. AAAI/IAAI, 1999:343–349, 1999.

[26] E. Foxlin. Pedestrian tracking with shoe-mounted inertial sensors. Computer Graphics and Applications, IEEE, 25(6):38–46, 2005.

[27] E. Foxlin, L. Naimark, et al. Vis-tracker: A wearable vision-inertial self-tracker. VR, 3:199, 2003.

[28] foxnews. Security camera surge in Chicago sparks concerns of 'massive surveillance system', 5 Nov. 2014. http://www.foxnews.com/.

[29] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha. Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review, 43(1):55–81, 2015.

[30] H.-M. Gross, H. Boehme, C. Schroeter, S. Müller, A. König, E. Einhorn, C. Martin, M. Merten, and A. Bley. Toomas: interactive shopping guide robots in everyday use - final implementation and experiences from long-term field trials. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2005–2012. IEEE, 2009.

[31] S. Hanoun, A. Bhatti, D. Creighton, S. Nahavandi, P. Crothers, and G. Carroll. Task assignment in camera networks: A reactive approach for manufacturing environments. IEEE Systems Journal, 2016.

[32] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.

[33] M. S. Hassan, A. F. Khan, M. W. Khan, M. Uzair, and K. Khurshid. A computationally low cost vision based tracking algorithm for human following robot. In Control, Automation and Robotics (ICCAR), 2016 2nd International Conference on, pages 62–65. IEEE, 2016.

[34] W. Hess, D. Kohler, H. Rapp, and D. Andor. Real-time loop closure in 2d lidar slam. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1271–1278. IEEE, 2016.

[35] T. . Surveillance video: The biggest big data. IEEE Computer Society [Online], 7(2), 2014.

[36] W. Huang, Y. Xiong, X.-Y. Li, H. , X. Mao, P. Yang, and Y. . Shake and walk: Acoustic direction finding and fine-grained indoor localization using smartphones. In INFOCOM, 2014 Proceedings IEEE, 2014.

[37] T.-W. Kan, C.-H. Teng, and W.-S. Chou. Applying qr code in augmented reality applications. In Proceedings of the 8th International Conference on Continuum and its Applications in Industry, pages 253–257. ACM, 2009.

[38] M. Kobilarov, G. Sukhatme, J. Hyams, and P. Batavia. People tracking and following with mobile robot using an omnidirectional camera and a laser. In Robotics and Automation, 2006. ICRA 2006. Proceedings 2006 IEEE International Conference on, pages 557–562. IEEE, 2006.

[39] Y. Kuniyoshi, J. Rickki, M. Ishii, S. Rougeaux, N. Kita, S. Sakane, and M. Kakikura. Vision-based behaviors for multi-robot cooperation. In Intelligent Robots and Systems '94, 'Advanced Robotic Systems and the Real World', IROS '94. Proceedings of the IEEE/RSJ/GI International Conference on, volume 2, pages 925–932. IEEE, 1994.

[40] Y. Lao, J. Zhu, and Y. F. Zheng. Sequential particle generation for visual tracking. Circuits and Systems for Video Technology, IEEE Transactions on, 19(9):1365–1378, 2009.

[41] S. J. Leon. Linear algebra with applications. Macmillan New York, 1980.

[42] J. J. Leonard and H. F. Durrant-Whyte. Mobile robot localization by tracking geometric beacons. IEEE Transactions on robotics and Automation, 7(3):376– 382, 1991.

[43] F. Li, C. Zhao, G. Ding, J. Gong, C. Liu, and F. Zhao. A reliable and accurate indoor localization method using phone inertial sensors. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pages 421–430. ACM, 2012.

[44] G. Li. A Holistic Study on Electronic and Visual Signal Integration for Efficient Surveillance. The Ohio State University, 2017.

[45] G. Li, F. Yang, G. Chen, Q. Zhai, X. Li, J. Teng, J. Zhu, D. Xuan, B. Chen, and W. Zhao. Ev-matching: Bridging large visual data and electronic data for efficient surveillance. In Distributed Computing Systems (ICDCS), 2017 IEEE Conference on. IEEE, 2017.

[46] M. Li, B. H. Kim, and A. I. Mourikis. Real-time motion tracking on a cellphone using inertial sensing and a rolling- camera. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 4712–4719. IEEE, 2013.

[47] W. Li and X. Wang. Locally aligned feature transforms across views. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3594–3601. IEEE, 2013.

[48] X. Li, J. Teng, Q. Zhai, J. Zhu, D. Xuan, Y. F. Zheng, and W. Zhao. Ev-human: Human localization via visual estimation of body electronic interference. In Proc. of IEEE INFOCOM Mini, 2013.

[49] G. Ligorio and A. M. Sabatini. A novel kalman filter for human motion tracking with an inertial-based dynamic inclinometer. IEEE Transactions on Biomedical Engineering, 62(8):2033–2043, 2015.

[50] T. Lindhorst, G. Lukas, and E. Nett. Wireless mesh network infrastructure for industrial applications – a case study of tele-operated mobile robots. In Proc. IEEE ITFA, 2013.

[51] K. Liu, X. Liu, and X. Li. Guoguo: Enabling fine-grained indoor localization via smartphone. In Proceeding of the 11th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys ’13, 2013.

[52] F. , W. Zhang, N. Lu, and M.-m. Song. The environmental cognition and agilely service in home service robot intelligent space based on multi-pattern information model and zigbee wireless sensor networks. In Networking, Sensing and Control (ICNSC), 2014 IEEE 11th International Conference on, pages 273– 278. IEEE, 2014.

[53] Y.-A. Lu and G.-S. Huang. Positioning and navigation of meal delivery robot using magnetic sensors and rfid. In Computer, Consumer and Control (IS3C), 2014 International Symposium on, pages 808–811. IEEE, 2014.

[54] V. Lui and T. Drummond. Image based optimisation without global consistency for constant time monocular visual slam. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 5799–5806. IEEE, 2015.

[55] C. Lundberg, R. Reinhold, and H. I. Christensen. Evaluation of robot deployment in live missions with the military, police, and fire brigade. In Defense and Security Symposium, pages 65380R–65380R. International Society for Optics and Photonics, 2007.

[56] Y. Ma, K. Pahlavan, and Y. . Comparative behavioral modeling of poa and toa ranging for location-awareness using rfid. International Journal of Wireless Information Networks, 3(23):187–198, 2016.

[57] MarketsandMarkets. Video Surveillance Systems & Services Market - Analysis & Forecast (2013 - 2018). http://goo.gl/MdmIvX, 2013.

[58] E. Martinez-Martin and A. P. Del Pobil. Robust motion detection and tracking for human-robot interaction. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, pages 401–402. ACM, 2017.

[59] V. P. Munishwar, V. Kolar, and N. B. Abu-Ghazaleh. Coverage in visual sensor networks with pan-tilt-zoom cameras: the maxfov problem. In Proc. of IEEE INFOCOM, 2014.

[60] R. Mur-Artal, J. Montiel, and J. D. Tardós. Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.

[61] M. Ocaña, L. M. Bergasa, M. A. Sotelo, J. Nuevo, and R. Flores. Indoor Robot Localization System Using WiFi Signal Measure and Minimizing Calibration Effort. In Proc. IEEE ISIE, 2005.

[62] E. Olson. Apriltag: A robust and flexible visual fiducial system. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 3400–3407. IEEE, 2011.

[63] C. Peng, G. Shen, Y. Zhang, Y. Li, and K. Tan. Beepbeep: A high accuracy acoustic ranging system using cots mobile devices. In Proc. ACM SenSys, pages 1–14. ACM, 2007.

[64] Pinhole Camera Model. https://en.wikipedia.org/wiki/Pinhole_camera_model.

[65] J. Pramis. Number of mobile phones to exceed world population by 2014, 2 Nov. 2013. http://www.digitaltrends.com/mobile/mobile-phone-world-population-2014/.

[66] S. Ren, Q. Li, H. Wang, X. Chen, and X. Zhang. A study on object tracking quality under probabilistic coverage in sensor networks. ACM SIGMOBILE Mobile Computing and Communications Review, 9(1):73–76, 2005.

[67] A. Richardson and E. Olson. Pas: Visual odometry with perspective alignment search. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1053–1059. IEEE, 2014.

[68] D. Ristić-Durrant, G. Gao, and A. Leu. Low-level sensor fusion-based human tracking for mobile robot. Facta Universitatis, Series: Automatic Control and Robotics, 1(1):17–32, 2016.

[69] D. Roetenberg, P. J. Slycke, and P. H. Veltink. Ambulatory position and orientation tracking fusing magnetic and inertial sensing. Biomedical Engineering, IEEE Transactions on, 54(5):883–890, 2007.

[70] H. C. R. Roh, C. H. Sung, M. T. , and M. J. Chung. Point Pattern Matching Based Visual Global Localization using Ceiling Lights. In Proc. IEEE ASCC, 2011.

[71] D. Ross, J. Lim, and M.-H. Yang. Adaptive probabilistic visual tracking with incremental subspace update. In Computer Vision-ECCV 2004, pages 470–482. Springer, 2004.

[72] T. Sattler, B. Leibe, and L. Kobbelt. Fast image-based localization using direct 2d-to-3d matching. In 2011 International Conference on Computer Vision, pages 667–674. IEEE, 2011.

[73] S. Sen, B. Radunovic, R. R. Choudhury, and T. Minka. You are facing the mona lisa: spot localization using phy layer information. In Proceedings of the 10th international conference on Mobile systems, applications, and services, pages 183–196. ACM, 2012.

[74] Y. Shen, W. Hu, J. Liu, M. Yang, B. Wei, and C. T. Chou. Efficient background subtraction for real-time tracking in embedded camera networks. In Proceedings of the 10th ACM Conference on Embedded Network Sensor Systems, pages 295– 308. ACM, 2012.

[75] P. Sturm. Pinhole camera model. In K. Ikeuchi, editor, Computer Vision: A Reference Guide, pages 610–613. Springer, Boston, MA, 2014.

[76] T. Teixeira, D. Jung, and A. Savvides. Pem-id: Identifying people by gait-matching using cameras and wearable . In Proceedings of International Conference on Distributed Smart Cameras. ACM, 2009.

[77] T. Teixeira, D. Jung, and A. Savvides. Tasking networked cctv cameras and mobile phones to identify and localize multiple people. In Proceedings of the 12th ACM international conference on Ubiquitous computing, pages 213–222. ACM, 2010.

[78] J. Teng, J. Zhu, B. Zhang, D. Xuan, and Y. F. Zheng. Ev: efficient visual surveillance with electronic footprints. In INFOCOM, 2012 Proceedings IEEE, pages 109–117. IEEE, 2012.

[79] S. Thrun. Bayesian Landmark Learning for Mobile Robot Localization. Mach. Learn., 33:41–76, 1998.

[80] S. Thrun, W. Burgard, and D. Fox. Probabilistic robotics. MIT press, 2005.

[81] G. Tian. Wide future for home service robot research. International Academic Developments, 1:28–29, 2007.

[82] R. Vidal, O. Shakernia, and S. Sastry. Formation control of nonholonomic mobile robots with omnidirectional visual servoing and motion segmentation. In Robotics and Automation, 2003. Proceedings. ICRA'03. IEEE International Conference on, volume 1, pages 584–589. IEEE, 2003.

[83] Wikipedia. Angle of view. https://en.wikipedia.org/wiki/Angle_of_view.

[84] Y. , J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In Proc. of CVPR, pages 2411–2418. IEEE, 2013.

[85] J. Xiong and K. Jamieson. Arraytrack: A fine-grained indoor location system. In NSDI, 10th USENIX Symposium on Networked Systems Design and Implementation, 2013.

[86] E. Yang, J. Gwak, and M. Jeon. Multi-human tracking using part-based appearance modelling and grouping-based tracklet association for visual surveillance applications. Multimedia Tools and Applications, 76(5):6731–6754, 2017.

[87] F. Yang, Q. Zhai, G. Chen, A. C. Champion, J. Zhu, and D. Xuan. Flash-loc: Flashing mobile phones for accurate indoor localization. In Computer Communications, IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on, pages 1–9. IEEE, 2016.

[88] Z. Yang, C. Wu, and Y. Liu. Locating in fingerprint space: wireless indoor localization with little human intervention. In Proceedings of the 18th annual international conference on Mobile computing and networking, pages 269–280. ACM, 2012.

[89] K.-T. Yu, C.-P. Lam, M.-F. , W.-H. , S.-H. Tseng, and L.-C. Fu. An interactive robotic walker for assisting elderly mobility in senior care unit. In 2010 IEEE Workshop on Advanced Robotics and its Social Impacts, pages 24–29. IEEE, 2010.

[90] S.-I. Yu, Y. Yang, and A. Hauptmann. Harry potter's marauder's map: Localizing and tracking multiple persons-of-interest by nonnegative discretization. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3714–3720. IEEE, 2013.

[91] Q. Zhai, S. Ding, X. Li, F. Yang, J. Teng, J. Zhu, D. Xuan, Y. F. Zheng, and W. Zhao. Vm-tracking: Visual-motion sensing integration for real-time human tracking. In Computer Communications (INFOCOM), 2015 IEEE Conference on, pages 711–719. IEEE, 2015.

[92] Q. Zhai, F. Yang, A. Champion, C. Peng, J. Zhu, D. Xuan, B. Chen, Y. F. Zheng, and W. Zhao. S-mirror: Mirroring sensing signals for mobile robots in indoor environments. In Mobile Ad-Hoc and Sensor Networks (MSN), 2016 12th International Conference on, pages 301–305. IEEE, 2016.

[93] B. Zhang, J. Teng, J. Zhu, X. Li, D. Xuan, and Y. F. Zheng. EVLoc: Integrating electronic and visual signals for accurate localization. In Proc. of ACM MobiHoc, 2012.

[94] J. Zhang and S. Singh. Visual-lidar odometry and mapping: Low-drift, robust, and fast. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 2174–2181. IEEE, 2015.

[95] Z. Zhang, X. Zhou, W. Zhang, Y. Zhang, G. Wang, B. Y. Zhao, and H. Zheng. I am the antenna: accurate outdoor AP location using smartphones. In Proceedings of ACM MobiCom, pages 109–120, 2011.
