MULTI-CAMERA VEHICLE TRACKING USING VIDEO ANALYTICS

J. KAVITHA1, MRS. K. BALA2, DR. D. RAJINIGIRINATH3
1 P.G. Student, 2 Asst. Prof., 3 Professor, Department of Computer Science and Engineering, Sri Muthukumaran Institute of Technology, Chikkarayapuram, Chennai, Tamil Nadu 600069, India

Abstract—Video-enabled cameras will soon be the most pervasive type of sensor. Such cameras will enable continuous surveillance through heterogeneous camera networks consisting of fixed camera systems as well as cameras on mobile devices. The challenge in these networks is to enable efficient video analytics: the ability to process videos cheaply and quickly to enable searching for specific events or sequences of events. In this paper, we discuss the design and implementation of Kestrel, a video analytics system that tracks the path of vehicles across a heterogeneous camera network. In Kestrel, fixed camera feeds are processed on the cloud, and mobile devices are invoked only to resolve ambiguities in vehicle tracks. Kestrel's mobile device pipeline detects objects using a deep neural network, extracts attributes using cheap visual features, and resolves path ambiguities by careful association of vehicle visual descriptors, while using several optimizations to conserve energy and reduce latency. Our evaluations show that Kestrel can achieve precision and recall comparable to a fixed camera network of the same size and topology.

Index Terms—Cyber-physical Systems, Video Analytics, Vehicle Trajectory Inference, Heterogeneous Camera Network

I. INTRODUCTION

Video cameras will soon be, if they are not already, the most pervasive type of sensor in the Internet of Things. They are ubiquitous in public places, available on mobile devices and drones, in cars as dashcams, and on security personnel as bodycams. Such cameras are valuable because they provide rich contextual information, but they pose a challenge for the very same reason: it requires significant manual effort to extract meaningful semantic information from these videos. To address this, recent research [20] has started exploring the design of video analytics systems, which automatically process videos in order to extract semantic information.

Heterogeneous Camera Networks: Future video analytics systems will need to operate on heterogeneous camera networks. These networks consist of fixed camera surveillance systems, which can be engineered such that camera feeds can be transmitted to a cloud-based processing system [20], as well as more constrained camera devices (mobile cameras, dash cams, and body cams) that may be wirelessly connected, have limited processing power, and, in some cases, may be battery powered. The advantage of heterogeneous camera networks is that they can greatly increase accuracy and coverage. This is, by now, well established even in the public sphere, after the success in using mobile phone images and videos in the Boston marathon bombing a few years ago. In particular, cities like Chicago [11] also have dash cams for recording traffic stops.

Video Analytics for Vehicle Tracking: As a first step to understanding how to design video analytics systems in heterogeneous camera networks, we take a specific problem: to automatically detect a path taken by a vehicle through a heterogeneous network of non-overlapping cameras. In our problem setting, each mobile or fixed camera either continuously or intermittently captures video, together with associated metadata (camera location, orientation, resolution, etc.). Conceptually, this creates a large corpus of videos over which users may pose several kinds of queries, either in near real-time or after the fact, the most general of which is: What path did a given car, seen at a given camera, take through the camera network? Commercial surveillance systems support neither automated multi-camera vehicle tracking nor heterogeneous camera networks. They either require a centralized collection of videos [3], or perform object detection on an individual camera, leaving it to a human operator to perform association of vehicles across cameras by manual inspection [1].

Contributions: This paper describes the design of Kestrel (§II), a video analytics system for vehicle tracking. Users of Kestrel provide an image of a vehicle (captured, for example, by the user inspecting video from a static camera), and the system returns the sequence of cameras (i.e., the path through the camera network) at which this vehicle was seen. The system carefully selects, and appropriately adapts, the right combination of vision algorithms to enable accurate end-to-end tracking, even when individual components have less than perfect accuracy, while respecting mobile device constraints. Kestrel addresses the challenge described above using one key architectural idea (Figure 1). Its cloud pipeline continuously processes videos streamed from fixed cameras to extract vehicles and their attributes, and, when a query is posed, computes vehicle trajectories across these fixed cameras. Only when there is ambiguity in these trajectories does Kestrel query one or more mobile devices (which run a mobile pipeline optimized for video processing on mobile devices). Thus, mobile devices are invoked only when absolutely necessary, significantly reducing resource consumption on mobile devices.

Kestrel uses novel techniques for fast execution of deep (those with 20 or more layers) Convolutional Neural Networks (CNNs) on mobile device embedded GPUs (§II-A). These deep CNNs are more accurate, but mobile GPUs cannot execute them because of insufficient memory. By quantifying the memory consumption of each layer, we have designed a novel approach that offloads the bottleneck layers to the mobile device's CPU and pipelines frame processing without impacting accuracy. Kestrel leverages these optimizations as an essential building block to run a deep CNN [21] on mobile GPUs for detecting objects and drawing bounding boxes around them on a frame.

Kestrel employs an accurate and efficient object tracker for objects within a single video (§II-B), enabling an order of magnitude more efficiency than per-frame object detection.

This tracker runs on the mobile device, and we use it to extract object attributes, like the direction of motion, as well as object features for matching. Existing general-purpose trackers are computationally expensive, or can only track single objects. Kestrel only periodically detects objects (thereby reducing the energy cost of object detection), say every k video frames, so that the tracking algorithm has to be accurate only in between frames at which detection is applied.

II. KESTREL DESIGN

Figure 1 shows the system architecture of Kestrel, which consists of a mobile device pipeline and a cloud pipeline. Videos from fixed cameras are streamed and stored on a cloud. Mobile devices store videos locally and only send metadata to the cloud specifying when and where the video was taken. Given a vehicle to track through the camera network, the cloud detects cars in each frame (object detection), determines the direction of motion of each car in each single camera view (tracking), and extracts visual descriptors for each car (attribute extraction). Then, across nearby cameras, it tries to match cars (cross-camera association), and uses these matches to infer the path the car took (path inference), using only videos from the fixed cameras. Only when the cloud determines that it has low confidence in the result does it use the metadata from the mobile devices to query the relevant mobile devices to increase tracking accuracy. Once a mobile device receives the query, it performs roughly the same steps, but only on a small segment of the captured video (and some of these steps are optimized to save energy). More specifically, the cloud sends the mobile device its inferred candidate paths, and the mobile pipeline re-ranks these paths. The outcome is increased confidence in the estimated path. In the following sections, we describe Kestrel's components: some components run on the cloud alone, others run on both the cloud and the mobile device.
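To make this division of labor concrete, the following minimal Python sketch traces the query flow just described. It is an illustration, not code from the Kestrel implementation: the names (infer_paths, relevant_devices, rerank) and the 0.8 confidence threshold are assumptions we introduce for exposition.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # assumed tunable parameter, not from the paper

@dataclass
class CandidatePath:
    cameras: list       # ordered camera IDs the vehicle may have traversed
    confidence: float   # cloud's score from cross-camera association

def track_vehicle(query_image, cloud, mobile_devices):
    # The cloud pipeline alone answers high-confidence queries,
    # using fixed-camera footage only.
    paths = cloud.infer_paths(query_image)
    best = max(paths, key=lambda p: p.confidence)
    if best.confidence >= CONFIDENCE_THRESHOLD:
        return best  # no mobile device is touched, no battery spent
    # Ambiguous result: query only the mobile devices whose uploaded
    # time/location metadata overlaps the candidate paths, and let
    # their (optimized) pipelines re-rank the candidates.
    for device in cloud.relevant_devices(paths, mobile_devices):
        paths = device.rerank(paths, query_image)
    return max(paths, key=lambda p: p.confidence)
```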

A. Object Detection: Deep CNNs

Kestrel uses GPUs on mobile devices to run Convolutional Neural Networks (CNNs) for vehicle detection. Many mobile devices incorporate GPUs (like nVidia's TK1 and TX1), including mobile phones such as the Lenovo Phab 2 Pro [25] and the Asus ZenFone AR [2], as well as tablets like Google's Pixel C [16] and Google Tango devices [15]. Unfortunately, the memory requirements of deep CNNs are outpacing the growth in mobile GPU memory. The trend today is towards deeper CNNs (e.g., ResNet has 152 layers) with increasing memory footprints. Kestrel carefully optimizes the CNN memory footprint to enable state-of-the-art object detection on mobile GPUs.
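A back-of-the-envelope audit of per-layer weight memory shows why this optimization is necessary. The sketch below is purely illustrative (the layer dimensions are invented, not YOLO's): even a single modest fully connected (FC) layer dwarfs an early convolutional layer, which is why the FC layers become the bottleneck addressed next.

```python
def layer_memory_bytes(layers, bytes_per_param=4):
    """Estimate per-layer weight memory for a CNN, assuming float32 weights."""
    report = {}
    for name, spec in layers.items():
        if spec["type"] == "conv":
            # k x k kernels, one per (input channel, output channel) pair
            n = spec["k"] ** 2 * spec["c_in"] * spec["c_out"] + spec["c_out"]
        else:  # fully connected: dense n_in x n_out weight matrix
            n = spec["n_in"] * spec["n_out"] + spec["n_out"]
        report[name] = n * bytes_per_param
    return report

# Invented dimensions for illustration: the conv layer rounds to 0.0 MiB
# while the FC layer alone needs ~784 MiB, more than many mobile GPUs have.
layers = {
    "conv1": {"type": "conv", "k": 3, "c_in": 3, "c_out": 64},
    "fc1":   {"type": "fc", "n_in": 50176, "n_out": 4096},
}
for name, nbytes in layer_memory_bytes(layers).items():
    print(f"{name}: {nbytes / 2**20:.1f} MiB")
```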

CPU Offload Optimization: Prior work has considered offloading computation to overcome processor limitations [8, 18]. We explore offloading only the FC layer computation to the CPU because, since the CPU supports virtual memory, it can more effectively manage the memory over-subscription required for the FC layer. With CPU offloading, we observe that when the CPU is running the FC layer on the first video frame, the GPU cores are idle. So we adopt a pipelining strategy that starts running the second video frame on the GPU cores rather than letting them idle. To achieve this kind of pipelining, we run the FC layers on the CPU on a separate thread, as GPU computation is managed by the main thread.
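The following minimal Python sketch illustrates this producer/consumer pipelining pattern. It is a simplified illustration, not Kestrel's code: conv_layers and fc_layers stand in for the GPU-resident convolutional stage and the CPU-resident FC stage of the detector, which the paper does not name.

```python
import threading
import queue

frames_in = queue.Queue(maxsize=2)  # frames waiting for the conv layers
fc_in = queue.Queue(maxsize=2)      # conv outputs waiting for the FC layers

def gpu_worker(conv_layers):
    # While the CPU is still busy with frame i's FC layers, this stage
    # starts the conv layers of frame i+1 instead of letting GPU cores idle.
    while True:
        frame = frames_in.get()
        if frame is None:            # sentinel: input exhausted
            fc_in.put(None)
            return
        fc_in.put(conv_layers(frame))

def cpu_worker(fc_layers, results):
    # Separate thread for the memory-hungry FC layers: the CPU's virtual
    # memory absorbs the over-subscription that the GPU cannot.
    while True:
        features = fc_in.get()
        if features is None:
            return
        results.append(fc_layers(features))

def detect(frames, conv_layers, fc_layers):
    results = []
    gpu = threading.Thread(target=gpu_worker, args=(conv_layers,))
    cpu = threading.Thread(target=cpu_worker, args=(fc_layers, results))
    gpu.start()
    cpu.start()
    for frame in frames:
        frames_in.put(frame)   # bounded queues provide back-pressure
    frames_in.put(None)        # signal end of input
    gpu.join()
    cpu.join()
    return results
```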

Other memory optimizations: In our evaluation (§III-E), we compare offloading with pipelining against other memory optimizations that have been proposed in the literature. We explore reducing the size of the FC layer ([12] has explored similar techniques in different settings), which can potentially reduce detection accuracy. The FC layer multiplies a large weight matrix with the previous layer's output to obtain the output vector; splitting this matrix multiplication and reading the weight matrix in parts allows us to reuse memory. However, re-using memory and overwriting the used chunks incurs additional file access overheads.

B. Attribute Extraction

After detecting objects, Kestrel extracts two attributes of each object from a video: (i) its direction of motion, and (ii) a low-complexity visual descriptor. These attributes are used to associate objects across cameras (§II-C). To estimate the direction of motion, it is important to track the object across successive video frames.

Light-weight Multi-Object Tracking: To track objects in a video captured at a single camera, Kestrel needs to take objects detected in successive frames and associate them across multiple frames of the video. Our tracking algorithm serves two functions: (i) it reduces the number of times YOLO is invoked, which in turn reduces latency and conserves energy, and (ii) it is used to estimate the direction of motion. State-of-the-art object tracking techniques [13] take as input the pre-computed bounding box of the object, extract sophisticated features like SIFT [9] or SURF [17] from the bounding box, and then iteratively infer the position of these key-points in subsequent frames using feature matching algorithms like RANSAC.

Object State Extraction: An object seen in a single camera can go through different states: it can enter the scene, leave the scene, move, or be stationary. Determining these states is important to filter out uninteresting objects like permanently parked cars. Kestrel differentiates moving objects from stationary ones by detecting the difference between the optical flow in the bounding boxes and in the background. For moving objects, the state of the object changes from entering to moving to exited as it moves through the camera scene. A moving object can become stationary in the scene, and a stationary object can start moving at any time. For example, a car may come to a stop sign or a traffic light at an intersection, or park temporarily to load passengers. If the object is detected as having exited the scene, then the object can be evicted. Kestrel uses fast eviction: after an object exits the scene, all the feature points and information for tracking are evicted from memory. This not only helps save memory, but also improves accuracy, because evicting the keypoints of previous objects reduces the search space for matching candidates (§II-C).
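A minimal sketch of this per-object state machine appears below. It is illustrative only: the eps motion threshold is an assumed parameter (the paper does not specify one), and the flow inputs are the mean optical-flow vectors of the bounding box and of the background.

```python
import math

ENTERING, MOVING, STATIONARY, EXITED = "entering", "moving", "stationary", "exited"

class TrackedObject:
    def __init__(self, box):
        self.state = ENTERING   # newly detected objects start here
        self.box = box
        self.keypoints = []     # features kept for cross-camera matching

def update_state(obj, box_flow, background_flow, in_scene, eps=0.5):
    """box_flow and background_flow are mean (dx, dy) optical-flow vectors;
    eps is an assumed pixels-per-frame threshold, not a value from the paper."""
    if not in_scene:
        obj.state = EXITED
        obj.keypoints = None    # fast eviction: frees memory and shrinks
        return                  # the matching search space (see Section II-C)
    # An object is moving when its bounding-box flow differs from the
    # background flow (camera motion) by more than eps.
    dx = box_flow[0] - background_flow[0]
    dy = box_flow[1] - background_flow[1]
    obj.state = MOVING if math.hypot(dx, dy) > eps else STATIONARY
```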
Extracting the Direction of Motion: Kestrel estimates the direction of motion of a vehicle on a mobile device to eliminate ambiguity. Suppose a vehicle is moving towards camera A, as shown in Figure 5, but one of the cloud-supplied paths shows that it went through a different camera, say B, located away from the vehicle's direction of motion. Since the vehicle could not possibly have gone through B, the mobile pipeline can lower that path's rank (§II-D). Extracting the direction poses two challenges: how to deal with the movement of the mobile device, and how to transform the direction of motion to a global coordinate frame.

Kestrel addresses the first challenge by compensating for camera movement. In general, to extract the direction of motion in frame coordinates, we can use the difference between the bounding box positions in consecutive frames. In our experience, however, using the optical flow in the bounding box gives us a more fine-grained direction estimate that is robust to errors in YOLO's bounding boxes. To compensate for camera movement, we subtract the optical flow of the background from the optical flow of the object bounding box. The result is the direction of the object (Figure 4).

Kestrel needs to transform the direction of motion to a global coordinate frame in order to reason about, and re-rank, other vehicle paths provided by the cloud. For an arbitrary camera orientation, the exact moving direction in global coordinates can be calculated using a homography transformation [24]. However, in most practical scenarios, cameras have small pitch and roll (i.e., no tilt towards ground or sky and no rotation; videos are often recorded in either portrait or landscape), especially across a small number of frames. So we use only the azimuth of the camera, and infer only the horizontal direction of moving objects (Figure 5). As per the Android convention (which we use to obtain our dataset), right (east) is 0° and counterclockwise is positive in camera (global) coordinates. The transformation is simply θg = γ + (θc − 90°), where θc (θg) is the direction of motion in the camera (global) coordinates, and γ is the azimuth of the camera orientation. To obtain the azimuth for mobile cameras, we use the orientation sensor on off-the-shelf mobile devices. We have a custom Android application that records metadata of the video (GPS, orientation sensor, accelerometer data, etc.) along with the video.
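The sketch below ties these two steps together using OpenCV's dense optical flow: the background flow is subtracted from the bounding-box flow to get the direction in camera coordinates, and the transformation θg = γ + (θc − 90°) maps it to global coordinates. This is our illustration of the approach, not Kestrel's implementation; the Farneback parameters are standard defaults, not values from the paper.

```python
import cv2
import numpy as np

def direction_global(prev_gray, gray, box, azimuth_deg):
    """Estimate a vehicle's horizontal direction in global coordinates."""
    # Dense optical flow between consecutive grayscale frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    x, y, w, h = box                      # detector bounding box (pixels)
    mask = np.zeros(gray.shape, dtype=bool)
    mask[y:y + h, x:x + w] = True
    box_flow = flow[mask].mean(axis=0)    # object motion + camera motion
    bg_flow = flow[~mask].mean(axis=0)    # camera motion alone
    dx, dy = box_flow - bg_flow           # camera-compensated object motion
    # Image y grows downward, so negate dy for the convention used above
    # (right/east = 0 degrees, counterclockwise positive).
    theta_c = np.degrees(np.arctan2(-dy, dx))
    return azimuth_deg + (theta_c - 90.0)  # theta_g = gamma + (theta_c - 90)
```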

IV. RELATED WORK

DNNs on mobile devices: Recent work has explored neural nets on mobile devices for audio sensing, activity detection, emotion recognition, and speaker identification [21]. These networks use only a small number of layers and are much simpler than the networks required for image recognition tasks. DeepX [23] achieves a reduced memory footprint for deep models via compression, at the cost of a small loss in accuracy. LEO [26] schedules multiple sensing applications on mobile platforms efficiently. MCDNN [11] explores cloud offloading and the on-device versus cloud execution tradeoff, but on models smaller than the ones required for our work. Finally, DeepMon [16] proposes offloading the FC layers to the GPU for low-power, lower frame rate (1-2 fps) applications, while using a tensor decomposition technique to reduce the memory footprint, at the expense of accuracy.

Object Detection: Early object detectors use deformable part models [15] or cascade classifiers [17], but perform relatively poorly compared to recent CNN-based classification-only schemes, which achieve high
However, top REFERENCE detection systems like Fast R-CNN [18] exhibit less than real-time performance even on server-class machines. YOLO is a one-shot CNN-based detection algorithm that [1]. Agent VI. http://www.agentvi.com/. [2]. Asus ZenFone AR. predicts bounding boxes and classifies objects, https://www.asus.com/us/Phone/ZenFo and we have used a mobile GPU on a deep ne-AR-ZS571KL. CNN like YOLO. [3]. Avigilon Video Analytics. http://avigilon.com/products/video- Object Tracking: Prior tracking approaches analytics/solutions/. like blob tracking [19] work well in static [4]. A.Viterbi. “Error Bounds for camera networks, but not for mobile cameras. Convolutional Codes and an Many other trackers are targeted to single Asymptotically Optimum Decoding object tracking, Algorithm”. In:IEEE Trans. Inf. Theor. 13.2 (Sept. 2006). V. CONCLUSION [5]. B.Lucas and T.Kanade. “An Iterative Image Registration Technique with an This paper explores whether it is possible to Application to Stereo Vision”. perform complex visual detection and In:IJCAI’81. tracking by leverprovements in mobile device [6]. C.Rother, V.Kolmogorov, and capabilities. Our system, Kestrel,tracks A.Blake. “"GrabCut": Interactive vehicles across a heterogeneous multi-camera Foreground Extraction Using Iterated network by carefully designing a combination Graph Cuts”. In: SIGGRAPH ’04. of vision and sensor processing algorithms to [7]. Data Acquisition Kit. detect, track, and associate vehicles across http://www.dataq.com/products/di-149/ multiple cameras. Kestrel achieves > 90% [8]. .D.Chu et al. “Balancing Energy, Latency precision and recall on vehicle path inference, and Accuracyfor Mobile Sensor Data while significantly reducing energy Classification”. In: Proc. ofSensys 11 consumption on the mobile device.Future [9]. .Lowe. “Distinctive Image Features from work includes experimenting with Kestrel at Scale-Invariant Keypoints”. In: Int. J. Comput. scale with different traffic densities and Vision 60.2(Nov. 2004). camera placement densities than the ones we [10]. E.Mark et al. “The Pascal Visual Object have explored in this paper. We anticipate that Classes (VOC) Challenge”. In: Int. J. Comput. our results will be insensitive to camera Vision 88.2 (June 2010). placement density (because our approach only [11]. Every Chicago patrol officer to wear a body detects paths through the camera network), but camera by2018. may perform differently at different traffic [12]. G.Chen, C.Parada, and G.Heigold. “Small densities because of inaccuracies in the Footprint keyword spotting using deep neural underlying computer vision tools. Other future networks”. In: Proc. work includes extending Kestrel to support more queries, and different types of objects, so of IEEE ICASSP ’14. it can be used as a general visual analytics platform. [13]. G.Nebehay and R.Pflugfelder. “Clustering of Static-Adaptive Correspondences for Deformable Object racking”. J. KAVITHA, MRS K.BALA, DR. D. RAJINIGIRINATH 7P a g e

In: CVPR ’15. IEEE. [21]. J.Redmon et al. “You Only Look Once: Unified, Real-Time Object Detection”. In: [14]. Google Maps Directions API. CVPR ’16. [15]. Google Tango[22].. J.Shi and C.Tomasi. “Good features to https://www.google.com/atap/projecttango/. track”. In: CVPR.IEEE, 1994, pp. 593–600. [16]. Google’s Pixel C[23].. K.Alex. “One weird trick for parallelizing https://store.google.com/product/pixel_c. convolutional neural networks”. In: [17]. H.Bay, T.Tuytelaars, and L.Van Gool. CoRRabs/1404.5997 (2014). “SURF: SpeededUp Robust Features”. In: [24]. Leal-Taixé et al. “MOTChallenge 2015: ECCV’06. Towards a Benchmark for Multi-Target [18]. H.Eom et al. “Machine Learning-Based Tracking”. In:arXiv:1504.01942 [cs] (Apr. Runtime Schedulerfor Mobile Offloading 2015). Lenovo Phab 2 Pro. http : / /www3. Framework”. In: Proc. ofUCC ’13. lenovo . com / us /en/smart- devices/ - lenovo- [19]. H.Marko and P.Matti. “A texture-based /phab- series /Lenovo-Phab-2- method for modeling the background and Pro. detecting moving objects”.In: IEEE Trans. on PAMI 28.4 (2006), pp. 657–662. [20]. H.Zhang et al. “Live Video Analytics at Scale with Approximation and Delay- Tolerance”. In: NSDI ’17.
