Challenges and modern machine learning-based approaches to object tracking in videos

Object Tracking Introduction

Object tracking in a video, also known as visual tracking, is one of the most challenging tasks in machine vision due to appearance changes caused by variable illumination, occlusion, and other variations (object scale, pose, articulation, etc.). Among the more complex problems in visual tracking are people tracking and traffic density estimation. The conventional algorithms developed by researchers can tackle only some of the issues in object tracking; they still do not provide a foolproof solution (Bombardelli, Gül, Becker, Schmidt, & Hellge, 2018).

Note: It is assumed that readers of this article are well aware of ML techniques used in object detection and object classification.

Video tracking applications

· Surveillance

· Security

· Traffic monitoring

· Video communication

· Robot vision

· Accident prediction/detection

Can we use object detection at every frame to track an object in a video?

This is a valid question; however, exploiting temporal information for non-static objects is a more complex task. When things move, there are occasions when an object disappears for a few frames. Object detection stops at that point. If the object reappears, substantial re-processing is required to determine whether it is the same object that was detected previously. The problem becomes harder still when there is more than one object in a video. Object detection works on a single image at a time, with no notion of the motion that occurred in past frames.
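To make the identity problem concrete, the plain-Python sketch below (our own illustrative function and box values, not from any cited work) computes the intersection-over-union (IoU) overlap that trackers commonly use to link detections across frames. When an object disappears for a few frames and reappears elsewhere, the overlap with its last known box drops to zero, so per-frame detection alone has no basis for keeping the same identity.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter else 0.0

# Detections of the same car in consecutive frames: high overlap,
# so a tracker can link them and keep one identity.
frame_t = (10, 10, 50, 50)
frame_t1 = (14, 12, 54, 52)
print(iou(frame_t, frame_t1))   # high overlap -> same object

# After an occlusion the box reappears far away: zero overlap,
# so pure per-frame detection would issue a brand-new identity.
frame_t5 = (200, 200, 240, 240)
print(iou(frame_t1, frame_t5))  # 0.0 -> link is lost without a motion model
```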

Cases where object detection would fail but object tracking would work

In a video, there are cases where the appearance of visual objects is not very clear. In such cases, detection algorithms alone will fail to detect the objects. Object tracking works in these situations because of its inclusion of motion models. Following are some cases where object tracking works while object detection fails (Tiwari & Singhai, 2017):

· Occlusion: When the object is partially or completely blocked by another object

· Scale Change: When the size of an object varies greatly, object detection will fail.

· Motion Blur: The moving object in a frame may be blurred and no longer look the same visually.

· Background Clutter: When the background has the same color or shape as the object, object detection may be erroneous.

· Viewpoint Variations: When the viewing angle of an object is changed, its shape and other features will appear differently. Thus it will become difficult for object detection algorithms to identify that object.

· Low resolution: Object detection will stop working when the image has a low resolution and is pixelated.

Modeling, components, and types of tracking algorithms

Several techniques and algorithms have been introduced to solve the complex problem of object tracking in a video. These techniques rely on either motion modeling or visual appearance modeling. Motion modeling captures the dynamic behavior of an object; these approaches include the Kanade-Lucas-Tomasi (KLT) feature tracker, optical flow, Kalman filtering, and mean-shift tracking. Motion modeling alone fails in scenarios where there is an abrupt change in the speed or direction of motion. Visual appearance modeling is more accurate, as it understands the appearance of the object being tracked. This technique is useful in single object tracking, but if the video frame contains multiple objects, using this model alone is not enough.
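As a minimal illustration of the motion-modeling family, here is a one-dimensional constant-velocity Kalman filter in plain Python. The noise constants and measurements are made-up illustrative values, not taken from any cited paper; real trackers run the same predict/update cycle on 2-D bounding box coordinates.

```python
class Kalman1D:
    """Constant-velocity Kalman filter for a single coordinate."""
    def __init__(self, pos, q=1e-3, r=0.25):
        self.x = [pos, 0.0]                 # state: [position, velocity]
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance
        self.q, self.r = q, r               # process / measurement noise

    def predict(self, dt=1.0):
        # x <- F x with F = [[1, dt], [0, 1]]
        self.x = [self.x[0] + dt * self.x[1], self.x[1]]
        (p00, p01), (p10, p11) = self.P
        # P <- F P F^T + Q (noise added on the diagonal)
        self.P = [[p00 + dt * (p10 + p01) + dt * dt * p11 + self.q, p01 + dt * p11],
                  [p10 + dt * p11, p11 + self.q]]
        return self.x[0]

    def update(self, z):
        # measurement model H = [1, 0]: we only observe position
        y = z - self.x[0]                            # innovation
        s = self.P[0][0] + self.r                    # innovation variance
        k0, k1 = self.P[0][0] / s, self.P[1][0] / s  # Kalman gain
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        (p00, p01), (p10, p11) = self.P
        # P <- (I - K H) P
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
        return self.x[0]

kf = Kalman1D(pos=0.0)
for z in [1.0, 2.1, 2.9, 4.2]:   # noisy x-positions of a moving object
    kf.predict()
    kf.update(z)
next_pos = kf.predict()           # extrapolated position for the next frame
print(next_pos)
```

The `predict` step is exactly what lets a tracker bridge frames where the detector loses the object: the filter keeps extrapolating the position from the estimated velocity until a new measurement arrives.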

A tracking algorithm has four components: target initialization, appearance modeling, motion estimation, and target positioning. Tracking algorithms come in two types: detection-based tracking and detection-free tracking. Moreover, there are single-object trackers, multi-object trackers, offline trackers, and online trackers. Offline trackers are used for recorded videos, while online trackers are used for live video streams.

Conventional object tracking algorithms can still be useful in a resource-constrained environment such as an embedded system, but machine learning-based approaches for object tracking are well ahead of conventional trackers in terms of accuracy.

Machine learning-based approaches for object tracking in a video

Supervised and unsupervised learning-based algorithms have their own benefits and downsides when applied to object detection in a video. Object detection and motion prediction without supervision, however, remain a major challenge in machine learning. In one study, a keypoint-based image representation and a stochastic dynamics model of the keypoints were used to address this challenge (Minderer et al. 2019). Modeling dynamics in the keypoint coordinate space rather than the pixel space reduced the compounding of errors. Future frames were constructed from reference frames and the predicted keypoints. The model improves upon unstructured representations for both object-level understanding of motion and pixel-level video prediction.

Following are the four key approaches for object tracking in a video using machine learning algorithms:

1 - Deep Learning in Multiple Object Tracking

2 - Convolutional Neural Network based Online Training trackers

3 - CNN based offline training trackers

4 - CNN + LSTM based detection video object trackers

1 - Deep Learning for multiple object tracking

In computer vision tasks, multi-object tracking is used to analyze videos that contain objects belonging to more than one category, such as animals, cars, pedestrians, and other objects (Ciaparrone et al. 2020). In single object tracking, the appearance of the target is known beforehand, while in multiple object tracking a detection step is necessary to identify and categorize the targets. Deep neural networks possess an excellent ability to learn rich representations and extract abstract features from their input. The vast majority of multiple object tracking algorithms use four major steps:

· Object detection (identification of objects)

· Motion prediction/feature extraction (prediction of the target's next position/appearance extraction)

· Affinity step (use of features and motion predictions to compute a distance/similarity score between detections)

· Association step (use of distance/similarity scores to associate detections and assign IDs)

In the first step, detection quality is important and can be improved by using algorithms that reduce false negatives. CNNs are used for feature extraction, and strong trackers can also use them to extract motion features. Adapting single object trackers to multiple object tracking with deep learning has shown good performance for online trackers.
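The affinity and association steps above can be sketched in a few lines of plain Python. All names and numbers here are illustrative; production systems typically solve the assignment with the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`) rather than the simple greedy loop used here.

```python
# Toy affinity + association: detections in the current frame are matched
# to existing tracks by distance to each track's predicted center.

tracks = {1: (100, 100), 2: (300, 50)}   # track ID -> predicted center
detections = [(305, 55), (98, 103)]      # detected centers this frame

def distance(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

# Affinity step: cost of pairing every track with every detection.
cost = {(t, d): distance(pos, detections[d])
        for t, pos in tracks.items() for d in range(len(detections))}

# Association step: greedily take the cheapest remaining (track, detection)
# pair, gating out pairs whose distance is implausibly large.
assignments, used_t, used_d = {}, set(), set()
for (t, d) in sorted(cost, key=cost.get):
    if t not in used_t and d not in used_d and cost[(t, d)] < 50:
        assignments[t] = d
        used_t.add(t)
        used_d.add(d)

print(assignments)   # {1: 1, 2: 0}: each track keeps its identity
```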

Another study suggests jointly accomplishing the two core tasks of multiple object tracking, i.e., object detection and re-identification (Zhang et al. 2020). Previous studies have shown that re-identification is not fairly learned and causes many identity switches. The solution proposed by Zhang et al. is based on homogeneous branches that predict pixel-wise objectness scores and re-ID features. Experiments have shown a high level of detection and tracking accuracy.

2 - Convolutional Neural Network based online training trackers

Convolutional Neural Networks (CNNs) have been successfully applied to various computer vision tasks such as semantic segmentation, image classification, and object detection. CNNs have been found to outperform other algorithms at extracting features from visual data. The problem with CNNs is that there is not enough video data (medical, engineering, general, etc.) available to train them for visual tracking. Researchers have therefore proposed a Multi-Domain Network (MDNet), which learns a shared representation of objects from multiple annotated video sequences for tracking (Nam & Han, 2016). The CNN has separate branches of domain-specific layers for object tracking, and these domains share common information obtained from all sequences in the preceding shared layers. Each domain in MDNet is trained separately in turn, and during each iteration the shared layers are updated. This method separates domain-specific information from domain-independent information and learns generic feature representations for visual tracking. Moreover, the researchers used a small number of layers. When a test sequence is given, the online tracking framework based on MDNet removes all existing classification branches, constructs a new single branch, and fine-tunes the fully connected layers within the shared layers during online tracking to adapt to the new domain.
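The multi-domain training schedule can be sketched structurally in plain Python. This is only a schematic: the "layers" below are update counters standing in for real convolutional and fully connected weight tensors, and the names are our own, not MDNet's.

```python
# Schematic sketch of MDNet-style multi-domain training: shared layers are
# updated on every iteration, while each domain-specific branch is updated
# only when a minibatch from its own video sequence is drawn.

class Layer:
    def __init__(self, name):
        self.name, self.updates = name, 0

    def step(self):
        self.updates += 1   # stands in for one gradient update

shared = [Layer("conv1"), Layer("conv2"), Layer("fc4")]    # domain-independent
branches = {d: Layer(f"fc6_domain{d}") for d in range(3)}  # one branch per video

schedule = [0, 1, 2, 0, 1, 2]   # round-robin over the annotated sequences
for domain in schedule:
    for layer in shared:
        layer.step()            # shared representation: updated every time
    branches[domain].step()     # only the active domain's branch is updated

print([layer.updates for layer in shared])            # [6, 6, 6]
print([b.updates for b in branches.values()])         # [2, 2, 2]
```

The asymmetry in the update counts is the point of the design: the shared layers accumulate domain-independent knowledge from all sequences, while each branch captures only what is specific to its own video.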

3 - CNN based offline training trackers

CNN based offline training allows an algorithm to track generic objects in real time by training on offline videos with moving objects, which previously could not be done accurately. Researchers have implemented Generic Object Tracking Using Regression Networks (GOTURN), in which a CNN is trained for tracking offline (Held, Thrun, & Savarese, 2016). The network weights are frozen at test time, which eliminates the need for online fine-tuning. Using this offline training, the tracker learns to track new objects efficiently and accurately.

The benefit of offline training over online training is that online training is too slow for practical use, while offline training prepares the CNN to track objects at 100 fps. Moreover, most object trackers use classification-based approaches, while this research uses a regression network requiring only a single feed-forward pass through the network. The network architecture for object tracking is shown in the figure below:
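The single-pass regression geometry can be sketched in plain Python. The padding factor of 2 follows the paper's search-region description; everything else here (function names, the stubbed network) is our own illustrative stand-in: in the real system a frozen CNN regresses the box from the two image crops in one forward pass.

```python
# Sketch of GOTURN-style single-pass tracking geometry.

def search_region(box, pad=2.0):
    """Search window in the current frame: the previous box scaled by `pad`."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * pad, (y2 - y1) * pad
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def stub_network(prev_crop, search_crop):
    # Placeholder for the frozen regression CNN: one feed-forward pass that
    # outputs the target box in coordinates normalized to the search region.
    return (0.3, 0.3, 0.7, 0.7)

def track_step(prev_box):
    sx1, sy1, sx2, sy2 = search_region(prev_box)
    nx1, ny1, nx2, ny2 = stub_network(None, None)
    sw, sh = sx2 - sx1, sy2 - sy1
    # Map the normalized network output back to absolute frame coordinates.
    return (sx1 + nx1 * sw, sy1 + ny1 * sh, sx1 + nx2 * sw, sy1 + ny2 * sh)

prev_box = (100, 100, 140, 140)   # box from the previous frame
print(track_step(prev_box))       # (104.0, 104.0, 136.0, 136.0)
```

Because there is no per-frame optimization, tracking speed is bounded only by this one crop-regress-uncrop cycle, which is what makes the 100 fps figure attainable.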

4 - CNN + LSTM based detection video object trackers

Another modern approach to object tracking in a video combines a CNN with Long Short-Term Memory (LSTM). Combining the two has great benefits because LSTM is good at learning historical patterns while CNN excels at processing visual information (Li, Zhao, & Zhang, 2018). Moreover, LSTM networks are computationally economical, which enables mass adoption of real-world object trackers. An example of this combined network for object tracking in video is Recurrent YOLO (ROLO, where YOLO stands for You Only Look Once). Raw frames are fed to YOLO as input. The output of YOLO is a feature vector plus bounding box coordinates. The YOLO output is the LSTM input, and the output of the LSTM is the bounding box coordinates of the tracked object, as shown in the figure below:
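To make the data flow concrete, here is a deliberately tiny LSTM cell in plain Python run over a sequence of mock YOLO outputs (a feature vector concatenated with box coordinates). The weights are collapsed to a single fixed constant, so this shows only the shapes and the recurrence, not learned behavior; all names and dimensions are our own illustrative choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyLSTM:
    """Minimal LSTM cell: input = YOLO features + box, state carried per frame."""
    def __init__(self, n_hidden, w=0.1):
        self.n_h = n_hidden
        self.w = w   # single shared constant standing in for learned weights

    def step(self, x, h, c):
        # A real cell has a weight matrix per gate acting on [x, h];
        # here every gate sees the same collapsed scalar summary.
        s = self.w * (sum(x) + sum(h))
        i, f, o = sigmoid(s), sigmoid(s), sigmoid(s)   # input/forget/output gates
        g = math.tanh(s)                               # candidate cell state
        c = [f * cj + i * g for cj in c]               # update cell state
        h = [o * math.tanh(cj) for cj in c]            # emit hidden state
        return h, c

lstm = TinyLSTM(n_hidden=4)
h, c = [0.0] * 4, [0.0] * 4
for frame in range(3):
    yolo_features = [0.5] * 8            # mock visual feature vector from YOLO
    yolo_box = [0.4, 0.4, 0.2, 0.2]      # mock detected box (x, y, w, h)
    h, c = lstm.step(yolo_features + yolo_box, h, c)

# In ROLO a learned readout maps the hidden state to the refined box;
# here we just expose the 4-dimensional hidden state as the "tracked box".
tracked_box = h[:4]
print(tracked_box)
```

The key property the recurrence provides is memory: the hidden state at frame t carries information from earlier frames, so the emitted box can stay smooth even when an individual YOLO detection is noisy or missing.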

About Clarifai

Clarifai is a leading provider of AI for unstructured image, video, and text data. It delivers deep learning to its enterprise and public sector customers through its end-to-end computer vision and NLP platform, which manages the entire AI lifecycle.

Founded in 2013 by Matt Zeiler, Ph.D., Clarifai has been a market leader in computer vision AI since winning the top five places in image classification at the 2013 ImageNet Challenge. Clarifai, headquartered in Wilmington, DE, has raised $40M from top technology investors and is continuing to grow with more than 90 employees and offices in New York City, San Francisco, Washington, D.C., and Tallinn, Estonia.

For more information, please visit www.clarifai.com .

References:

Bombardelli, F., Gül, S., Becker, D., Schmidt, M., & Hellge, C. (2018). Efficient Object Tracking in Compressed Video Streams with Graph Cuts. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP) (pp. 1–6). IEEE.

Ciaparrone, Gioele, Francisco Luque Sánchez, Siham Tabik, Luigi Troiano, Roberto Tagliaferri, and Francisco Herrera. 2020. “Deep Learning in Video Multi-Object Tracking: A Survey.” Neurocomputing 381: 61–88. https://doi.org/10.1016/j.neucom.2019.11.023.

Held, D., Thrun, S., & Savarese, S. (2016). Learning to Track at 100 FPS with Deep Regression Networks. Computer Vision – ECCV 2016, Lecture Notes in Computer Science, 749–765. Retrieved from http://davheld.github.io/GOTURN/GOTURN.html

Minderer, Matthias, Chen Sun, Ruben Villegas, Forrester Cole, Kevin Murphy, and Honglak Lee. 2019. “Unsupervised Learning of Object Structure and Dynamics from Videos.” Advances in Neural Information Processing Systems 32.

Nam, H., & Han, B. (2016). Learning Multi-domain Convolutional Neural Networks for Visual Tracking. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016-December, 4293–4302. https://doi.org/10.1109/CVPR.2016.465

Tiwari, M., & Singhai, R. (2017). A review of detection and tracking of an object from image and video sequences. Int. J. Comput. Intell. Res, 13(5), 745–765.

Zhang, Yifu, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. 2020. “FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking.” http://arxiv.org/abs/2004.01888.
