Fine-tuned Pre-trained Mask R-CNN Models for Surface Object Detection

Haruhiro Fujita, Faculty of Management and Information Sciences, Niigata University of International and Information Studies, Niigata City, Japan, [email protected]
Masatoshi Itagaki, BSN iNet Co. Ltd., Niigata City, Japan, [email protected]
Kenta Ichikawa, BSN iNet Co. Ltd., Niigata City, Japan, [email protected]
Yew Kwang Hooi, Department of Computer and Information Sciences, Universiti Teknologi PETRONAS, Bandar Sri Iskandar, Malaysia, [email protected]
Kazutaka Kawano, Department of Curatorial Research, Tokyo National Museum, Tokyo, Japan, [email protected]
Ryo Yamamoto, Department of Curatorial Planning, Tokyo National Museum, Tokyo, Japan, [email protected]

Abstract—This study evaluates road surface object detection tasks using four Mask R-CNN models, as a pre-study of surface deterioration detection of stone-made archaeological objects. The models were pre-trained on COCO datasets and fine-tuned with 15,188 segmented road surface annotation tags. The quality of the models was measured using Average Precisions and Average Recalls. The results indicate substantial counts of false negatives, i.e. "left detections" and "unclassified detections". A modified confusion matrix model that avoids prioritizing IoU is tested; there are notable true positive increases in bounding box detection, but almost no changes in segmentation masks.

Keywords—road object detection, Mask R-CNN models, TensorFlow Object Detection API, mAP, AR, confusion matrices

I. INTRODUCTION

Computer vision is a broad science field handling various types of cognitive image processing, including object detection [1]. Object detection is the task of finding different objects in an image and classifying them. It is a critical task in autonomous driving, where advanced deep learning models in conjunction with many AI sensors play major roles in controlling the system [2]. Various objects on the road surface, including potholes and severe cracks, require the vehicle to identify them and maneuver over them smoothly in real time.

R-CNN (Region Convolutional Neural Network) is a very effective method for identifying objects. Results are shown as bounding boxes around detected objects with suitable class labels. Since the introduction of R-CNN for object detection in 2014, various improved algorithms have been introduced, such as Fast R-CNN, Faster R-CNN and Mask R-CNN.

R-CNN extracts region proposals, computes the CNN features and classifies the regions accordingly. Selective search uses windows of various dimensions to determine regions based on pixel features such as color, intensity and texture. The identified regions are warped into a standard square size and parsed to a Convolutional Neural Network for feature extraction. Finally, a Support Vector Machine is used to classify the regions. Fast R-CNN, introduced in 2015 to improve the speed of R-CNN, is a complex pipeline. It requires a huge number of forward passes and separate training of three models, for image feature generation, classification and regression respectively.

Although speed has improved through optimization of the learning process, it remains a critical problem for real-time surface object detection from a moving sensor. Various existing CNN models can be studied for improving bounding box potentials for surface object detection.

This manuscript is organized as follows: a literature review of surface object detection technologies and applications for road surface object detection; the specific objectives of this study; the methods of the study; the results of testing the models; and a discussion of the profiles of each model, pinpointing critical areas for improvement.

II. LITERATURE REVIEW

A. Superpixel and Instance Segmentation

Superpixel is used as the base for instance segmentation, as a graphical preparatory clustering method. It clusters nearby pixels in terms of geometric and color spaces prior to object segmentation, using the SLIC (Simple Linear Iterative Clustering) algorithm [3, 4]; a minimal sketch follows below. This study uses superpixel as the base for instance segmentation, as defined and enhanced by the Common Objects in Context (COCO) dataset. COCO is a large image-based dataset with object segmentation, recognition in context, differentiation of objects (cars, people, houses) and stuff (roads, sky, walls), 200k labeled images, 1.5 million object instances, 80 object categories, 91 stuff categories, and 5 captions per image [5].

Instance segmentation detects an object precisely even when it overlaps other objects. However, it does not label every pixel of the image, as it segments only the regions of interest [6]. The state-of-the-art instance segmentation approach is the detection-based method, which predicts the mask for each region after acquiring the instance region [7].
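As an illustration of how SLIC-style superpixel clustering is typically invoked, the following is a minimal sketch using scikit-image's slic; the input file name and parameter values are illustrative assumptions, not the setup of this study.

```python
# Minimal superpixel sketch with SLIC (scikit-image); the input path and
# parameter values are illustrative, not the configuration of this study.
import numpy as np
from skimage import io
from skimage.segmentation import slic, mark_boundaries

image = io.imread("road.jpg")  # hypothetical road image
# n_segments sets the approximate cluster count; compactness balances
# color proximity against spatial proximity when clustering pixels.
segments = slic(image, n_segments=500, compactness=10.0, start_label=1)
overlay = mark_boundaries(image, segments)  # draw superpixel boundaries
io.imsave("road_superpixels.png", (overlay * 255).astype(np.uint8))
```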

B. Mask Region Convolutional Neural Network (Mask R-CNN)

Faster R-CNN improves Fast R-CNN towards less complex training through RoI Pooling, a technique that limits the number of unnecessary forward passes through shared computation results. However, Faster R-CNN detects objects in bounding boxes only.

Enter the Mask Region Convolutional Neural Network (Mask R-CNN), proposed by [8]. Mask R-CNN is an extended model of Faster R-CNN [9] which provides better detection of objects as segmentations. Mask R-CNN adds a segmentation mask learning sub-module (called the Mask Branch) onto Faster R-CNN. The sub-module predicts segmentation masks on each Region of Interest (RoI) [6], in parallel with classification and bounding box regression. The mask module is a small fully convolutional network (FCN) applied to each RoI, predicting a segmentation mask pixel by pixel; a schematic sketch of such a mask head follows below.
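The following is a schematic sketch (not the paper's implementation) of a small FCN mask head applied to RoI-aligned features, following the shapes commonly used with Mask R-CNN; the class count of 12 is an assumption matching this study's road object classes.

```python
# Schematic sketch of Mask R-CNN's mask branch: a small FCN applied to
# each RoI, predicting one sigmoid mask per class. Shapes follow the
# commonly used 14x14 RoI-aligned features; this is not the paper's code.
import tensorflow as tf

num_classes = 12  # assumption: the twelve road object classes of this study
roi_features = tf.keras.Input(shape=(14, 14, 256))  # RoI-aligned features

x = roi_features
for _ in range(4):  # stack of 3x3 convolutions over each RoI
    x = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(x)
# 2x upsampling, then a per-class 1x1 convolution with per-pixel sigmoids
x = tf.keras.layers.Conv2DTranspose(256, 2, strides=2, activation="relu")(x)
masks = tf.keras.layers.Conv2D(num_classes, 1, activation="sigmoid")(x)

mask_head = tf.keras.Model(roi_features, masks)  # (14,14,256) -> (28,28,12)
mask_head.summary()
```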

C. Four Mask R-CNN models in TensorFlow Object Detection API

The TensorFlow Object Detection API is an open archive of deep learning models for object detection. It contains models pre-trained on multiple datasets, including COCO. The pre-trained models generate bounding boxes around objects. From the various models available, four Mask R-CNN models, which generate masks (instance segments) of identified objects, are selected for this study [10]. A minimal inference sketch for such a model follows below.
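The sketch below shows TF1-style inference with one of these zoo models; the tensor names are the Object Detection API's standard exported names, while the model directory and input frame are assumptions.

```python
# Minimal inference sketch for a Mask R-CNN frozen graph from the TF1
# Object Detection API model zoo [10]. The model path is an assumption;
# tensor names are the API's standard exported names.
import numpy as np
import tensorflow as tf  # TensorFlow 1.x, as used in this study

PATH = "mask_rcnn_inception_v2_coco/frozen_inference_graph.pb"  # assumed

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

with tf.Session(graph=graph) as sess:
    frame = np.zeros((1, 540, 960, 3), dtype=np.uint8)  # placeholder image
    fetches = {name: graph.get_tensor_by_name(name + ":0")
               for name in ["num_detections", "detection_boxes",
                            "detection_scores", "detection_classes",
                            "detection_masks"]}
    out = sess.run(fetches,
                   {graph.get_tensor_by_name("image_tensor:0"): frame})
    print(int(out["num_detections"][0]), "objects detected")
```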

D. Road damage detection system and models

An application of object detection is road damage detection. The first reported road surface object detection models using Mask R-CNN are cited in [11], as shown in Fig. 1.

Fig. 1. Road surface object detection using one of the Mask R-CNN models

Another related work is that of Maeda et al. [12], who trained Single Shot Detection (SSD) models on their dataset captured with smartphones.

III. OBJECTIVES

The objectives of this study are to:
• evaluate several pre-trained Mask R-CNN models in the TensorFlow Object Detection API;
• fine-tune the models with annotation data of road objects;
• study the applicability of the models to road object detection.

IV. METHODOLOGY

The methodologies are the same as previously reported in [11]. To optimize the performance of road surface detection, an enhanced method was applied to fine-tune the four pre-trained Mask R-CNN models in the TensorFlow Object Detection API. The models are then evaluated using sorted annotation test data.

A. Data and Model Preparation

Refer to TABLE I. The annotation data contains 12 classes of road damage and cognitive objects in the road images, to differentiate those damages from other objects. The Visual Object Tagging Tool (VoTT) was used to annotate 1,418 road color images with a size of 3840 x 2160. All visible objects of the predefined classes in the bottom two thirds of each image were segmented and tagged.

TABLE I. ROAD OBJECT ANNOTATION DATA OBTAINED FROM 1418 ROAD IMAGES FOR MODEL FINE-TUNING

The most numerous segments were of the "Scratches on Markings" class, at a total of 3,503 segments. This is followed by "Linear Cracks" at 3,238 segments. The fewest segments were "Grid Cracks in Patchings" at 109 segments, followed by "Stains" and "Pot-holes". The segments of each class were divided into datasets for training, training validation (validation) and testing at a ratio of 0.7:0.15:0.15.

B. Computation Resources

The tests were conducted on a machine with an Intel Core i7-8700K (6 cores/12 threads/3.70GHz) CPU, 32GB CPU memory, and an Nvidia GeForce GTX 1080 GPU with 2560 CUDA cores, GPU clocks of 1607/1733 MHz, 8GB GDDR5X GPU memory and 320GB/s memory bandwidth. The programming language is Python 3.6.9, using libraries from Object Detection API version 1 on top of TensorFlow 1.15.0.

C. Methods

The main steps of the method are annotation and mask data making, followed by model training and fine tuning.

1) Annotation and Mask Data Making:

Refer to Fig. 2. The files output by the Visual Object Tagging Tool (VoTT) were converted to intermediate annotation files. Next, the files were converted into TensorFlow Record files for training, validation and testing. The conversions were done using three separate conversion scripts in Python. The last conversion script (vott2tfrecords.py) modified the original image size of 3840x2160 (width x height) to match the training image size of 960x540 used in machine learning; a simplified sketch of the record format follows below.
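The following is a simplified sketch of the kind of record such a conversion script emits, using the Object Detection API's standard feature keys; the file names, coordinates and class id are illustrative, and the instance masks that the actual conversion also stores are omitted for brevity.

```python
# Simplified sketch of one training record such as vott2tfrecords.py
# emits, using the TF1 Object Detection API's standard feature keys.
# Paths, box coordinates and labels are illustrative; the real records
# also carry instance masks ("image/object/mask"), omitted here.
import tensorflow as tf  # TensorFlow 1.x

def _bytes(v): return tf.train.Feature(bytes_list=tf.train.BytesList(value=[v]))
def _floats(v): return tf.train.Feature(float_list=tf.train.FloatList(value=v))
def _ints(v): return tf.train.Feature(int64_list=tf.train.Int64List(value=v))

with tf.gfile.GFile("road_0001.jpg", "rb") as f:  # hypothetical image
    encoded_jpg = f.read()

example = tf.train.Example(features=tf.train.Features(feature={
    "image/encoded": _bytes(encoded_jpg),
    "image/format": _bytes(b"jpeg"),
    "image/height": _ints([540]),   # resized from 2160
    "image/width": _ints([960]),    # resized from 3840
    # one illustrative "Crack1" instance, boxes in normalized coordinates
    "image/object/bbox/xmin": _floats([0.12]),
    "image/object/bbox/xmax": _floats([0.48]),
    "image/object/bbox/ymin": _floats([0.55]),
    "image/object/bbox/ymax": _floats([0.71]),
    "image/object/class/text": _bytes(b"Crack1"),
    "image/object/class/label": _ints([1]),
}))

with tf.python_io.TFRecordWriter("train.record") as writer:
    writer.write(example.SerializeToString())
```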

Fig. 2. Flow for annotation and data analysis.

2) Model Training and Fine Tuning:

Refer to Fig. 3. A pre-trained model checkpoint file, the TensorFlow Record files, the ground truth annotation labels ("tf_label_map.pbtxt") and pipeline.config are used for the training of object detection. The pre-trained model checkpoint is based on the MS COCO dataset. The TensorFlow Record files contain both the training and validation datasets. The training generated a fine-tuned checkpoint file, from which an object detection model was acquired. Finally, the fine-tuned detection model was evaluated using the validation and testing datasets. The inference results produce precision metrics, recall metrics and a confusion matrix. The training was done for each of the four Mask R-CNN models; a sketch of the training invocation follows below.

Fig. 3. Training and evaluation processes of pre-trained Mask R-CNN models.
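As an illustration of the training step, the TF1 Object Detection API is typically driven through its model_main.py entry point; the flags below are the API's own, while the config and output paths are assumptions.

```python
# Sketch of launching fine-tuning via the TF1 Object Detection API's
# model_main.py entry point. Flag names are the API's; the config and
# output paths are assumptions for illustration.
import subprocess

subprocess.run([
    "python", "object_detection/model_main.py",
    "--pipeline_config_path=configs/mask_rcnn_inception_v2_coco.config",
    "--model_dir=training/mask_rcnn_inception_v2_coco",
    "--num_train_steps=100000",   # matches the 100k steps reported below
    "--alsologtostderr",
], check=True)
```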

D. Model Evaluation Metrics

The metrics used for model evaluation in this work are mean average precisions (mAP) and average recalls (AR) at various levels of Intersection over Union (IoU); a sketch of the IoU computation follows below. The Mask R-CNN has a region proposal network layer that makes multiple inferences simultaneously on the class classification, the segmentation and the mask areas, resulting in six loss metrics. Besides the above model-wise metrics, the average precisions and the average recalls at IoU=0.5 are used for all twelve road object classes.
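For reference, a minimal sketch of the IoU computation on axis-aligned boxes is given below; the box layout (xmin, ymin, xmax, ymax) is an assumption for illustration.

```python
# Minimal IoU sketch for axis-aligned boxes given as (xmin, ymin, xmax, ymax).
def iou(a, b):
    # intersection width/height, clipped at zero when boxes do not overlap
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# A detection shifted by a quarter of the box width: IoU = 0.6
print(iou((0, 0, 100, 100), (25, 0, 125, 100)))
```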

V. RESULTS

A. Evaluation Metrics of Mask R-CNN variants

1) Inceptions v2 COCO:

TABLE II and TABLE III show the various metrics of the model on bounding boxes and segmentation masks respectively. The values of mAP(IoU=.50:.05:.95), mAP(IoU=.50) and mAP(IoU=.75) are 0.2432, 0.4382 and 0.2482 on bounding boxes, and 0.1600, 0.3257 and 0.1279 on segmentation masks respectively, with notable reductions in those metrics on segmentation masks. The Precision mAP (small) values for small objects, 0.0365 and 0.0133 on bounding boxes and segmentation masks respectively, are notably small compared with the Precision mAP (large) and Precision mAP (medium) values for large and medium objects. The Average Recalls for small, medium and large objects are 0.1166, 0.3132 and 0.4717 on bounding boxes and 0.1021, 0.2528 and 0.2732 on segmentation masks respectively. The detection precisions at IoU=.50 for our target damage classes of linear cracks (Crack1), grid cracks (Crack2), pot-holes, scratches on markings and grid cracks in patchings are 0.4085, 0.4958, 0.5714, 0.5934 and 0.4000 respectively.

TABLE II. MODEL METRICS OF MASK R-CNN INCEPTION V2 COCO ON BOUNDING BOXES

tag         [email protected]   [email protected]
Crack1      0.4085   0.2549
Crack2      0.4958   0.5842
Joint       0.3602   0.3881
Patching    0.6346   0.4286
Filling     0.4773   0.3369
Pothhole    0.5714   0.2857
Manhole     0.8298   0.7500
Stain       0.0000   0.0000
Shadow      0.3975   0.3019
Marking     0.6224   0.6162
Scratch     0.5934   0.6233
Patching2   0.4000   0.1176

Index                                    value
Precision mAP                            0.2432
Precision [email protected]                        0.4382
Precision [email protected]                        0.2482
Precision mAP (large)                    0.2953
Precision mAP (medium)                   0.1711
Precision mAP (small)                    0.0365
Recall AR@1                              0.1995
Recall AR@10                             0.3765
Recall AR@100                            0.4140
Recall AR@100 (large)                    0.4710
Recall AR@100 (medium)                   0.3132
Recall AR@100 (small)                    0.1166
BoxClassifierLoss classification loss    0.5757
BoxClassifierLoss localization loss      0.4821
BoxClassifierLoss mask loss              0.9846
RPNLoss localization loss                1.3293
RPNLoss objectness loss                  0.3359
Total Loss                               3.7076

TABLE III. MODEL METRICS OF MASK R-CNN INCEPTION V2 COCO ON SEGMENTATION MASKS

tag         [email protected]   [email protected]
Crack1      0.2285   0.1341
Crack2      0.4622   0.5446
Joint       0.1200   0.1233
Patching    0.6863   0.4545
Filling     0.1742   0.1230
Pothhole    0.5714   0.2857
Manhole     0.8298   0.7500
Stain       0.0000   0.0000
Shadow      0.1987   0.1415
Marking     0.3767   0.3704
Scratch     0.3565   0.3646
Patching2   0.6000   0.1765

Index                                    value
Precision mAP                            0.1600
Precision [email protected]                        0.3257
Precision [email protected]                        0.1279
Precision mAP (large)                    0.2012
Precision mAP (medium)                   0.1210
Precision mAP (small)                    0.0133
Recall AR@1                              0.1559
Recall AR@10                             0.2533
Recall AR@100                            0.2713
Recall AR@100 (large)                    0.2732
Recall AR@100 (medium)                   0.2528
Recall AR@100 (small)                    0.1021
BoxClassifierLoss classification loss    0.5757
BoxClassifierLoss localization loss      0.4821
BoxClassifierLoss mask loss              0.9846
RPNLoss localization loss                1.3248
RPNLoss objectness loss                  0.3363
Total Loss                               3.7035

Fig. 4 shows the confusion matrices of the model on bounding boxes (left) and segmentation masks (right). The horizontal rows represent the true label class (the ground truth class). The vertical columns represent the predicted class. Each cell contains the count for the corresponding true class and predicted class.

For example, in the left matrix of the bounding boxes, in the first horizontal row, 116 is the count for true class "Crack1" and predicted class "Crack1", i.e. linear cracks. It indicates the number of bounding boxes of true class linear cracks predicted correctly. The other cells in the same row indicate "Crack1" instances wrongly predicted as other classes.

The last count of 329 is the number of bounding boxes where ground truths of linear cracks existed but were not detected as any class. This means the model could not predict the ground truths as any class; in short, a "left detection" or False Negative (F/N). The bottom row holds the "unclassified detection" counts, i.e. bounding boxes that were detected but did not match any ground truth class at the given IoU threshold (false positives). The same applies, vice versa, in the confusion matrix of segmentation masks.

Substantial "left detection" counts are found in linear cracks, grid cracks, joints, patchings, fillings, potholes, stains, shadows and patchings with grid cracks, where the counts exceed 50% of the correct prediction counts, both in bounding boxes and segmentation masks. Substantial "unclassified detections" (where counts exceed 50% of True Positives) are found in linear cracks, grid cracks, joints, fillings, potholes, shadows and fillings with grid cracks, both in bounding boxes and segmentation masks. A sketch of reading per-class precision and recall off such a matrix follows below.

Fig. 4. Confusion Matrix of Mask R-CNN Inception v2 COCO on bounding boxes (left) and segmentation masks (right)
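To make the row/column convention concrete, the sketch below reads per-class precision and recall off a matrix laid out this way; the toy counts are illustrative except for the 116/329 Crack1 figures quoted above.

```python
# Sketch: per-class precision/recall from a confusion matrix laid out as
# in Fig. 4 (rows = true class, columns = predicted class, last column =
# left detections / false negatives, last row = unclassified detections).
import numpy as np

def per_class_metrics(cm):
    n = cm.shape[0] - 1  # number of object classes
    for c in range(n):
        tp = cm[c, c]
        precision = tp / max(cm[:, c].sum(), 1)  # column includes FP row
        recall = tp / max(cm[c, :].sum(), 1)     # row includes FN column
        print(f"class {c}: precision={precision:.4f} recall={recall:.4f}")

# Toy 2-class matrix; 116 correct "Crack1" and 329 left detections as in
# the text above, the remaining counts are illustrative.
cm = np.array([[116,  10, 329],
               [ 12,  80,  40],
               [ 25,  15,   0]])
per_class_metrics(cm)
```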

2) Inception ResNet v2 Atrous COCO:

TABLE IV and TABLE V show the mAP(IoU=.50:.05:.95), mAP(IoU=.50) and mAP(IoU=.75) of the model, which are 0.2756, 0.4678 and 0.2815 on bounding boxes, and 0.2096, 0.4038 and 0.1982 on segmentation masks respectively. The Precision mAP (small) values for small objects, 0.0776 and 0.0330, are notably small compared with the Precision mAP (large) values for large objects, 0.3393 and 0.2659, and the Precision mAP (medium) values for medium objects, 0.1853 and 0.1552, on bounding boxes and segmentation masks respectively. In comparison, there are notable reductions in all precision metrics on segmentation masks. The Average Recalls for small, medium and large objects are 0.1552, 0.3562 and 0.4705 on bounding boxes and 0.1314, 0.3106 and 0.3325 on segmentation masks respectively. The detection precisions at IoU=.50 for our target damage classes of linear cracks, grid cracks, potholes, scratches and markings are 0.3791, 0.5780, 0.4328, 0.6391 and 0.6268 on bounding boxes and 0.3571, 0.5321, 0.5455, 0.5937 and 0.5942 on segmentation masks respectively. Smaller precisions of 0.3000 and 0.2000 are found for grid cracks in patchings on bounding boxes and segmentation masks respectively, compared with those of Mask R-CNN Inception v2 COCO.

TABLE IV. MODEL METRICS OF MASK R-CNN INCEPTION RESNET V2 ATROUS COCO ON BOUNDING BOXES

TABLE V. MODEL METRICS OF MASK R-CNN INCEPTION RESNET V2 ATROUS COCO ON SEGMENTATION MASKS

See the Fig. 5 confusion matrices on bounding boxes (left) and segmentation masks (right). They indicate substantial "left detection" counts (exceeding 50% of True Positives) in all classes except grid cracks (bounding boxes only), manholes, markings (bounding boxes only) and scratches (bounding boxes only). Substantial "unclassified detection" counts (exceeding 50% of True Positives) are found in linear cracks, grid cracks, joints, fillings, stains, shadows and fillings with grid cracks.

Fig. 5. Confusion Matrix of Mask R-CNN Inception ResNet v2 Atrous COCO on bounding boxes (left) and segmentation masks (right)

3) Inception ResNet50 Atrous COCO:

TABLE VI and TABLE VII show Inception ResNet50 Atrous COCO's mAP(IoU=.50:.05:.95), mAP(IoU=.50) and mAP(IoU=.75), which are 0.2810, 0.4789 and 0.2965 on bounding boxes, and 0.2104, 0.4140 and 0.1934 on segmentation masks respectively. The mAPs for small objects, 0.0800 and 0.0452 on bounding boxes and segmentation masks, are more than twice and three times the same metric of Mask R-CNN Inception v2 COCO. Notable reductions in metrics on segmentation masks are found in both the mAPs and the ARs; however, their magnitudes are much smaller for the ARs. Notably smaller detection precisions at IoU=.50 on segmentation masks are found in the classes of linear cracks, joints, fillings and stains, which are 0.2549, 0.2653, 0.2687 and 0.2222 respectively.

TABLE VI. MODEL METRICS OF MASK R-CNN RESNET50 ATROUS COCO ON BOUNDING BOXES

TABLE VII. MODEL METRICS OF MASK R-CNN RESNET50 ATROUS COCO ON SEGMENTATION MASKS

The confusion matrices in Fig. 6 indicate substantial "left detection" (as noted in the previous two models) for linear cracks, joints, fillings, shadows and patchings with grid cracks (left detection counts exceed the correctly predicted counts). Substantial "unclassified detection" counts (which exceed 50% of True Positives) are found in linear cracks, grid cracks, joints, fillings, potholes, stains, shadows, scratches (only on segmentation masks) and patchings with grid cracks.

Fig. 6. Confusion Matrix of Mask R-CNN Inception ResNet50 Atrous COCO on bounding boxes (left) and segmentation masks (right)

4) Inception ResNet101 Atrous COCO:

TABLE VIII and TABLE IX show Inception ResNet101 Atrous COCO's mAP(IoU=.50:.05:.95), mAP(IoU=.50) and mAP(IoU=.75), which are 0.2779, 0.4787 and 0.2853 on bounding boxes, and 0.2118, 0.4036 and 0.2045 on segmentation masks respectively; the latter are the highest precisions among the four models. The Average Recalls are 0.4975, 0.3759 and 0.1665 on bounding boxes and 0.3404, 0.3382 and 0.1475 on segmentation masks for large, medium and small objects respectively.

TABLE VIII. MODEL METRICS OF MASK R-CNN INCEPTION RESNET101 ATROUS COCO ON BOUNDING BOXES

TABLE IX. MODEL METRICS OF MASK R-CNN INCEPTION RESNET101 ATROUS COCO ON SEGMENTATION MASKS

The Fig. 7 confusion matrices of the Inception ResNet101 Atrous COCO model indicate substantial "left detection", as noted in the previous models. This is observed on bounding boxes in linear cracks, joints, fillings, stains, shadows and patchings with grid cracks (left detection counts exceed the correctly predicted counts), while for patchings the count is more than 50% of True Positives. Substantial "unclassified detection" counts (which exceed 50% of True Positives) are also found on bounding boxes and segmentation masks in linear cracks, grid cracks, joints, fillings, potholes, stains, shadows, scratches (only on segmentation masks) and patchings with grid cracks.

Fig. 7. Confusion Matrix of Mask R-CNN Inception ResNet101 Atrous COCO on bounding boxes (left) and segmentation masks (right)

5) Result Summary of TensorFlow Object Detection API Models:

Comparing the results obtained for all four pre-trained Mask R-CNN models, Mask R-CNN ResNet101 Atrous COCO (model four) has shown the best precisions, with superior mean Average Precision (small), Average Recall@100, Average Recall@10 and Average Recall@100 (small). The mAPs generally reached a quick peak, but the ARs increased gradually.

From the confusion matrices of all four models, there are substantial counts of left detections, i.e. cases where the model could not fit ground truths to detected objects. These can be seen in the last horizontal row of each confusion matrix. There were substantial counts of "unclassified detections" in many classes as well.

B. APs in training progresses of four models

The progress of the mAPs of the four models up to 100k steps is shown in Figure 8. Model four showed advantageous performance in mAP(IoU=.50:.05:.95) and, in the other five mAP metrics, faster initial precision increases and sustained higher precisions up to 100k steps. Model two showed better precision in mAP(IoU=.75) from 50k to 80k steps than model four; however, the precision of model four came very close to that of model two at 100k steps. The significantly higher mAP for small objects (the bottom right graph) of model four is notable. Although model two reached higher precision at 100k steps in several metrics, model four showed better overall performance than the other models. On the other hand, model one showed the weakest performance in all six AP metrics.

C. ARs in training progresses of four models

All six Average Recall metrics (the six graphs) showed significantly higher recall rate progress for model four at most learning steps (Figure 9). The faster performance increases and the stable plateaus in all six AR metrics of model four are notable. The overwhelmingly higher recalls of model four over the other models in the six AR metrics were prominent compared with the mAPs, where some of the mAP metrics were very similar to model two, as previously noted. Special attention is drawn to the much higher AR@100 for small objects. Although the AR performance of model one is similar to its performance in APs, there is a notably weaker performance of model two in AR@100 for small objects.

Fig. 8. Mean average precisions (mAPs) of the four models: mAP, [email protected], [email protected], mAP (large), mAP (medium) and mAP (small), over 100k training steps.

Fig. 9. Average recalls (ARs) of the four models: AR@1, AR@10, AR@100, AR@100 (large), AR@100 (medium) and AR@100 (small), over 100k training steps.

Fig. 10. Loss metrics of the four models: BoxClassifierLoss classification, localization and mask losses, RPNLoss localization and objectness losses, and total loss, over 100k training steps.

D. Losses in training progresses of four models

Six loss metrics were generated, as shown in Figure 10. There is a generic tendency in all six loss metrics: all four models showed initial quick drops at 10k-20k steps, followed by gradual linear increases as the steps increase. The gradients of the linear increases vary among the four models. A notably higher total loss is found for model four.

E. Comparison of mAPs/ARs per class

The AP/[email protected] of the twelve classes are shown in Figures 11 and 12. Large fluctuations in [email protected] are found for pot-holes, stains, and grid cracks in patchings. After the initial changes, the [email protected] of patchings, fillings, grid cracks, manholes, shadows, markings and scratches are stable. Slight to moderate decreases of [email protected] are found in linear cracks and joints, and a moderate increase in shadows. Compared with the [email protected] metrics, the [email protected] metrics of all twelve classes are very stable, without the fluctuations seen in [email protected]. Except for grid cracks in patchings, the [email protected] of the other eleven classes increased moderately.

Fig. 11. AP(IoU=.50) and AR(IoU=.50) of twelve road surface object classes (Part 1: Crack1, Crack2, Joint, Patching, Filling, Pothhole), over 100k training steps.

Fig. 12. AP(IoU=.50) and AR(IoU=.50) of twelve road surface object classes (Part 2: Manhole, Stain, Shadow, Marking, Scratch, Patching2), over 100k training steps.

VI. DISCUSSION

A. Comparison of four Mask R-CNN models

For easier model identification, the models are named as follows:

• Model #1: Mask R-CNN Inception v2 COCO;
• Model #2: Mask R-CNN Inception ResNet v2 Atrous COCO;
• Model #3: Mask R-CNN Inception ResNet50 Atrous COCO;
• Model #4: Mask R-CNN Inception ResNet101 Atrous COCO.

It is of interest to describe the performance of the four models before and after the road annotation training. Refer to TABLE X. After the 200k labeled image training of the COCO datasets, the highest mAP (IoU=.50:.05:.95) is found in model #2, with a value of 0.36. This is followed by model #4 at 0.33. Model #3 achieves 0.29, and the least performing model is model #1 at 0.25. Substantial differences are observed in detection performance among the four Mask R-CNN models.

TABLE X. mAP(IoU=.50:.05:.95)*100 PERFORMANCE

After the relatively small dataset training on the road annotations under 12 classes, the differences in mAP performance are much smaller than those without road object training. The highest was recorded for model #2, but the difference from the second highest, model #4, is negligible, and the overall performance is much higher for model #4. The performance reductions in model #3 and model #1 were also not as large as those found without road annotation training.

The detection precision metrics of mAP (IoU=.50:.05:.95) were smaller in the models trained on road annotation data than in those without such training, although a direct comparison is not adequate. However, the objects of the twelve road surface classes are, to human eyes, not as cognitively distinct as those in the COCO datasets. It is therefore reasonable to conclude that the performance of the models after road annotation training is not so bad, considering the severity of mAP (IoU=.50:.05:.95), which is the primary challenge metric in COCO [13].

The overall model architectures are almost the same in all four models, with a difference in the Feature Extractor. Model #1 can be classified as an Inception model, while the other three models are combinations of Inception and ResNet, where a Residual Connection was added on top of the Inception architecture [14].

It is notable that the strides of the Feature Extractor and Anchor Generator of model #1 are both 16, while those of the other models are 8. It is also noted that the initial crop size of the Region Proposal Network in model #2 is 17, while those of the other models are 14. There is a tendency that smaller parameter sizes of the above modules bring the capability to detect smaller objects. Those parameters are the same in model #3 and model #4; however, the higher layer count of model #4 is more advantageous than the smaller layer count of model #3.

Therefore, it can be summarized that the smaller strides of the Feature Extractor and Anchor Generator, as well as the smaller initial crop size of the Region Proposal Network, together with more layers, made model #4 the best performer, with a special advantage in small object detection. These parameters appear in each model's pipeline.config, as sketched below.
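The fragment below is an illustrative sketch for model #4 (text protobuf of the TF1 Object Detection API), with values following the description above rather than copied from this study's configuration files.

```
# Illustrative pipeline.config fragment (TF1 Object Detection API text
# protobuf) showing where the discussed parameters live; values follow
# the description of model #4 above.
model {
  faster_rcnn {
    num_classes: 12
    feature_extractor {
      type: "faster_rcnn_resnet101"
      first_stage_features_stride: 8    # 16 in model #1
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        height_stride: 8                # 16 in model #1
        width_stride: 8
      }
    }
    initial_crop_size: 14               # 17 in model #2
  }
}
```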

In the confusion matrices of all four models, there are substantial counts with no class prediction, namely "leaving judgement" counts or false negatives. In the last horizontal row there are substantial counts that have no class definition, although the horizontal rows should hold the true classes, or ground truths. This enigma might be associated with the threshold settings, which directly influence the cut-off of bounding box counts during the calculation.

B. IoU prioritized calculation problems of confusion matrix

A confusion matrix is a cross table of class detection counts in its columns and class ground truth counts in its rows. The diagonal cells of each corresponding class are true positives, i.e. correctly detected and classified instances. The bottom end of each column holds counts of false positives, and the right end of each row holds counts of false negatives.

However, there is a fundamental problem with the confusion matrix for regional object detection tasks, in which models are supposed to detect both a correct region and a correct object class simultaneously. Regional detection always sets an IoU (Intersection over Union) threshold on the extent of overlap between detected and ground truth areas, both for bounding boxes and for segmentation masks. Therefore the "correct detection" varies with the threshold variable. This study adopts IoU=0.5 as both thresholds for the AP and AR metrics, according to the COCO detection standard.

Refer to Fig. 13 for IoU examples. The red color is a ground truth area, and the green color is a detected area at each IoU level. The COCO standard's IoU=0.5 requires a substantial overlap between the ground truth and detected areas.

Fig. 13. Examples of various IoUs (0.3~0.6)

However, some regional object detection tasks do not require such a high standard; an overlap of IoU=0.3 or IoU=0.2 might be quite enough. Hence, the "regional correct detection" does not have a uniform standard.

The most commonly used regional detection algorithm, created by Santiago Valdarrama and prioritizing the IoU, conducts the following steps [15, 16]:

1. Calculate the IoU between ground truths and detected areas, and make a list of detection candidates over the threshold.
2. List them in descending IoU order.
3. If there are several detection candidates over a single ground truth, discard all candidates but the one with the maximum IoU.
4. List the remaining candidates in descending order.

5. If a single detection candidate matches multiple ground truths, discard all candidates but the one with the maximum IoU.
6. Check the ground truths one by one.
7. If a ground truth matches any detection candidate, count it up in the confusion matrix cell of the corresponding ground truth and detection class (true positive or false positive).
8. If a ground truth does not match any detection candidate, count it up in the right end cell of the corresponding ground truth class (false negative).
9. Check the detection candidates one by one.
10. If a candidate does not match any ground truth, count it up in the bottom row of the corresponding detection class (false positive).

Our modified confusion matrix algorithm, created by Masatoshi Itagaki, is as follows (a condensed sketch in code follows the list):

1. Sort the detection candidates in descending confidence score order.
2. Pick one ground truth; if it has previously been matched with a detection candidate, pick the next ground truth.
3. Pick one detection candidate and calculate the IoU.
4. If the ground truth matches detection candidates of the same class, select the one with the highest IoU as a match.
5. If the ground truth has no detection candidates of the same class, select the candidate of another class with the highest IoU as a match.
6. If a detection candidate matches more than a single ground truth instance, select the ground truth with the higher IoU, prioritizing the class.
7. Repeat steps 2 to 6 until no more changes take place.
8. Matches are counted up in the corresponding class ID cells of the confusion matrix (as true positives and false positives).
9. If a ground truth has no matching detection candidate, count it up in the right end column cell of the corresponding ground truth class as a false negative.
10. If a detection candidate has no matched ground truth, count it up in the bottom row cell of the corresponding class as a false positive.
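A condensed sketch of this matching in code is given below; the data layout (per-image lists of class/score/box tuples) and the iou() helper repeat the earlier sketch and are assumptions for illustration, not the authors' implementation.

```python
# Condensed sketch of the modified confusion-matrix counting above:
# detections sorted by confidence, same-class matches preferred over pure
# IoU ranking. Inputs: gts = [(class_id, box)], dets = [(class_id, score,
# box)], boxes as (xmin, ymin, xmax, ymax). Not the authors' code.
import numpy as np

def iou(a, b):  # as in the earlier IoU sketch
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter)
    return inter / union if union > 0 else 0.0

def modified_confusion_matrix(gts, dets, num_classes,
                              iou_thr=0.5, score_thr=0.5):
    # Layout as in Fig. 4: row = true class, column = predicted class,
    # last column = false negatives, last row = false positives.
    cm = np.zeros((num_classes + 1, num_classes + 1), dtype=int)
    dets = sorted((d for d in dets if d[1] >= score_thr),
                  key=lambda d: -d[1])             # step 1: by confidence
    used = set()
    for g_cls, g_box in gts:                       # steps 2-3
        cands = [(i, d_cls, iou(g_box, d_box))
                 for i, (d_cls, _, d_box) in enumerate(dets)
                 if i not in used]
        cands = [c for c in cands if c[2] >= iou_thr]
        same = [c for c in cands if c[1] == g_cls]
        pool = same if same else cands             # steps 4-6: class first
        if pool:
            i, d_cls, _ = max(pool, key=lambda c: c[2])
            used.add(i)
            cm[g_cls, d_cls] += 1                  # step 8: TP / misclass
        else:
            cm[g_cls, num_classes] += 1            # step 9: false negative
    for i, (d_cls, _, _) in enumerate(dets):
        if i not in used:
            cm[num_classes, d_cls] += 1            # step 10: false positive
    return cm
```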

The comparison of the conventional and modified calculation methods is made at thresholds of IoU=0.5 and confidence score=0.5, for bounding boxes and segmentation masks. The results are shown in Figure 14 and Figure 15.

Fig. 14. Conventional (left) and modified (right) confusion matrices of bounding boxes

The increase of true positive counts in most classes, shown in the diagonal cells of the modified matrix (right), is notable. Slight decreases are found in the false positives (the counts remaining after the true positives and the right end column counts). The false negative counts at the right end column, which are the ground truth counts left over from detection at those thresholds, show few changes, except for a reduction of ten counts in the scratch class. The modified calculation method seems to reduce unnecessary rejections of true positives and false positives.

Refer to TABLE XI. There are notable increases in both precision and recall, except for the patching, pothole, manhole and stain classes.

TABLE XI. CLASS METRICS BY CONVENTIONAL (LEFT) AND MODIFIED (RIGHT) CONFUSION MATRICES OF BOUNDING BOXES

            Conventional          Modified
category    [email protected]  [email protected]    [email protected]  [email protected]
Crack1      0.3861  0.3055      0.4000  0.3165
Crack2      0.4962  0.6535      0.5259  0.7030
Joint       0.4624  0.3653      0.5172  0.4110
Patching    0.6154  0.5195      0.6154  0.5195
Filling     0.5195  0.4278      0.5325  0.4385
Pothhole    0.5000  0.5000      0.5000  0.5000
Manhole     0.9111  0.7885      0.9111  0.7885
Stain       0.1538  0.1667      0.1538  0.1667
Shadow      0.4889  0.4151      0.5389  0.4575
Marking     0.6440  0.5421      0.6960  0.5859
Scratch     0.6162  0.6858      0.6372  0.7135
Patching2   0.5000  0.2353      0.6250  0.2941

The same comparisons of the conventional and modified methods for segmentation masks, in confusion matrices and class metrics, are made in Figure 15 and TABLE XII. The results show almost no changes between the calculation methods, both in the confusion matrices and in the class metrics. Very slight increases in metrics are found in the grid crack, filling, marking and scratch classes, but very slight decreases in the linear crack and shadow classes.

Fig. 15. Conventional (left) and modified (right) confusion matrices of segmentation masks

In the modified calculation algorithm, if a ground truth does not match a detection candidate above the IoU threshold within its own class, but only above the threshold in another class, it will be counted as a false negative. The lower performance on segmentation masks with the modified calculation might be associated with the smaller sizes of segments compared with bounding boxes, which leads to fewer such false negative counts.

TABLE XII. CLASS METRICS BY CONVENTIONAL (LEFT) AND MODIFIED (RIGHT) CONFUSION MATRICES OF SEGMENTATION MASKS

            Conventional          Modified
category    [email protected]  [email protected]    [email protected]  [email protected]
Crack1      0.3593  0.2835      0.3583  0.2835
Crack2      0.5038  0.6634      0.5113  0.6733
Joint       0.2733  0.2146      0.2775  0.2192
Patching    0.6308  0.5325      0.6308  0.5325
Filling     0.3182  0.2620      0.3247  0.2674
Pothhole    0.5000  0.5000      0.5000  0.5000
Manhole     0.9111  0.7885      0.9111  0.7885
Stain       0.1538  0.1667      0.1538  0.1667
Shadow      0.3143  0.2594      0.3056  0.2594
Marking     0.6400  0.5387      0.6560  0.5522
Scratch     0.5523  0.6146      0.5601  0.6233
Patching2   0.5000  0.2353      0.5000  0.2353

VII. CONCLUSION

This paper reports the final evaluation results obtained with the test datasets and the modified confusion matrix methodologies. There are substantial counts of left detections, as well as unclassified detections.

A modified confusion matrix model that avoids prioritizing IoU was tested. There are notable true positive increases in bounding box detection, but almost no changes were found in segmentation masks.

The lower performance on segmentation masks with the modified calculation might be associated with the smaller sizes of segments compared with bounding boxes. This necessitates further research.

ACKNOWLEDGMENT

The authors owe thanks to Niigata University of International and Information Studies for funding support of the "Mobile Sensing Development by Deep Learning" project in 2020. Special thanks are due to Fukuda Road Co. Ltd for its provision of road images for the annotation works, as well as for the provision of information on Multi Fine Eye. Finally, the authors owe much to Tsuyoshi Ikenori for his work on the annotation data.

REFERENCES

[1] Szeliski, R.: 'Computer Vision: Algorithms and Applications' (Springer, 2010)
[2] Kaehler, A., and Bradski, G.: 'Learning OpenCV 3' (O'Reilly, 2016)
[3] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and Süsstrunk, S.: 'SLIC Superpixels Compared to State-of-the-art Superpixel Methods', IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34, (11), pp. 2274-2282
[4] https://techblog.nhn-techorus.com/archives/7793
[5] https://cocodataset.org/#home
[6] Elsenbroich, C., Kutz, O., and Sattler, U.: 'A Case for Abductive Reasoning over Ontologies'. Proc. OWLED*06 Workshop on OWL: Experiences and Directions, Athens, Georgia, USA, 2006
[7] https://engineer.dena.com/posts/2019.08/cv-papers-19-segmentation/
[8] He, K., Gkioxari, G., Dollár, P., and Girshick, R.: 'Mask R-CNN', arXiv preprint arXiv:1703.06870, 2017
[9] Ren, S., He, K., Girshick, R., and Sun, J.: 'Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks', arXiv preprint arXiv:1506.01497, 2015
[10] https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf1_detection_zoo.md
[11] Fujita, H., Itagaki, M., Ichikawa, K., Hooi, Y.K., Kawahara, K., and Sarlan, A.: 'Fine-tuned Surface Object Detection Applying Pre-trained Mask R-CNN Models', International Conference on Computational Intelligence 2020, IEEE Conference Proceedings, 978-1-5386-5541-2/18
[12] Maeda, H., Sekimoto, Y., Seto, T., Kashiyama, T., and Omata, H.: 'Road Damage Detection and Classification Using Deep Neural Networks with Smartphone Images', Computer-Aided Civil and Infrastructure Engineering, 2018
[13] https://blog.zenggyu.com/en/post/2018-12-16/an-introduction-to-evaluation-metrics-for-object-detection/
[14] https://qiita.com/yu4u/items/7e93c454c9410c4b5427#fn21
[15] Confusion Matrix in Object Detection with TensorFlow, https://towardsdatascience.com/confusion-matrix-in-object-detection-with-tensorflow-b9640a927285
[16] https://github.com/svpino/tf_object_detection_cm