Journal of Imaging

Article

A Dataset of Annotated Omnidirectional Videos for Distancing Applications

Giuseppe Mazzola, Liliana Lo Presti, Edoardo Ardizzone and Marco La Cascia *

Dipartimento di Ingegneria, Università degli Studi di Palermo, 90128 Palermo, Italy; [email protected] (G.M.); [email protected] (L.L.P.); [email protected] (E.A.)
* Correspondence: [email protected]

Abstract: Omnidirectional (or 360°) cameras are acquisition devices that, in the next few years, could have a big impact on video surveillance applications, research, and industry, as they can record a spherical view of a whole environment from every perspective. This paper presents two new contributions to the research community: the CVIP360 dataset, an annotated dataset of 360° videos for distancing applications, and a new method to estimate the distances of objects in a scene from a single 360° image. The CVIP360 dataset includes 16 videos acquired outdoors and indoors, annotated by adding information about the pedestrians in the scene (bounding boxes) and the distances to the camera of some points in the 3D world by using markers at fixed and known intervals. The proposed distance estimation algorithm is based on geometric facts regarding the acquisition process of the omnidirectional device, and is uncalibrated in practice: the only required parameter is the camera height. The proposed algorithm was tested on the CVIP360 dataset, and empirical results demonstrate that the estimation error is negligible for distancing applications.

Keywords: omnidirectional cameras; 360°; video dataset; depth estimation; distancing; video surveillance; tracking; pedestrian; equirectangular projection; spherical images

1. Introduction

Omnidirectional (or 360°) cameras have gained popularity in recent years, both for commercial applications and scientific goals. These devices can record a spherical view of an environment from every perspective, unlike traditional cameras that have a pre-set field of view. Commercial devices are now relatively cheap (a few hundreds of euros) and easy to use, as the acquisition process is fully automatic and there is no need to find the "right" perspective to take a picture from.

In commercial applications, users can typically interact with the recorded video, change the point of view, control the perspective, and feel surrounded by the environment. Moreover, 360° pictures and videos are available and usable on the most famous social networks (e.g., YouTube) and can also be watched, besides using regular 2D devices (laptops, mobile phones, and tablets), by using specific devices such as head-mounted viewers (e.g., the Oculus Rift) that help to improve the sense of immersivity of the users. From the scientific point of view, because of their features, omnidirectional devices find applications in virtual and augmented reality [1,2], mobile robotics [3,4], and video surveillance [5–7], which is also the topic of our work.

In this paper, we will present two new contributions to the research community: an annotated dataset of 360° videos that can be utilized for distancing problems, and a new approach to estimate the distances of objects to the camera. In the second section, we will describe the features of omnidirectional devices, showing their strengths and why they can be so convenient for video surveillance applications. The third section will discuss the state-of-the-art datasets and techniques for depth estimation problems. In the fourth section, we will present our CVIP360 dataset and our distance estimation algorithm. The fifth section will describe the experimental results, and a conclusion section will end the paper.



2. Omnidirectional Cameras

Omnidirectional (or 360°) camera devices can acquire videos and panoramic images with a view (typically) of 360° horizontally and 180° vertically, resulting in a complete representation of the environment, and are particularly suitable for video surveillance, mobile robotics, and cultural heritage applications.

The first 360° devices were made from a traditional video camera filming a scene reflected on a mirror (usually of paraboloid shape) [8]. By knowing the mirror shape, it was possible to correct the distortion and reconstruct the original scene. These systems (reflectors) are not exactly omnidirectional (at least not vertically) because it is impossible to film everything that is above the mirror, and for this reason cameras must be placed high up inside the location. In recent years [9], the newest omnidirectional devices combine multiple calibrated cameras with partially overlapping fields of view. Each camera can shoot part of the scene, and the final image is reconstructed by stitching algorithms. These systems (diopters) are expensive because they need more calibrated cameras, but can return high-resolution images and videos, albeit through computationally expensive reconstruction algorithms. On the other hand, the most recent devices typically use a system of two wide-angle lenses, each of which can shoot (more than) half of the scene. The entire panorama is reconstructed via software through stitching algorithms after correcting the distortion introduced by the lenses, if their shape is known. Since the system consists of only two cameras, it is relatively inexpensive, and the stitching process is much simpler and more efficient than when multiple cameras are used (Figure 1).

Figure 1. A scene acquired by a dual-lens 360° camera (Gear 360 by Samsung, Seoul, Korea) before (top) and after (bottom) stitching. The bottom figure is the equirectangular projection of the spherical acquisition. "Quattro Canti", Palermo.


Regardless of the type of device, pixels of the images are mapped onto a sphere centered at the camera. To be displayed on the appropriate output devices (monitors, mobile phones, and head-mounted devices), equirectangular and cubic projections are often adopted [10]. Cubic projections allow images to be displayed on a monitor or a viewer by mapping spherical points onto the planes tangent to the sphere; often, a graphical interface allows the cube to be browsed. The equirectangular projection represents the whole sphere in a single image (2:1 ratio), in which the central row is the equator, and the upper and lower rows are the poles. Each row of the image corresponds to the intersection between the sphere and a plane parallel to the horizontal plane of the camera. This projection introduces a certain distortion, which becomes more visible approaching the poles. Similarly, each column of the image corresponds to the intersection between the sphere and a plane perpendicular to the horizontal one, passing through the poles and rotated by a given angle with respect to the polar axis.
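To make the geometry concrete, the following sketch (our own illustrative Python, not code from the paper; function names are ours) converts between pixel coordinates of a W × H equirectangular image and longitude/latitude angles on the sphere, following the convention described above (central row = horizon/equator, top and bottom rows = poles).

```python
def pixel_to_angles(x, y, width, height):
    """Map an equirectangular pixel to (longitude, latitude) in degrees.

    Longitude spans [-180, 180] left to right; latitude spans [90, -90]
    top to bottom, so the central row (y = height/2) is the horizon.
    """
    lon = (x - width / 2.0) / (width / 2.0) * 180.0
    lat = (height / 2.0 - y) / (height / 2.0) * 90.0
    return lon, lat


def angles_to_pixel(lon, lat, width, height):
    """Inverse mapping from (longitude, latitude) in degrees to pixel coordinates."""
    x = lon / 180.0 * (width / 2.0) + width / 2.0
    y = height / 2.0 - lat / 90.0 * (height / 2.0)
    return x, y


if __name__ == "__main__":
    # 4K equirectangular frame (2:1 ratio), as in the videos described later.
    W, H = 3840, 2160
    print(pixel_to_angles(W / 2, H / 2, W, H))   # (0.0, 0.0): image center lies on the horizon
    print(pixel_to_angles(0, 0, W, H))           # (-180.0, 90.0): top-left corner is at the pole seam
```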

3. Related Work

In this section, we will present the existing datasets of omnidirectional videos and the state-of-the-art techniques for depth estimation applications. In particular, the first subsection will focus on datasets for people-tracking applications, and the second one on datasets for depth estimation, together with the related works in the literature.

3.1. 360° Video Datasets

Very few annotated datasets acquired by omnidirectional (360°) cameras are currently publicly available. Most of them are used to study the head movements of users watching 360° videos during immersive experiences; thus, annotations are generally provided for the users' movements, attention, and eye gaze. A taxonomy of such datasets can be found in Elwardy et al. [11]. In some cases, 360° data is simulated from standard datasets. An example is the Omni-MNIST and FlyingCars datasets used in SphereNet [12].

In the field of visual tracking, recent works have adapted tracking techniques to 360° videos. Unfortunately, experimental results are often presented on private video collections or on YouTube videos for which annotations of the targets are not shared, for instance Delforouzi et al. [13]. In other cases, novel videos are instead acquired/downloaded from the internet and manually annotated. Mi et al. [14] propose a dataset including nine 360-degree videos from moving cameras. Three of them are captured on the ground and six of them are captured in the air by drones. Interesting targets such as buildings, bikers, or boats, one in each video, are manually marked in each frame as the ground truth for tracking. In Liu et al. [15] two datasets, BASIC360 and App360, have been proposed. BASIC360 is a simple dataset including seven videos acquired at 30 fps with a number of frames ranging between 90 and 408. In these videos, only two people are annotated. In App360 there are five videos of a length between 126 and 1234 frames and a number of annotated humans between four and forty-five. Data are acquired by a Samsung Gear 360. In Yang et al. [16], a multi-person panoramic localization and tracking (MPLT) dataset is presented to enable model evaluation for 3D panoramic multi-person localization and tracking. It represents real-world scenarios and contains a crowd of people in each frame. Over 1.8 K frames and densely annotated 3D trajectories are included. The dataset in Chen et al. [17] includes three videos, two acquired outdoors and one indoors. In the first two videos, there are five pedestrians. The dataset includes annotations for their full body (3905 bounding boxes) and, for some of them, their torsos. In the third video, only the full body of one pedestrian is annotated. Additionally, in this video, five objects, three heads, and six torsos are annotated. In the videos, pedestrians move randomly, alone or in groups, around the camera. All videos are characterized by strong illumination artifacts and a cluttered background. Annotations are provided in terms of bounding boxes detected on cubic projections. The dataset has been proposed to evaluate PTZ trackers where the PTZ camera is simulated based on the 360° videos.

Somewhat different is the dataset proposed by Demiröz et al. [18], where 360° cameras are mounted on the ceiling of a room and a sidewall; thus, there is no advantage in using a 360° camera to acquire the whole environment. Many datasets from fisheye cameras are available in the literature, especially for autonomous driving. For instance, WoodScape, presented by Yogamani et al. [19], is a dataset for 360° sensing around a vehicle using four fisheye cameras (each with a 190° horizontal FOV). WoodScape provides fisheye image data comprising over 10,000 images containing instance-level semantic annotation. However, to obtain a 360° view of the environment, several cameras are placed around the vehicle, which does not allow a unique sphere representing the environment and following the pinhole camera model to be obtained.

Table 1 summarizes the features of the discussed datasets, considering only those designed for visual tracking in 360° videos. Note that not all the features are available for all the datasets, as many of them are not (or are no longer) publicly available. With respect to those for which we have the complete information, our CVIP360 dataset is much larger, both in terms of the number of frames and the number of annotated pedestrians; furthermore, the videos have a higher resolution.

Table 1. Summary of the features of the state-of-the-art 360° datasets for visual tracking.

Name | Year | N° Videos | N° Frames | N° Annotations | Resolution | Availability
Delforouzi et al. | 2019 | 14 | n/a | n/a | 720 × 250 | Private
Mi et al. | 2019 | 9 | 3601 | 3601 | 1920 × 1080 | Public
Liu et al. | 2018 | 12 | 4303 | n/a | 1280 × 720 | Private
Yang et al. | 2020 | n/a | 1800 | n/a | n/a | Private
Chen et al. | 2015 | 3 | 3182 | 16,782 | 640 × 480 | Public
CVIP360 (ours) | 2021 | 16 | 17,000 | 50,000 | 3840 × 2160 | Public

Furthermore, differently from these datasets, our novel dataset focuses on distancing applications. It can also be used for tracking applications since pedestrians have been fully annotated but, differently from the above datasets, it also provides annotations of markers to assess the computation of pedestrians' distances to the camera. Omnidirectional videos may also be used in other computer vision research fields, such as face recognition. With respect to traditional techniques [20–22], the equirectangular projection introduces position-dependent distortions which must be considered when devising specific algorithms. This is a very new open issue, but some works have already been published [23,24]. To the best of our knowledge, there are no available datasets of videos acquired by an omnidirectional camera network (that is one of our future works), which are necessary in object re-identification applications [25].

3.2. Depth Estimation

Depth estimation has been a well-known research field of computer vision since the early 1980s [26], and has applications in various domains: robotics, autonomous driving, 3D modeling, scene understanding, augmented reality, etc. Classical approaches are typically divided into two categories: active methods, which exploit sensors (e.g., the Microsoft Kinect) to build a map of the depths of the points in a scene, and passive methods, which use only visual information. Active methods have higher accuracy but cannot be used in all environments and are more expensive. Passive methods are less accurate but have fewer constraints and are cheaper. Passive methods can be roughly divided into two subcategories, according to the hardware devices: stereo-based [27], in which a couple of calibrated or uncalibrated cameras is used to shoot a scene from two different points of view, and monocular [28], in which depth information is extracted from a single image. Some of the newest approaches use deep neural networks on monocular images [29–34].

Regarding 360° images and videos, some new works have been published in the last few years. Wang et al. [35] propose a trainable deep network, 360SD-Net, designed for 360° stereo images. The approaches of Jiang et al. [36] and Wang et al. [37] fuse information extracted from both the equirectangular and the cube map projections of a single omnidirectional image in a fully convolutional network framework. Jin et al. [38] propose a learning-based depth estimation framework based on the geometric structure of the scene. Im et al. [39] propose a practical method that extracts a dense depth map from the small motion of a spherical camera. Zioulis et al. [40] first proposed a supervised approach for depth estimation from single indoor images and then extended their method to stereo 360° cameras [41].

In Table 2 we show the features of the discussed datasets. With respect to the others, even if CVIP360 is a little smaller than two of them, CVIP360 is the only one that also includes videos of outdoor scenes and annotations of moving objects (pedestrians).

Table 2. Summary of the features of the state-of-the-art 360° datasets for depth estimation.

Name | N° 360° Images | Synthetic/Real | Environment | Pedestrians
Matterport3D | 10,800 | Real | Indoors | No
Stanford2D3D | 1413 | Real | Indoors | No
3D60 | 23,524 | Real/Synthetic | Indoors | No
PanoSUNCG | 25,000 | Synthetic | Indoors | No
CVIP360 (ours) | 17,000 | Real | Indoors/Outdoors | Yes

Almost all these methods are based on deep network techniques, and therefore need proper and solid datasets to train their frameworks. In the following sections, we will present a new annotated dataset of 360° videos that can be used for training supervised approaches, and an unsupervised algorithm to estimate the depths of the points in a scene.

360° Datasets for Depth Estimation

Only a few existing datasets of omnidirectional videos for depth estimation applications are available to the research community. Matterport3D [42] is a dataset of 10,800 panoramic views of real indoor environments (with no pedestrians), provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentation, designed for scene-understanding applications. Views are acquired by a rotating tripod system to build cylindrical projections. Stanford2D3D [43] includes 70,496 regular RGB and 1413 equirectangular RGB images, along with their corresponding depths, surface normals, semantic annotations, global XYZ OpenEXR formats, and camera metadata. Images of indoor environments, without persons, are acquired with the same setting as Matterport3D and projected on a cylindrical surface. The 3D60 [40] dataset is created by rendering the two previous realistic datasets and two synthetic datasets, SceneNet [44] and SunCG [45], resulting in 23,524 unique viewpoints, representing a mix of synthetic and realistic 360° color (I) and depth (D) image data in a variety of indoor contexts. Even in this case, only environments and no pedestrians are acquired. PanoSUNCG [46] is a purely virtual dataset rendered from SunCG. This dataset consists of about 25k 360° images of 103 indoor scenes (with no pedestrians), panorama depth, and camera pose ground truth sequences. Table 2 summarizes the features of the discussed datasets.

4. CVIP360 Dataset

This section presents our CVIP360 dataset, which is freely available at the link [47]. The dataset is designed for testing tracking and distancing algorithms in omnidirectional videos. Our dataset includes 16 real videos, captured by a Garmin VIRB 360 omnidirectional device [48] in 4K (resolution 3840 × 2160) and with a frame rate of 25 fps. Ten of these videos have been shot indoors and six outdoors, to capture different luminance and topological conditions (Figure 2). The videos last from 15 to 90 s and show two to four people. Overall, the dataset includes 17,000 equirectangular frames and over 50,000 annotated bounding boxes of pedestrians (the product of the number of frames in a video and the number of persons). Each video has been annotated according to two different strategies:
- We manually labeled all the pedestrians' locations in the equirectangular frames by using bounding boxes.
- We assigned to each pixel of the equirectangular frames a "real" distance to the camera's location, according to a methodology described below in this section. In practice, we annotated only the image portion representing the area under the horizon. This area is related to the ground where people walk, as further explained in this section.

Figure 2. Two frames from the videos of the CVIP360 dataset. Outdoors (top) and indoors (bottom).

Note that in equirectangular images the horizon becomes a straight line across the middle of the image, and it is the projection of the camera horizontal plane onto the sphere. To our knowledge, this is the first dataset of omnidirectional videos of indoor and outdoor scenes, containing both information about the depths and the annotations of moving pedestrians, designed for both distancing and video surveillance applications.

4.1. Distancing Annotation

Before capturing the scene with our device, we put some markers onto the ground at fixed and known distances to the camera. For the outdoor videos, we put 10 markers, spaced one meter apart starting from the camera, at 0° (the direction in front of one of the two lenses) and 10 more markers at 90° (orthogonally, where both of the two lenses acquire information). In our device, each of the two lenses has a field of view of 201.8°, to avoid possible "blind spots" in the acquisition process. Finally, in the resulting stitched image the information around 90° and 270° is reconstructed by blending the two images coming from the camera.

Due to the acquisition and projection processes, all points of a horizontal plane equally distant to the camera in the 3D world belong to the same circle on that plane and correspond to horizontal lines in the equirectangular image (Figure 3). In our experimental setting, markers on the ground lie on a horizontal plane. Theoretically, we just needed to place markers along one direction, e.g., in front of the camera, since distances in the other directions can be inferred. We decided to also place markers along the orthogonal direction to measure distortions due to non-perfect camera alignment (in terms of pitch and roll). In such a case, the ground cannot be considered parallel to the camera horizontal plane, breaking our horizontality hypothesis. As a result, two markers at the same distance to the camera must match pixels in the same row of the equirectangular frame (except for human errors in positioning them onto the ground, different ground levels, etc.). Small deviations from 0° in the roll or pitch camera angles may shift the marker coordinates in the images upwards or downwards by tens of pixels. These distortions have been corrected in post-processing by the VR distortion correction tool of the Adobe Premiere CC commercial software, to ensure the "horizontality" condition. The above procedure yields the pixel coordinates of these markers, at fixed and known distances, in the equirectangular images.

Figure 3. Projection of a circle onto an equirectangular image: in the case where the circle is on a plane parallel to the camera horizontal plane (top), and in the case where the two planes are not parallel (bottom). Note that in the second case points that belong to the same circle relate in the equirectangular image to a sinusoid-like curve.



On the other hand, for a more precise estimation of these distortions, in the case of indoor videos, we put markers along four directions (0°, 90°, 180°, and 270°), again spaced one meter apart, starting from the camera and reaching the furthest distances of the location, e.g., the walls, except in one direction, in which the site is partially open (see Figure 2). Due to poor illumination conditions and camera resolution, not all markers are clearly visible in the frames to match them to the pixels in the equirectangular image. For indoor videos, we decided to limit the maximum annotated distance to the camera to 6 m, which is adequate for most indoor environments. Like the outdoor ones, these videos needed a post-processing step to correct distortions due to misalignments.

Knowing the relationships between pixel coordinates of markers in the images and their distance to the camera, all other matches between image pixels and true distances may be estimated by interpolation. To estimate this relationship, we tested several regression models, but we observed empirically that piecewise linear interpolation is accurate enough for our goals. Finally, we assigned a "real" distance to the camera to the pixels in the equirectangular image either by direct measurement (in the marker spots) or by interpolation (for all the other points) (Figure 4). Note that, with our method, we annotate only the lower part of the image (under the horizon), which includes the points on the ground satisfying the horizontality constraint. Note also that the angular coordinate of a point in the scene (pan) is the angle formed by the line connecting the point to the center of the camera and the reference direction (e.g., the frontal one), and is always known in omnidirectional cameras as:

β = (x − w/2) / (w/2) · 180°    (1)

where w is the image width and x is the x-coordinate of the equirectangular image. Knowing the distance and the pan angle of a point, we know its relative position to the camera in the real world (Figure 5).
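As an illustration of how Equation (1) and the piecewise linear interpolation of the marker annotations fit together, the sketch below (our own code, not released with the dataset; the marker rows are made-up placeholders roughly corresponding to a camera height of about 1.5 m) maps a pixel of the lower half of a frame to its pan angle and an interpolated ground distance.

```python
import numpy as np

def pan_angle(x, width):
    """Pan angle beta in degrees from the x-coordinate, Equation (1)."""
    return (x - width / 2.0) / (width / 2.0) * 180.0

def ground_distance(y, marker_rows, marker_dists):
    """Piecewise linear interpolation between annotated markers.

    marker_rows:  y-coordinates (pixels) of the markers; larger distances
                  correspond to smaller y (closer to the horizon row).
    marker_dists: corresponding ground distances in meters.
    """
    # np.interp needs increasing x-values; marker rows decrease as the
    # distance grows, so interpolate on the reversed arrays.
    rows = np.asarray(marker_rows, dtype=float)[::-1]
    dists = np.asarray(marker_dists, dtype=float)[::-1]
    return float(np.interp(y, rows, dists))

if __name__ == "__main__":
    W, H = 3840, 2160
    # Hypothetical marker annotations for one outdoor video: the last image
    # row is distance 0, markers at 1 m steps approach the horizon row.
    marker_rows  = [2160, 1756, 1523, 1399, 1327, 1280, 1248, 1225, 1207, 1194, 1182]
    marker_dists = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    x, y = 2900, 1550                                      # a pixel on the ground
    print(pan_angle(x, W))                                 # pan angle in degrees
    print(ground_distance(y, marker_rows, marker_dists))   # interpolated distance in meters
```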



Figure 4. Relationship between the markers' distance to the camera (in meters) and the corresponding y-coordinate in the equirectangular image, with different values of camera height (colored curves), for outdoor (top) and indoor (bottom) videos. Points at distances 1, 2, 3 ... meters relate to the markers' locations. Points between two markers are linearly interpolated. The points at the "zero" distance refer to the pixels belonging to the last row of the image (y = 2160).

Figure 5. Annotation of pedestrians into the equirectangular image (top). Bird's eye view of the same scene (bottom). For each pedestrian, the image shows the corresponding distance, d, on the ground and the angle, β (pan).


5. Distance Estimation Algorithm

In this section, we describe our algorithm to estimate the distances of the objects in the scene to the camera. Our method is based on pure geometrical facts under the "horizontality" hypothesis discussed in the previous section. Furthermore, our method is uncalibrated, as there is no need to know or estimate the camera parameters, except for its height, which can be easily measured during the acquisition process.

Given an object in the scene (e.g., a person) touching the ground (supposed to be parallel to the camera horizontal plane) in a point (Figure 6), the distance, d, of this point to the camera can be estimated as:

d = hC · cot α    (2)

where hC is the camera height and α is the angle between the line that joins the center of the camera and the horizon and the segment that joins the center of the camera and the point of intersection between the subject and the ground. Distance, d, is one of the two catheti of a right triangle and is related to the other one by the trigonometric Formula (2).

Figure 6. Geometrical representation of a subject acquired by a 360° device. d is the distance between the camera and the subject, hC is the camera height, and α is the angle between the line that joins the center of the camera and the horizon and the segment that joins the center of the camera and the point of intersection between the subject and the ground.

α can be computed directly from the frames of the videos as

α = (yL − h/2) / (h/2) · 90°    (3)

where yL is the y coordinate, in pixels, of the lower bound of the subject (e.g., in the case of a person, the points where the feet touch the ground, that is the lower part of the bounding box) and h is the equirectangular image height (also in pixels) (Figure 7). Note that Equation (2) can also be applied if the height of the camera is greater than that of the subject. Remember that all the points in a plane parallel to the horizontal plane and equally distant to the camera are projected on the same row of the resulting equirectangular image. Finally, the only parameter that we need to know to compute the distances in the real world is the camera height, which can be easily measured during the acquisition phase.

Our method is totally unsupervised, does not need any training phase or any complex computation, but is just the result of a trigonometric formula. It can be used as a post-processing step of computer vision applications, e.g., tracking people, without affecting the computational load. For example, by picking from the equirectangular image the angular coordinate of an object, as explained in Section 4.1, and estimating its distance from the camera, we know the location (in polar coordinates) of the object in our coordinate system (note that the camera location is the origin of the system). With simple mathematical operations, it is possible to compute the mutual distance between two objects in the real world, which is useful for many different applications, such as monitoring health restrictions, observing tourist behavior in cultural sites, etc.
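A minimal sketch of the whole chain described above (our own illustrative code, not the authors' implementation; the bounding boxes and camera height in the example are assumed values): given the camera height and a pedestrian's bounding box in a W × H equirectangular frame, it applies Equations (1)–(3) to obtain the polar position (d, β) of the pedestrian and then the mutual distance between two pedestrians on the ground plane via the law of cosines.

```python
import math

def estimate_distance(y_low, img_height, cam_height):
    """Equations (2)-(3): ground distance from the bounding-box lower bound."""
    alpha = (y_low - img_height / 2.0) / (img_height / 2.0) * 90.0   # degrees, Eq. (3)
    if alpha <= 0:
        raise ValueError("lower bound above the horizon: horizontality hypothesis violated")
    return cam_height / math.tan(math.radians(alpha))                # hC * cot(alpha), Eq. (2)

def polar_position(bbox, img_width, img_height, cam_height):
    """Polar coordinates (d, beta) of a pedestrian from its box (x0, y0, x1, y1)."""
    x0, _, x1, y1 = bbox
    beta = ((x0 + x1) / 2.0 - img_width / 2.0) / (img_width / 2.0) * 180.0   # Eq. (1)
    d = estimate_distance(y1, img_height, cam_height)
    return d, beta

def mutual_distance(p1, p2):
    """Euclidean distance on the ground between two (d, beta) positions."""
    d1, b1 = p1
    d2, b2 = p2
    return math.sqrt(d1 ** 2 + d2 ** 2 - 2 * d1 * d2 * math.cos(math.radians(b1 - b2)))

if __name__ == "__main__":
    W, H, CAM_HEIGHT = 3840, 2160, 1.5                 # camera height in meters (assumed)
    person_a = polar_position((1200, 700, 1320, 1500), W, H, CAM_HEIGHT)   # made-up boxes
    person_b = polar_position((2500, 750, 2610, 1400), W, H, CAM_HEIGHT)
    print(person_a, person_b, mutual_distance(person_a, person_b))
```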

Figure 7. How to compute α from an equirectangular image. α is proportional to the segment that joins the horizon to the lower bound of the bounding box, as shown in Equation (3).

6. Results

We tested our distance estimation algorithm on the CVIP360 dataset described in Section 4. The method could not be tested on other publicly available datasets since the camera height was unknown. Furthermore, in many of these datasets, videos are obtained by cylindrical projections, so our method, which is based on geometrical facts upon the spherical acquisition process of the scene, cannot be applied.

For each person and each frame in a video, we consider the vertical coordinate of the lower bound of the corresponding bounding box, which are points located on the ground. We can assign to the person a distance according to the graph in Figure 4, which relates bounding box coordinates to the real distance to the camera. This value is used as the ground truth value to assess the estimation error committed by our algorithm according to Formula (2). To evaluate the performances, we used two different metrics:

Mean Absolute Error

MAE = (1/N) Σ_i |d_e^i − d_T^i|    (4)

Mean Relative Error

MRE = (1/N) Σ_i |d_e^i − d_T^i| / d_T^i    (5)

where d_e^i is the estimated distance, d_T^i is the true distance, and N is the number of evaluated bounding boxes, which depends on the number of people and the number of frames in each video.

We measured these metrics only within the distance range for which we have the ground truth (10 m for outdoor videos, 6 m for indoor ones). All values above these thresholds are clipped to the maximum value. If both the estimated and the true distances are above the maximum distance, the corresponding error is not computed in the metrics.

MAE measures the absolute error (in meters) between the estimated distance of a person to the camera and the real one. Figure 8 shows how the MAE changes for increasing values of the ground truth distance in outdoor and indoor videos. Note that the average absolute error is 5 cm in outdoor videos and 2.5 cm in the indoor ones and that the error slightly increases only at higher distances, but is however very low and acceptable for any type of distancing application. Also note that, moving away from the camera, points of the equirectangular image are closer to the horizon, and the space between a marker and another (1 m in the real world) is projected in a smaller range of pixels in the image.
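For completeness, a short sketch of how the two metrics can be computed against the annotations (our own code under the clipping rules stated above; the distance arrays are placeholders):

```python
import numpy as np

def mae_mre(d_est, d_true, max_dist):
    """Mean absolute and mean relative error, Equations (4)-(5), with range clipping."""
    d_est = np.clip(np.asarray(d_est, dtype=float), None, max_dist)
    d_true = np.clip(np.asarray(d_true, dtype=float), None, max_dist)
    # Discard pairs in which both distances exceed the annotated range.
    keep = ~((d_est >= max_dist) & (d_true >= max_dist))
    err = np.abs(d_est[keep] - d_true[keep])
    return err.mean(), (err / d_true[keep]).mean()

if __name__ == "__main__":
    # Placeholder values: estimated vs. annotated distances (meters) for a few boxes.
    est = [1.02, 2.05, 2.96, 4.10, 7.3, 11.0]
    true = [1.00, 2.00, 3.00, 4.00, 7.2, 12.0]
    print(mae_mre(est, true, max_dist=10.0))   # outdoor annotated range: 10 m
```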

Figure 8. Experimental results. Mean absolute error versus distance to the camera, for outdoor (top) and indoor (bottom) videos. Note that the maximum measurable distance is 6 m in indoor videos and 10 m in outdoor videos. Graphs also report the overall values.

MRE measures the relative error between the estimated distance of a person to the camera and its ground truth value, with respect to the latter. In Figure 9 we show the values of MRE versus the real distance values. As expected, MRE has its maximum value for the closest distance (as the estimation error is normalized by the real distance), but it is in any case lower than 9% and, in most cases, around 1%. Similar to MAE, MRE also slightly increases at the furthest distances.


Figure 9. Experimental results. Mean relative error versus distance to the camera, for outdoor (top) and indoor (bottom) videos. Note that the maximum measurable distance is 6 m in indoor videos and 10 m in outdoor videos. Graphs also report the overall values.

We also measured the MAE and MRE for different values of camera height, which is the only setting parameter that can affect our results (Figure 10). More specifically, assuming hT is the true camera height, we measured how the error varies when a different value of camera height, he, is used in Equation (2), thus simulating some possible inaccuracy in measuring this parameter during the video acquisition process.

MRE shows if the error is dependent on the value of the distance. By exploiting Equation (2), it can easily be shown that MRE depends only on the deviation between the true camera height and the value used to compute the estimated distance in Equation (2):

MRE = (1/N) Σ_i |d_e^i − d_T^i| / d_T^i = (1/N) Σ_i |he·cot α_T^i − hT·cot α_T^i| / (hT·cot α_T^i) = (1/N) Σ_i |he − hT| / hT = |he − hT| / hT    (6)

where d_e^i is the estimated distance for the i-th point, d_T^i is the corresponding ground truth value, and α_T^i is the angle computed on the equirectangular image as explained in Equation (3). Therefore, a variation of 5% in the camera height parameter with respect to its true value causes a change of 0.05 in the MRE. The experimental results (see Figure 10) confirm this observation.
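The identity in Equation (6) can also be checked numerically with a few lines (our own sketch, not from the paper): using a wrong camera height he scales every estimated distance by he/hT, so the relative error is the same for all points and equals |he − hT|/hT.

```python
import math

def mre_for_height_error(h_true, h_est, alphas_deg):
    """Numerical check of Equation (6) for a list of angles alpha (degrees)."""
    errs = []
    for a in alphas_deg:
        d_true = h_true / math.tan(math.radians(a))
        d_est = h_est / math.tan(math.radians(a))
        errs.append(abs(d_est - d_true) / d_true)
    return sum(errs) / len(errs)

if __name__ == "__main__":
    alphas = [10, 20, 35, 50, 70]                        # arbitrary angles below the horizon
    print(mre_for_height_error(1.50, 1.575, alphas))     # 5% height error -> MRE = 0.05
    print(abs(1.575 - 1.50) / 1.50)                      # closed-form value from Eq. (6)
```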



Figure 10. Experimental results. Sensitivity of MAE and MRE errors versus the height of the camera. The x-axis is the deviation (in percentage) between the camera height set by the user in Equation (2) and the true value.

Limitations and Strengths

In this section we present weaknesses and points of strength of our estimation method. In regard to the limitations, we believe that the accuracy of the method mainly depends on human imprecision in the setting phase (as the markers may all be placed incorrectly at the desired distances) and on the environment conditions (i.e., the ground is not exactly flat), rather than on the used equation.

We also observed that the error increases when the person appears close to the image borders (see Figure 11). This may depend on several reasons:
- The lens distortion is at its maximum at the borders, so the assumption that the projection is spherical may not hold;
- Pixels near the borders of the equirectangular image are reconstructed by interpolation;
- After correcting the distortion to ensure the horizontality hypothesis, as explained in Section 4.1, a residual misalignment between the markers at the center and those at the borders results in different pixel positions.


Figure 11. We observed that the method is less accurate when pedestrians approach the borders of the equirectangular image, as in the case in the figure (top). The bird's eye view of the scene (bottom), showing the positions of all the persons in the camera system. For the person close to the borders, we show both the correct (according to the ground truth) and the estimated (according to our method) positions, in green and black respectively. In the other cases, the error is negligible with respect to the scale of the graph.

With respect to the last point, in fact, we used as ground truth the average value of the pixel coordinates of the markers at the same distance, but the difference between the real distance and the estimated one, in practice, should depend on the object position. For simplicity, since this difference is very small in all the cases, we decided to ignore this effect.

Figure 12 also shows two representative cases where our algorithm cannot be applied. If a person jumps high, during the action for some instants the feet do not touch the ground, then the related bounding box is above the plane and the horizontality hypothesis does not hold. The hypothesis is broken as well when a person goes upstairs, as the feet are not in contact with the ground. In our dataset the number of frames in which this condition does not hold is negligible.

Figure 12. Some representative cases where the method cannot be applied: person jumping high (top); person going upstairs (bottom).

The strengths of the proposed method are its simplicity (it requires the use of just a single mathematical equation) as well as its efficiency, since no complex computation is required. The method is unsupervised (no training images are needed) and practically does not require calibration (only the camera height value has to be known). It can be used in several practical situations, as the method requirements are satisfied in most environments and scenes. Furthermore, results show the accuracy is very high, in most cases with only a few centimeters of error.

7.7. Conclusions Conclusions ◦ ComputerComputer vision vision research research opened opened up up to to th thee 360° 360 multimediamultimedia processing processing world, world, and and with this paper we want to give our contribution to the field. First, we contribute our with this paper◦ we want to give our contribution to the field. First, we contribute our datasetdataset of of 360° 360 annotatedannotated videos, videos, as as currentl currentlyy only only a a few few are are available available to to the the community. community. NoNo technological technological improvements improvements can can be be done done wi withoutthout data data to to test test new new solutions, solutions, and and our our datasetdataset will will help help researchers researchers to to develop develop and and evaluate evaluate their their works, especially especially for for video surveillance applications, e.g., tracking pedestrians in 360◦ videos. To our knowledge surveillance applications, e.g., tracking pedestrians in 360° videos. To our knowledge this is the first dataset with indoor and outdoor omnidirectional videos of real scenarios, containing both depth information and annotations of moving pedestrians, designed for depth estimation and video surveillance applications. Second, we contribute our distance estimation method, which can be applied to a lot of case studies: for monitoring the

J. Imaging 2021, 7, 158 17 of 19

To our knowledge, this is the first dataset of real indoor and outdoor omnidirectional videos containing both depth information and annotations of moving pedestrians, designed for depth estimation and video surveillance applications. Second, we contribute our distance estimation method, which can be applied to many use cases: monitoring the physical distance between people in specific environments, to comply with health regulations; studying the behavior of visitors in a place of interest, for tourism purposes; helping mobile robots plan their paths and avoid obstacles; and improving user experiences in augmented reality applications. Despite the occasional cases in which the method cannot be applied, our approach is simple, precise, and efficient, and does not need any complex calibration process to work.

In our future work, we plan to study and implement innovative techniques to track people in omnidirectional videos, starting from our dataset and analyzing the specific geometric properties of the acquisition process and the different projection methodologies. Furthermore, we plan to expand our dataset by acquiring scenes in different environments and conditions, in addition to using a multicamera setting.

Author Contributions: Conceptualization, G.M. and L.L.P.; Data Curation, G.M.; Methodology, G.M., L.L.P. and M.L.C.; Supervision, L.L.P., E.A. and M.L.C.; Writing—original draft, G.M.; Writing—review & editing, G.M., L.L.P. and M.L.C. All authors have read and agreed to the published version of the manuscript.

Funding: This research was partially funded by the Italian PON IDEHA grant ARS01_00421 and the PRIN 2017 I-MALL grant 2017BH297_004.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The CVIP360 dataset is publicly available for research purposes only and can be downloaded from: https://drive.google.com/drive/folders/1LMPW3HJHtqMGcIFXFo74x3t0Kh2REuoS?usp=sharing.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Rhee, T.; Petikam, L.; Allen, B.; Chalmers, A. MR360: Mixed reality rendering for 360° panoramic videos. IEEE Trans. Vis. Comput. Graph. 2017, 23, 1379–1388. [CrossRef] [PubMed]
2. Teo, T.; Lawrence, L.; Lee, G.A.; Billinghurst, M.; Adcock, M. Mixed reality remote collaboration combining 360 video and 3D reconstruction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019; pp. 1–14.
3. Gaspar, J.; Winters, N.; Santos-Victor, J. Vision-based navigation and environmental representations with an omnidirectional camera. IEEE Trans. Robot. Autom. 2000, 16, 890–898. [CrossRef]
4. Rituerto, A.; Puig, L.; Guerrero, J.J. Visual SLAM with an omnidirectional camera. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 348–351.
5. Monteleone, V.; Lo Presti, L.; La Cascia, M. Pedestrian tracking in 360 video by virtual PTZ cameras. In Proceedings of the 2018 IEEE 4th International Forum on Research and Technology for Society and Industry (IEEE RTSI), Palermo, Italy, 10–13 September 2018; pp. 1–6.
6. Monteleone, V.; Lo Presti, L.; La Cascia, M. Particle filtering for tracking in 360 degrees videos using virtual PTZ cameras. In Proceedings of the International Conference on Image Analysis and Processing, Trento, Italy, 9–13 September 2019; pp. 71–81.
7. Lo Presti, L.; La Cascia, M. Deep motion model for pedestrian tracking in 360 degrees videos. In Proceedings of the International Conference on Image Analysis and Processing, Trento, Italy, 9–13 September 2019; pp. 36–47.
8. Nayar, S.K. Catadioptric omnidirectional camera. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997; pp. 482–488.
9. Abdelhamid, H.; Weng, D.; Chen, C.; Abdelkarim, H.; Mounir, Z.; Raouf, G. 360 degrees imaging systems design, implementation and evaluation. In Proceedings of the 2013 International Conference on Mechatronic Sciences, Electric Engineering and Computer (MEC), Shenyang, China, 20–22 December 2013; pp. 2034–2038.
10. Corbillon, X.; Simon, G.; Devlic, A.; Chakareski, J. Viewport-adaptive navigable 360-degree video delivery. In Proceedings of the 2017 IEEE International Conference on Communications (ICC), Paris, France, 21–25 May 2017; pp. 1–7.

11. Elwardy, M.; Hans-Jürgen, Z.; Veronica, S. Annotated 360-degree image and video databases: A comprehensive survey. In Proceedings of the 2019 13th International Conference on Signal Processing and Communication Systems (ICSPCS), Gold Coast, QLD, Australia, 16–18 December 2019.
12. Coors, B.; Alexandru, P.C.; Andreas, G. SphereNet: Learning spherical representations for detection and classification in omnidirectional images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
13. Delforouzi, A.; David, H.; Marcin, G. Deep learning for object tracking in 360 degree videos. In Proceedings of the International Conference on Computer Recognition Systems, Polanica Zdroj, Poland, 20–22 May 2019.
14. Mi, T.W.; Mau-Tsuen, Y. Comparison of tracking techniques on 360-degree videos. Appl. Sci. 2019, 9, 3336. [CrossRef]
15. Liu, K.C.; Shen, Y.T.; Chen, L.G. Simple online and realtime tracking with spherical panoramic camera. In Proceedings of the 2018 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 12–14 January 2018.
16. Yang, F. Using panoramic videos for multi-person localization and tracking in a 3D panoramic coordinate. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020.
17. Chen, G.; St-Charles, P.-L.; Bouachir, W.; Joeisseint, T.; Bilodeau, G.-A.; Bergevin, R. Reproducible Evaluation of Pan-Tilt-Zoom Tracking. Technical Report, arXiv 2015, arXiv:1505.04502.
18. Demiröz, B.E.; Ari, I.; Eroğlu, O.; Salah, A.A.; Akarun, L. Feature-based tracking on a multi-omnidirectional camera dataset. In Proceedings of the 2012 5th International Symposium on Communications, Control and Signal Processing, Rome, Italy, 2–4 May 2012.
19. Yogamani, S.; Hughes, C.; Horgan, J.; Sistu, G.; Varley, P.; O'Dea, D.; Uricar, M.; Milz, S.; Simon, M.; Amende, K.; et al. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019.
20. Adjabi, I.; Ouahabi, A.; Benzaoui, A.; Taleb-Ahmed, A. Past, present, and future of face recognition: A review. Electronics 2020, 9, 1188. [CrossRef]
21. Lo Presti, L.; La Cascia, M. Boosting Hankel matrices for face emotion recognition and pain detection. Comput. Vis. Image Underst. 2017, 156, 19–33. [CrossRef]
22. Lo Presti, L.; Morana, M.; La Cascia, M. A data association algorithm for people re-identification in photo sequences. In Proceedings of the 2010 IEEE International Symposium on Multimedia, Taichung, Taiwan, 13–15 December 2010; pp. 318–323.
23. Li, Y.H.; Lo, I.C.; Chen, H.H. Deep face rectification for 360° dual-fisheye cameras. IEEE Trans. Image Process. 2020, 30, 264–276. [CrossRef] [PubMed]
24. Fu, J.; Alvar, S.R.; Bajic, I.; Vaughan, R. FDDB-360: Face detection in 360-degree fisheye images. In Proceedings of the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 28–30 March 2019; pp. 15–19.
25. Lo Presti, L.; Sclaroff, S.; La Cascia, M. Object matching in distributed video surveillance systems by LDA-based appearance descriptors. In Proceedings of the International Conference on Image Analysis and Processing, Vietri sul Mare, Italy, 8–11 September 2009; pp. 547–557.
26. Liu, Y.; Jiang, J.; Sun, J.; Bai, L.; Wang, Q. A survey of depth estimation based on computer vision. In Proceedings of the 2020 IEEE Fifth International Conference on Data Science in Cyberspace (DSC), Hong Kong, China, 27–30 July 2020; pp. 135–141.
27. Laga, H.; Jospin, L.V.; Boussaid, F.; Bennamoun, M. A survey on deep learning techniques for stereo-based depth estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2020. [CrossRef] [PubMed]
28. Bhoi, A. Monocular depth estimation: A survey. arXiv 2019, arXiv:1901.09402.
29. Zhao, C.; Sun, Q.; Zhang, C.; Tang, Y.; Qian, F. Monocular depth estimation based on deep learning: An overview. Sci. China Technol. Sci. 2020, 63, 1–16. [CrossRef]
30. Liu, F.; Shen, C.; Lin, G.; Reid, I. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 2024–2039. [CrossRef] [PubMed]
31. Jiang, H.; Huang, R. High quality monocular depth estimation via a multi-scale network and a detail-preserving objective. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019.
32. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
33. Li, B.; Dai, Y.; He, M. Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference. Pattern Recognit. 2018, 83, 328–339. [CrossRef]
34. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 3828–3838.
35. Wang, N.H.; Solarte, B.; Tsai, Y.H.; Chiu, W.C.; Sun, M. 360SD-Net: 360 stereo depth estimation with learnable cost volume. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 582–588.
36. Jiang, H.; Sheng, Z.; Zhu, S.; Dong, Z.; Huang, R. UniFuse: Unidirectional fusion for 360 panorama depth estimation. IEEE Robot. Autom. Lett. 2021, 6, 1519–1526. [CrossRef]
37. Wang, F.E.; Yeh, Y.H.; Sun, M.; Chiu, W.C.; Tsai, Y.H. BiFuse: Monocular 360 depth estimation via bi-projection fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 462–471.

38. Jin, L.; Xu, Y.; Zheng, J.; Zhang, J.; Tang, R.; Xu, S.; Gao, S. Geometric structure based and regularized depth estimation from 360 indoor imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 889–898.
39. Im, S.; Ha, H.; Rameau, F.; Jeon, H.G.; Choe, G.; Kweon, I.S. All-around depth from small motion with a spherical panoramic camera. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 156–172.
40. Zioulis, N.; Karakottas, A.; Zarpalas, D.; Daras, P. OmniDepth: Dense depth estimation for indoors spherical panoramas. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 448–465.
41. Zioulis, N.; Karakottas, A.; Zarpalas, D.; Alvarez, F.; Daras, P. Spherical view synthesis for self-supervised 360 depth estimation. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Quebec City, QC, Canada, 16–19 September 2019; pp. 690–699.
42. Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Niessner, M.; Savva, M.; Zhang, Y. Matterport3D: Learning from RGB-D data in indoor environments. arXiv 2017, arXiv:1709.06158.
43. Armeni, I.; Sax, S.; Zamir, A.R.; Savarese, S. Joint 2D-3D-semantic data for indoor scene understanding. arXiv 2017, arXiv:1702.01105.
44. Handa, A.; Patraucean, V.; Stent, S.; Cipolla, R. SceneNet: An annotated model generator for indoor scene understanding. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; pp. 5737–5743.
45. Song, S.; Yu, F.; Zeng, A.; Chang, A.X.; Savva, M.; Funkhouser, T. Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
46. Wang, F.E.; Hu, H.N.; Cheng, H.T.; Lin, J.T.; Yang, S.T.; Shih, M.L.; Sun, M. Self-supervised learning of depth and camera motion from 360° videos. arXiv 2018, arXiv:1811.05304.
47. Available online: https://drive.google.com/drive/folders/1LMPW3HJHtqMGcIFXFo74x3t0Kh2REuoS?usp=sharing (accessed on 20 August 2021).
48. Available online: https://buy.garmin.com/en-US/US/p/562010/pn/010-01743-00 (accessed on 20 August 2021).