Evaluation of Super-Voxel Methods for Early Video Processing

Chenliang Xu and Jason J. Corso
Computer Science and Engineering, SUNY at Buffalo
{chenlian,jcorso}@buffalo.edu

Abstract

Supervoxel segmentation has strong potential to be incorporated into early video analysis as superpixel segmentation has in image analysis. However, there are many plausible supervoxel methods and little understanding as to when and where each is most appropriate. Indeed, we are not aware of a single comparative study on supervoxel segmentation. To that end, we study five supervoxel algorithms in the context of what we consider to be a good supervoxel: namely, spatiotemporal uniformity, object/region boundary detection, region compression and parsimony. For the evaluation we propose a comprehensive suite of 3D volumetric quality metrics to measure these desirable supervoxel characteristics. We use three benchmark video data sets with a variety of content-types and varying amounts of human annotations. Our findings have led us to conclusive evidence that the hierarchical graph-based and segmentation by weighted aggregation methods perform best and almost equally well on nearly all the metrics and are the methods of choice given our proposed assumptions.

Figure 1. We comparatively evaluate five supervoxel methods (SWA, GB, GBH, Mean Shift and Nyström) on 3D volumetric metrics that measure various desirable characteristics of the supervoxels, e.g., boundary detection.

1. Introduction

Images have many pixels; videos have more. It has thus become standard practice to first preprocess images and videos into more tractable sets by either extraction of salient points [32] or oversegmentation into superpixels [31]. The preprocessing output data—salient points or superpixels—are more perceptually meaningful than raw pixels, which are merely a consequence of digital sampling [31]. However, the same practice does not entirely exist in video analysis. Although many methods do indeed initially extract salient points or dense trajectories, e.g., [20], few methods we are aware of rely on a supervoxel segmentation, which is the video analog to a superpixel segmentation. In fact, those papers that do preprocess video tend to rely on a per-frame superpixel segmentation, e.g., [21], or use a full-video segmentation, e.g., [15].

The basic position of this paper is that supervoxels have great potential in advancing video analysis methods, as superpixels have for image analysis. To that end, we perform a thorough comparative evaluation of five supervoxel methods; note that none of these methods had been proposed intrinsically as a supervoxel method, but each is either sufficiently general to serve as one or has been adapted to serve as one. The five methods we choose—segmentation by weighted aggregation (SWA) [6, 33, 34], graph-based (GB) [10], hierarchical graph-based (GBH) [15], mean shift [29], and Nyström normalized cuts [11, 12, 35]—broadly sample the methodology-space and are intentionally selected to best analyze methods with differing qualities for supervoxel segmentation (see Figure 1 for examples). For example, both the SWA and the Nyström methods use the normalized cut criterion as the underlying objective function, but SWA minimizes it hierarchically whereas Nyström does not. Similarly, there are two graph-based methods that optimize the same function, but one is subsequently hierarchical (GBH). We note that a similar selection of segmentation methods has been used in the (2D) image boundary comparative study [1].
Our paper pits the five methods against one another in an evaluation on a suite of 3D metrics designed to assess the methods on basic desiderata (Section 2.2), such as following object boundaries and spatiotemporal coherence. The specific metrics we use are 3D undersegmentation error, 3D segmentation accuracy, 3D boundary recall, and explained variation. We use three complementary video data sets to facilitate the study (two of the three have hand-drawn object or region boundaries). Our evaluation yields conclusive evidence that two of the hierarchical methods (GBH and SWA) perform best and almost equally well on nearly all the metrics (Nyström performs best on the 3D undersegmentation error) and are the methods of choice given our proposed metrics. Although GBH and SWA are quite distinct in formulation and may perform differently under other assumptions, we find that a common feature among the two methods (and one that separates them from the other three) is the manner in which coarse-level features are incorporated into the hierarchical computation. We thoroughly discuss comparative performance in Section 4 after presenting a theoretical background in Section 2 and a brief description of the methods in Section 3. Finally, to help facilitate the adoption of supervoxel methods in video, we make the developed source code—both the supervoxel methods and the benchmark metrics—and processed video results on the benchmark and major data sets available to the community.

2. Background

2.1. Superpixels

The term superpixel was coined by Ren and Malik [31] in their work on learning a binary classifier that can segment natural images. The main rationale behind superpixel oversegmentation is twofold: (1) pixels are not natural elements but merely a consequence of the discrete sampling of digital images, and (2) the number of pixels is very high, making optimization over sophisticated models intractable. Ren and Malik [31] use the normalized cut algorithm [35] for extracting the superpixels, with contour and texture cues incorporated. Subsequently, many superpixel methods have been proposed [22, 23, 26, 40, 43] or adopted as such [5, 10, 41] and used for a variety of applications: e.g., human pose estimation [27], semantic pixel labeling [17, 37], automatic photo pop-up from a single image [18] and multiple-hypothesis video segmentation [39], to name a few. Few superpixel methods have been developed to perform well on video frames; an exception is [8], who base the method on minimum cost paths but do not incorporate any temporal information.
2.2. What makes a good supervoxel method?

First, we define a supervoxel—the video analog to a superpixel. Concretely, given a 3D lattice $\Lambda^3$ (the voxels in the video), a supervoxel $v$ is a subset of the lattice, $v \subset \Lambda^3$, such that the union of all supervoxels comprises the lattice and they are pairwise disjoint: $\bigcup_i v_i = \Lambda^3$ and $v_i \cap v_j = \emptyset$ for all pairs $(i, j)$; a minimal code illustration of this representation appears at the end of this subsection. Obviously, various image/video features may be computed on the supervoxels, such as color histograms and textons. In this initial definition, there is no mention of certain desiderata that one may expect, such as locality, coherence, and compactness. Rather than include them in mathematical terms, we next list terms of this sort as desirable characteristics of a good supervoxel method.

We define a good supervoxel method based jointly on criteria for good supervoxels, which follow closely from the criteria for good segments [31], and the actual cost of generating them (videos have an order of magnitude more pixels over which to compute). Later, in our experimental evaluation, we propose a suite of benchmark metrics designed to evaluate these criteria (Section 4.2).

Spatiotemporal Uniformity. The basic property of spatiotemporal uniformity, or conservatism [26], encourages compact and uniformly shaped supervoxels in space-time [22]. This property embodies many of the basic Gestalt principles—proximity, continuation, closure, and symmetry—and helps simplify computation in later stages [31]. Furthermore, Veksler et al. [40] show that for the case of superpixels, compact segments perform better than those varying in size on the higher-level task of salient object segmentation. For temporal uniformity (called coherence in [15]), we expect a mid-range compactness to be most appropriate for supervoxels (bigger than, say, five frames and less than the whole video).

Spatiotemporal Boundaries and Preservation. The supervoxel boundaries should align with object/region boundaries when they are present, and the supervoxel boundaries should be stable when they are not present; i.e., the set of supervoxel boundaries is a superset of object/region boundaries. Similarly, every supervoxel should overlap with only one object [23]. Furthermore, the supervoxel boundaries should encourage a high degree of explained variation [26] in the resulting oversegmentation. If we consider the oversegmentation by supervoxels as a compression method in which each supervoxel region is represented by its mean color, we expect the distance between the compressed and original video to be minimized.

Computation. The computational cost of the supervoxel method should reduce the overall computation time required for the entire application in which the supervoxels are being used.

Performance. The oversegmentation into supervoxels should not reduce the achievable performance of the application. Our evaluation will not directly evaluate this characteristic (because we study the more basic ones above).

Parsimony. The above properties should be maintained with as few supervoxels as possible [23].
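To make the definition above concrete, note that a supervoxel segmentation is commonly stored as a dense integer label volume over the voxel lattice; that representation satisfies the union and disjointness constraints by construction, since every voxel carries exactly one label. The following minimal sketch is our own illustration (not part of the released library) and assumes numpy:

```python
import numpy as np

def supervoxels_from_labels(labels):
    """Recover the supervoxels (as voxel-index arrays) from a dense
    label volume of shape (T, H, W). Every voxel holds exactly one
    label, so the recovered sets cover the lattice and are pairwise
    disjoint, matching the supervoxel definition."""
    return {v: np.argwhere(labels == v) for v in np.unique(labels)}

# Toy lattice: 2 frames of 4x4, split into two space-time supervoxels.
labels = np.zeros((2, 4, 4), dtype=int)
labels[:, :, 2:] = 1                       # right half of every frame
svs = supervoxels_from_labels(labels)
assert sum(len(v) for v in svs.values()) == labels.size  # union is the lattice
```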
3. Methods

We study five supervoxel methods—segmentation by weighted aggregation (SWA) [6, 33, 34], graph-based (GB) [10], hierarchical graph-based (GBH) [15], mean shift [29], and Nyström normalized cuts [11, 12, 35]—that broadly sample the methodology-space among statistical and graph partitioning methods [1]. We have selected these five due to their respective traits and their inter-relationships: for example, Nyström and SWA both optimize the same normalized cut criterion. We describe the methods in some more detail below. We note that many other methods have been proposed in the computer vision literature for video segmentation, e.g., [2, 3, 14, 19, 24, 25, 39, 40, 41], but we do not cover them in any detail in this study. We also do not cover strictly temporal segmentation, e.g., [30].

Meanshift. Mean shift is a mode-seeking method, first proposed by Fukunaga and Hostetler [13]. Comaniciu and Meer [5] and Wang et al. [42] adapt the kernel to the local structure of the feature points, which is more computationally expensive but improves segmentation results. Early work on hierarchical mean shift in video [7, 28] improves the efficiency of (isotropic) mean-shift methods by using a streaming approach. The mean shift algorithm used in our paper is presented by Paris and Durand [29], who introduce Morse theory to interpret mean shift as a topological decomposition of the feature space into density modes. A hierarchical segmentation is created by using topological persistence. Their algorithm is more efficient than previous works, especially on videos and large images.

Graph-based. Felzenszwalb and Huttenlocher [10] propose a graph-based algorithm for image segmentation; it is arguably the most popular superpixel segmentation method. Their algorithm runs in time nearly linear in the number of image pixels, which makes it suitable for extension to spatiotemporal segmentation. Initially, each pixel, as a node, is placed in its own region $R$, connected with its 8 neighbors. Edge weights measure the dissimilarity between nodes (e.g., color differences). They define the internal difference of a region, $\mathrm{Int}(R)$, as the largest edge weight in the minimum spanning tree of $R$. Traversing the edges in non-decreasing weight order, the regions $R_i$ and $R_j$ incident to the edge are merged if the current edge weight is less than the relaxed minimum internal difference of the two regions:

$\min(\mathrm{Int}(R_i) + \tau(R_i),\ \mathrm{Int}(R_j) + \tau(R_j))$ ,   (1)

where $\tau(R) = k/|R|$ is used to trigger the algorithm and gradually makes it converge; $k$ is a scale parameter that reflects the preferred region size. The algorithm also has an option to enforce a minimum region size by iteratively merging low-cost edges until all regions contain the minimum number of pixels. We have adapted the algorithm for video segmentation by building a graph over the spatiotemporal volume, in which voxels are nodes connected with their 26 neighbors in 3D space-time. One challenge in using this algorithm is the selection of an appropriate $k$ for a given video, which the hierarchical extension (next) overcomes.
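To make the merge rule of Eq. (1) concrete, the sketch below runs a Kruskal-style pass with union-find over a weighted edge list. This is our own illustrative rendering, not the authors' released implementation, and the construction of the 26-connected space-time edge list with color-difference weights is left to the caller:

```python
def graph_based_merge(num_nodes, edges, k):
    """Greedy region merging in the style of [10]. `edges` is a list of
    (weight, i, j) tuples, e.g., built over a 26-connected space-time
    lattice with color-difference weights. Returns a region id per node."""
    parent = list(range(num_nodes))
    size = [1] * num_nodes        # |R|, used by tau(R) = k / |R|
    internal = [0.0] * num_nodes  # Int(R): largest MST edge inside R

    def find(x):                  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for w, i, j in sorted(edges):  # non-decreasing weight order
        ri, rj = find(i), find(j)
        if ri != rj and w < min(internal[ri] + k / size[ri],
                                internal[rj] + k / size[rj]):  # Eq. (1)
            parent[rj] = ri
            size[ri] += size[rj]
            internal[ri] = w       # w is now the largest MST edge in R_i
    return [find(x) for x in range(num_nodes)]
```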
Hierarchical graph-based. Hierarchical graph-based video segmentation is proposed by Grundmann et al. [15]. Their algorithm builds on an oversegmentation produced by the above spatiotemporal graph-based segmentation. It then iteratively constructs a region graph over the obtained segmentation, forming a bottom-up hierarchical tree structure of the region (segmentation) graphs. Regions are described by local Lab histograms. At each step of the hierarchy, the edge weights are set to the $\chi^2$ distance between the Lab histograms of the two connected regions, and the same technique as above [10] is applied to merge regions. Each time, they scale the minimum region size as well as $k$ by a constant factor $s$. Their algorithm not only preserves the important region borders generated by the oversegmentation, but also allows a selection of the desired segmentation level, which is much better than directly manipulating $k$ to control region size.

Nyström. Normalized cuts [35] as a graph partitioning criterion has been widely used in image segmentation. A multiple-eigenvector version of normalized cuts is presented in [11]. Given a pairwise affinity matrix $W$, they compute the eigenvectors $V$ and eigenvalues $\Lambda$ of the system

$(D^{-1/2} W D^{-1/2}) V = V \Lambda$ ,   (2)

where $D$ is a diagonal matrix with entries $D_{ii} = \sum_j W_{ij}$. Each voxel is embedded in a low-dimensional Euclidean space according to the largest several eigenvectors, and the k-means algorithm is then used to do the final partitioning. To make it feasible to apply to the spatiotemporal video volume, Fowlkes et al. [12] use the Nyström approximation to solve the above eigenproblem. Their paper demonstrates segmentation on relatively low-resolution, short videos (e.g., 120 × 120 × 5) and randomly samples points from the first, middle, and last frames.

SWA is an alternative approach to optimizing the normalized cut criterion [6, 33, 34] that computes a hierarchy of sequentially coarser segmentations. The method uses an algebraic multigrid solver to compute the hierarchy efficiently. It recursively coarsens the initial graph by selecting a subset of nodes such that each node on the fine level is strongly coupled to one on the coarse level. The algorithm is nearly linear in the number of input voxels and produces a hierarchy of segmentations, which motivates its extension to a supervoxel method.
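The following sketch shows the dense form of the eigenproblem in Eq. (2), followed by the k-means step. It is our illustration only (the experiments below use a Matlab implementation); the Nyström approximation of [12] would replace the dense eigensolve with one over a sampled submatrix:

```python
import numpy as np

def ncut_embed_and_cluster(W, num_eigvecs, num_clusters, iters=50, seed=0):
    """Dense (non-approximated) version of Eq. (2): eigendecompose
    D^{-1/2} W D^{-1/2}, embed each node with the largest eigenvectors,
    then cluster with plain k-means. W is symmetric and nonnegative."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    M = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    X = vecs[:, -num_eigvecs:]             # keep the largest eigenvectors

    rng = np.random.default_rng(seed)      # Lloyd's k-means on the embedding
    centers = X[rng.choice(len(X), num_clusters, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(num_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels
```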
4. Experiments and Discussion

4.1. Experiment Setup and Data Sets

We compare the above five methods as fairly as possible. Each method is only allowed to use two feature cues: location $(x, y, t)$ and color. However, each method has its own tunable parameters; we have tuned these parameters strictly to achieve a certain desired number of supervoxels for each video in our study. In our experiments, GB is set based on per-video and per-supervoxel-number; Meanshift is set based on per-video; and Nyström, GBH, and SWA are set based on per-data-set. Following the above setup, we have computed the segmentation results for each video with a distribution of supervoxel numbers varying from fewer than 200 to more than 900. To facilitate comparison of the methods on each data set, we use linear interpolation to estimate each method's metric outputs densely. In order to run all the experiments in reasonable time and memory, the input video resolution is scaled to 240 × 160.

However, even with the above setup, the Nyström method is still not scalable as the number of supervoxels or the length of the video increases. Recall that the Nyström method can be viewed as three steps: (1) building the affinity matrix; (2) the Nyström computation; and (3) k-means clustering. Sampling too many points makes the Nyström method require too much memory, while sampling too few gives unstable and low performance. Meanwhile, the ordinary k-means clustering algorithm is sufficient for a video segmentation with few clusters, but a more efficient clustering method is expected in a supervoxel method. In our experiment, we generate results using the Nyström method with fewer than 500 supervoxels.

Implementation. We have independently implemented all methods except for the Meanshift method, for which we use source code provided on the authors' website (http://people.csail.mit.edu/sparis/#code). The SWA implementation is based on our earlier 3D-SWA work in the medical imaging domain [6]. The complete supervoxel library, benchmarking code, and documentation is available for download at http://www.cse.buffalo.edu/~jcorso/r/supervoxels/. Various supervoxel results on major data sets in the community (including the three in this paper) are also available at this location to allow for easy adoption of the supervoxel results by the community.

Data sets. We use three video data sets for our experimental purposes, with varying characteristics. The first data set is SegTrack from Tsai et al. [38] and provides a set of human-labeled single-foreground objects with the videos stratified according to difficulty on color, motion and shape. SegTrack has six videos, with an average of 41 frames-per-video (fpv), a minimum of 21 fpv and a maximum of 71 fpv. The second data set is from Chen et al. [4] and is a subset of the well-known xiph.org videos that have been supplemented with a 24-class semantic pixel labeling set (the same classes as in the MSRC object-segmentation data set [36]). The eight videos in this set are densely labeled with semantic pixels and have an average of 85 fpv, a minimum of 69 fpv and a maximum of 86 fpv. This data set allows us to evaluate the supervoxel methods against human perception. The third data set is from Grundmann et al. [15] and comprises 15 videos of varying characteristics, but predominantly with a small number of actors in the shot. In order to run all five supervoxel segmentation methods, we restrict the videos to a maximum of 100 frames (they have an average of 86 fpv and a minimum of 31 fpv). Unlike the other two data sets, this data set does not have a ground-truth segmentation, which inspires us to further explore the human-independent metrics.

4.2. Metrics and Baselines

Rather than evaluating the supervoxel methods on a particular application, as [16] does for superpixels and image segmentation, we directly consider all of the base traits described in Section 2.2 at a fundamental level. We believe these more basic evaluations have a greater potential to inform the community than those potential evaluations on a particular application. We note that some quantitative superpixel evaluation metrics have recently been used in [22, 23, 26, 40, 43], but all of them are frame-based 2D image metrics, which are not suitable for our supervoxel measures. We extend those most appropriate to validate our desiderata in 3D space-time. Given a ground-truth segmentation into segments $g_1, g_2, \ldots, g_m$ and a video segmentation into supervoxels $s_1, s_2, \ldots, s_n$, we propose the following suite of volumetric, video-based 3D metrics.

4.2.1 3D Undersegmentation Error (3D UE)

This metric measures what fraction of voxels exceed the volume boundary of the ground-truth segment when mapping the supervoxels onto it. Specifically,

$\mathrm{UE}(g_i) = \dfrac{\left[\sum_{\{s_j \,|\, s_j \cap g_i \neq \emptyset\}} \mathrm{Vol}(s_j)\right] - \mathrm{Vol}(g_i)}{\mathrm{Vol}(g_i)}$   (3)

gives the 3D UE for a single ground-truth segment $g_i$ in the video, where $\mathrm{Vol}$ is the segment volume. We take the average across all ground-truth segments in the video, giving equal weight to all ground-truth segments.

Figure 2. 3D undersegmentation error vs. number of supervoxels. Left: the results on the SegTrack data set. Right: the results on Chen's data set.

Figure 2 shows the dependency of 3D UE on the number of supervoxels. GBH, SWA and Nyström have much better performance than Meanshift and GB with fewer supervoxels. When the number of supervoxels is more than 500, GBH and SWA have almost the same competitive performance on both data sets. As the number of supervoxels increases, Meanshift quickly converges, while GB is much slower. Figure 8 shows that SWA, GBH, and Nyström generate more spatiotemporally uniform supervoxels than the other two.
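In label-volume form, Eq. (3) is direct to compute. The sketch below is our hedged rendering (assuming numpy and dense integer label volumes gt and sv of identical shape; it averages over every ground-truth label, so any background label the caller wants excluded should be filtered out first):

```python
import numpy as np

def undersegmentation_error_3d(gt, sv):
    """3D UE of Eq. (3), averaged with equal weight over ground-truth
    segments. gt and sv are integer label volumes of shape (T, H, W)."""
    errors = []
    for g in np.unique(gt):
        inside = gt == g
        vol_g = inside.sum()
        # Every supervoxel touching g contributes its *entire* volume.
        touching = np.unique(sv[inside])
        vol_touching = np.isin(sv, touching).sum()
        errors.append((vol_touching - vol_g) / vol_g)
    return float(np.mean(errors))
```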

4.2.2 3D Boundary Recall (3D BR)

In the ideal case, one can imagine the 3D boundary as the shape boundary of a 3D object, composed of surfaces. However, given a video, the 3D boundary face is not that smooth, since videos are discrete and voxels are rectangular cubes. Therefore, the 3D boundary should capture not only the 2D within-frame boundary but also the between-frame boundary. Figure 3 illustrates the between-frame boundary concept using a ground-truth segment as an example; it follows the concept we proposed in Section 2.2. The 3D boundary recall metric measures spatiotemporal boundary detection: for each segment in the ground-truth and supervoxel segmentations, we extract the within-frame and between-frame boundaries and measure recall using the standard formula (not included for space).

Figure 3. A visual explanation of the distinct nature of 3D boundaries in video (please view in color). We overlap each frame to compose a volumetric video; the green colored area, which is a part of the girl in frame i, should not be counted as a part of the girl in frame i + 1, and similarly the red area, which is a part of the girl in frame i + 1, should not be counted as a part of the girl in frame i. The lower-right graph shows the 3D boundary along the time axis (imagine you are looking through the paper).

Figure 4. 3D boundary recall vs. number of supervoxels. Left: the results on the SegTrack data set. Right: the results on Chen's data set.

Figure 4 shows the dependency of 3D BR on the number of supervoxels. GBH and SWA again perform best in 3D BR. GB quickly converges toward GBH as the number of supervoxels increases, since it serves as a preprocessing step of GBH.
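The paper omits the exact recall formula for space; one plausible zero-tolerance reading (ours, under the assumption that a boundary voxel is one whose label differs from its successor along x, y, or t) is sketched below:

```python
import numpy as np

def boundary_voxels_3d(labels):
    """Mark voxels whose label differs from the next voxel along y, x
    (within-frame boundaries) or t (between-frame boundaries)."""
    b = np.zeros(labels.shape, dtype=bool)
    for axis in range(3):                    # axis 0 is time
        changed = np.diff(labels, axis=axis) != 0
        idx = [slice(None)] * 3
        idx[axis] = slice(0, -1)
        b[tuple(idx)] |= changed
    return b

def boundary_recall_3d(gt, sv):
    """Fraction of ground-truth boundary voxels that coincide with a
    supervoxel boundary voxel (a tolerance window could be added)."""
    gt_b = boundary_voxels_3d(gt)
    sv_b = boundary_voxels_3d(sv)
    return (gt_b & sv_b).sum() / max(gt_b.sum(), 1)
```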

4.2.3 3D Segmentation Accuracy (3D ACCU)

This metric measures what fraction of a ground-truth segment is correctly classified by the supervoxels: each supervoxel should overlap with only one object/segment, a desired property from Section 2.2. For the ground-truth segment $g_i$, we assign a binary label to each supervoxel $s_j$ according to whether the majority part of $s_j$ resides inside or outside of $g_i$. Then we have a set of correctly labeled supervoxels $\bar{s}_1, \bar{s}_2, \ldots, \bar{s}_k$, and we define the 3D ACCU for $g_i$ as the fraction

$\mathrm{ACCU}(g_i) = \dfrac{\sum_{j=1}^{k} \mathrm{Vol}(\bar{s}_j \cap g_i)}{\mathrm{Vol}(g_i)}$ .   (4)

To evaluate the overall segmentation quality, we again take the average of the above fraction across all ground-truth segments in the video.

Figure 5. 3D segmentation accuracy vs. number of supervoxels. Left: the results on the SegTrack data set. Right: the results on Chen's data set.

Figure 5 shows the dependency of 3D ACCU on the number of supervoxels. GBH and SWA again perform better than the other three methods on both data sets. Meanshift performs better on Chen's data set than on SegTrack because Chen's data has full-scene segmentation that includes large, relatively homogeneous background segments (like sky), whereas the SegTrack data only segments out a single foreground object.
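A compact way to compute Eq. (4) is through a ground-truth-by-supervoxel overlap table; the sketch below is ours (for a single-foreground data set such as SegTrack, the background label would typically be excluded from the averaging):

```python
import numpy as np

def segmentation_accuracy_3d(gt, sv):
    """3D ACCU of Eq. (4): for each ground-truth segment, a supervoxel is
    correctly labeled when the majority of its voxels lie inside that
    segment; the segment's accuracy is the fraction of its volume covered
    by such supervoxels. Returns the average over segments."""
    gt_ids, gt_inv = np.unique(gt, return_inverse=True)
    _, sv_inv = np.unique(sv, return_inverse=True)
    # overlap[a, b] = number of voxels shared by segment a and supervoxel b.
    overlap = np.zeros((len(gt_ids), sv_inv.max() + 1), dtype=np.int64)
    np.add.at(overlap, (gt_inv.ravel(), sv_inv.ravel()), 1)
    sv_vol = overlap.sum(axis=0)                 # volume of each supervoxel
    majority_inside = overlap * 2 > sv_vol       # majority test, per (g, s)
    covered = (overlap * majority_inside).sum(axis=1)
    return float(np.mean(covered / overlap.sum(axis=1)))
```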

4.2.4 Explained Variation

Explained variation is proposed in [26] as a human-independent metric; in other words, it is not susceptible to variation in annotator perception that would result in differences in the human annotations, unlike the other proposed metrics. It considers the supervoxels as a compression method for a video, as described in Section 2.2:

$R^2 = \dfrac{\sum_i (\mu_i - \mu)^2}{\sum_i (x_i - \mu)^2}$   (5)

sums over the voxels $i$, where $x_i$ is the actual voxel value, $\mu$ is the global voxel mean, and $\mu_i$ is the mean value of the voxels assigned to the supervoxel that contains $x_i$. Erdem et al. [9] observe a correlation between explained variation and the human-dependent metrics for a specific object tracking task; our results, which we discuss next, show a similar trend, substantiating our prior use of the metrics based on the human annotations.

Figure 6. Explained variation metric ($R^2$) vs. number of supervoxels on the SegTrack, Chen and Gatech data sets, respectively.

Figure 6 shows the dependency of explained variation on the number of supervoxels. SWA and GBH perform better and more stably than the others, even with a relatively low number of supervoxels. The performance of GB increases dramatically and converges quickly as the number of supervoxels increases. The performance of Nyström on these three data sets further demonstrates our claim in Section 4.1 that the method is sensitive to the actual sampling density.
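Eq. (5) reduces to two per-supervoxel reductions; a minimal sketch of ours follows, shown for a single-channel video for simplicity (the experiments use color):

```python
import numpy as np

def explained_variation(video, sv):
    """R^2 of Eq. (5). video: float volume of shape (T, H, W); sv:
    integer supervoxel label volume of the same shape."""
    x = video.ravel().astype(float)
    _, inv = np.unique(sv.ravel(), return_inverse=True)
    mu = x.mean()                                  # global voxel mean
    sums = np.bincount(inv, weights=x)             # per-supervoxel sums
    counts = np.bincount(inv)
    mu_i = (sums / counts)[inv]                    # mean of the supervoxel containing each voxel
    return ((mu_i - mu) ** 2).sum() / ((x - mu) ** 2).sum()
```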

4.2.5 Computational Cost

Our operating workstation is a dual quad-core Intel Xeon E5620 CPU @ 2.4 GHz with 16 GB RAM running Linux. Figure 7 presents the average running time of each method over all three data sets. Of all five methods, GB is the most efficient in time and memory usage; its running time for one video does not significantly change as the number of supervoxels increases. Meanshift is the second most efficient method. Interestingly, neither Meanshift nor GB performs best in any of the quality measures—there is an obvious trade-off between the computational cost of the methods and the quality of their output (in terms of our metrics). The two slowest methods, GBH and SWA, consistently perform best in our quality metrics. Despite its faster running time, the memory consumption of SWA is nearly five times that of GBH (yet it still fits in 16 GB of memory).

Figure 7. Comparison of the running time among all five methods with parameters set to output about 500 supervoxels, averaged over all three data sets (axes: running time in log seconds vs. video length in frames). The comparison is not exactly apples to apples: the SWA and GBH methods output all layers of the hierarchy, and the Nyström method is in Matlab whereas the others are in C++. Nevertheless, the trend expounds the trade-off between computational expense and quality of supervoxel output (i.e., the GBH and SWA methods consistently perform best in our metrics and have the longest running time).

5. Discussion and Conclusion

We have presented a thorough evaluation of five supervoxel methods on four 3D volumetric performance metrics designed to evaluate the supervoxel desiderata. Samples from the data sets segmented under all five methods are shown in Figure 8; we have selected videos of different qualities to show in this figure. These visual results convey the overall findings we have observed in the quantitative experiments. Namely, two of the hierarchical methods (GBH and SWA) perform better than the others at preserving object boundaries. The Nyström supervoxels are the most compact and regular in shape, and the SWA supervoxels observe a similar compactness but seem to better adapt to object boundaries (recall that SWA and Nyström are both normalized cut solvers).
way hierarchical the the compute is be- SWA both and they distinction which GBH main of the performance better that the evident hind seems boundaries It object Nystr solvers). to and cut SWA adapt that to (recall seem better original but (the Each compactness presentation display. boundaries. for electronic have ilar an redacted regions on been many zoomed have set images Faces these data viewing time. Chen recommend over the We maintained for segmentations). been are and the have these that computing boundary videos and when a those processed color has in were distinct object boundaries videos its human-drawn one represent with only to bottom-left rendered set drawn Gatech, is data is from supervoxel SegTrack line is black top-left the supervoxels); A for 500 SegTrack. note roughly from (with annotated; are videos right-two four and on Chen, methods from five is the of results comparative Visual 8. Figure Nystrom Meanshift GBH GB SWA

In this paper, we have explicitly studied the general supervoxel desiderata and avoided any direct application for scope. The obvious question to ask is how well the findings on these general metrics will translate to application-specific ones, such as tracking and activity recognition. A related additional point is that we have not studied the trade-off between the quality of the supervoxel output and the run-time of the method used to generate it; for example, in real-time streaming applications, GB or Meanshift may strike the appropriate balance. We plan to study these important questions in future work.

Acknowledgements. This work was partially supported by the National Science Foundation CAREER grant (IIS-0845282), the Army Research Office (W911NF-11-1-0090), the DARPA Mind's Eye program (W911NF-10-2-0062), and the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20069. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, DARPA, ARO, NSF or the U.S. Government.

References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. TPAMI, 33(5):898–916, 2011.
[2] W. Brendel and S. Todorovic. Video object segmentation by tracking regions. In ICCV, 2009.
[3] I. Budvytis, V. Badrinarayanan, and R. Cipolla. Semi-supervised video segmentation using tree structured graphical models. In CVPR, 2011.
[4] A. Y. C. Chen and J. J. Corso. Propagating multi-class pixel labels throughout video frames. In Proc. of Western NY Image Proc. Workshop, 2010.
[5] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. TPAMI, 24(5):603–619, 2002.
[6] J. J. Corso, E. Sharon, S. Dube, S. El-Saden, U. Sinha, and A. Yuille. Efficient multilevel brain tumor segmentation with integrated bayesian model classification. IEEE Transactions on Medical Imaging, 27(5):629–640, 2008.
[7] D. DeMenthon and R. Megret. Spatio-temporal segmentation of video by hierarchical mean shift analysis. In Statistical Methods in Video Proc. Workshop, 2002.
[8] F. Drucker and J. MacCormick. Fast superpixels for video analysis. In IEEE WMVC, 2009.
[9] C. Erdem, B. Sankur, and A. Tekalp. Performance measures for video object segmentation and tracking. TIP, 13(7):937–951, 2004.
[10] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167–181, 2004.
[11] C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nyström method. TPAMI, 26(2):214–225, 2004.
[12] C. Fowlkes, S. Belongie, and J. Malik. Efficient spatiotemporal grouping using the Nyström method. In CVPR, pages 231–238, 2001.
[13] K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. TIT, 21:32–40, 1975.
[14] H. Greenspan, J. Goldberger, and A. Mayer. Probabilistic space-time video modeling via piecewise GMM. TPAMI, 26(3):384–397, 2004.
[15] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In CVPR, 2010.
[16] A. Hanbury. How do superpixels affect image segmentation? In Proc. of Iberoamerican Cong. on Pattern Recognition, 2008.
[17] X. He, R. S. Zemel, and D. Ray. Learning and incorporating top-down cues in image segmentation. In ECCV, 2006.
[18] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. ACM TOG, 24(3):577–584, 2005.
[19] S. Khan and M. Shah. Object based segmentation of video using color, motion, and spatial information. In CVPR, volume 2, pages 746–751, 2001.
[20] I. Laptev. On space-time interest points. IJCV, 2005.
[21] Y. J. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In ICCV, 2011.
[22] A. Levinshtein, A. Stere, K. N. Kutulakos, D. J. Fleet, S. J. Dickinson, and K. Siddiqi. Turbopixels: Fast superpixels using geometric flows. TPAMI, 31(12):2290–2297, 2009.
[23] M.-Y. Liu, O. Tuzel, S. Ramalingam, and R. Chellappa. Entropy rate superpixel segmentation. In CVPR, 2011.
[24] S. Liu, G. Dong, C. H. Yan, and S. H. Ong. Video segmentation: Propagation, validation and aggregation of a preceding graph. In CVPR, 2008.
[25] R. Megret and D. DeMenthon. A survey of spatio-temporal grouping techniques. Technical report, UMD, 2002.
[26] A. P. Moore, S. J. D. Prince, J. Warrell, U. Mohammed, and G. Jones. Superpixel lattices. In CVPR, 2008.
[27] G. Mori, X. Ren, A. A. Efros, and J. Malik. Recovering human body configurations: Combining segmentation and recognition. In CVPR, volume 2, pages 326–333, 2004.
[28] S. Paris. Edge-preserving smoothing and mean-shift segmentation of video streams. In ECCV, 2008.
[29] S. Paris and F. Durand. A topological approach to hierarchical segmentation using mean shift. In CVPR, 2007.
[30] N. V. Patel and I. K. Sethi. Video shot detection and characterization for video databases. PR, 30(4):583–592, 1997.
[31] X. Ren and J. Malik. Learning a classification model for segmentation. In ICCV, volume 1, pages 10–17, 2003.
[32] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. TPAMI, 19(5):530–535, 1997.
[33] E. Sharon, A. Brandt, and R. Basri. Fast multiscale image segmentation. In CVPR, volume I, pages 70–77, 2000.
[34] E. Sharon, M. Galun, D. Sharon, R. Basri, and A. Brandt. Hierarchy and adaptivity in segmenting visual scenes. Nature, 442(7104):810–813, 2006.
[35] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22(8):888–905, 2000.
[36] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(2):2–23, 2009.
[37] J. Tighe and S. Lazebnik. Superparsing: Scalable nonparametric image parsing with superpixels. In ECCV, 2010.
[38] D. Tsai, M. Flagg, and J. M. Rehg. Motion coherent tracking with multi-label MRF optimization. In BMVC, 2010.
[39] A. Vazquez-Reina, S. Avidan, H. Pfister, and E. Miller. Multiple hypothesis video segmentation from superpixel flows. In ECCV, 2010.
[40] O. Veksler, Y. Boykov, and P. Mehrani. Superpixels and supervoxels in an energy optimization framework. In ECCV, 2010.
[41] L. Vincent and P. Soille. Watersheds in digital spaces: An efficient algorithm based on immersion simulations. TPAMI, 13(6):583–598, 1991.
[42] J. Wang, B. Thiesson, Y. Xu, and M. Cohen. Image and video segmentation by anisotropic kernel mean shift. In ECCV, volume 2, pages 238–249, 2004.
[43] G. Zeng, P. Wang, J. Wang, R. Gan, and H. Zha. Structure-sensitive superpixels via geodesic distance. In ICCV, 2011.