Quadtree Convolutional Neural Networks

Pradeep Kumar Jayaraman, Jianhan Mei, Jianfei Cai, and Jianmin Zheng

Nanyang Technological University, Singapore {pradeepkj@, jianhan001@e., ASJFCai@, ASJMZheng@}ntu.edu.sg

Abstract. This paper presents a Quadtree Convolutional Neural Network (QCNN) for efficiently learning from image datasets representing sparse data such as handwriting, pen strokes, freehand sketches, etc. Instead of storing the sparse sketches in regular dense tensors, our method decomposes and represents the image as a linear quadtree that is only refined in the non-empty portions of the image. The actual image data corresponding to non-zero pixels is stored in the finest nodes of the quadtree. Convolution and pooling operations are restricted to the sparse pixels, leading to better efficiency in computation time as well as memory usage. Specifically, the computational and memory costs in QCNN grow linearly in the number of non-zero pixels, as opposed to traditional CNNs where the costs are quadratic in the image resolution. This enables QCNN to learn from sparse images much faster and to process high resolution images without the memory constraints faced by traditional CNNs. We study QCNN on four sparse image datasets for sketch classification and simplification tasks. The results show that QCNN can obtain comparable accuracy with a large reduction in computational and memory costs.

Keywords: quadtree, neural network, sparse convolution

1 Introduction

Convolutional neural networks (CNNs) are a powerful and popular method for various tasks involving the analysis of images, videos and three-dimensional objects. Most real world image data such as natural photographs or volumetric meshes can be represented as dense tensors, and indeed, conventional CNNs were originally proposed to optimally learn features from such data by local weight connections and parameter sharing. On the other hand, some datasets are sparse in nature. For example, images representing freeform 2D sketches and handwriting only consist of a set of one-dimensional lines occupying a sparse subset of the 2D image plane, while point clouds or triangle meshes are only defined in a small subset of the 3D space. Unfortunately, most traditional CNN architectures, particularly for images, are unable to exploit the sparsity of such data, and learning from such datasets is unnecessarily inefficient in both training time and memory consumption. This is particularly of concern with the rise of deep networks that are increasingly applied to high resolution sparse images, in applications such as sketch simplification [1], as well as deployed on resource-scarce mobile or embedded devices.

In the case of 3D data, convolutional neural networks were originally applied by voxelizing the mesh into dense 3D tensors [2]. Due to memory and computational constraints, however, this approach does not scale to high resolutions. To alleviate this, recent works such as OctNets [3], O-CNN [4], and OGN [5] decompose the 3D meshes hierarchically into octrees and adapt CNN operations to consider this special structure. Inspired by such 3D works, we present in this paper a quadtree convolutional neural network (QCNN) for efficiently learning from sparse 2D image datasets. While those 3D networks were designed to deal with 3D shapes represented by meshes or point clouds, we target general sparse images which usually have more arbitrary structure or topology. This will enable the developed method to have wide applications, especially on mobile devices where computing power and memory are limited.

Our main idea is to decompose sparse images into quadtrees, store the non-zero image pixels in the finest nodes, and design a special data representation that takes the characteristics of CPUs and GPUs into consideration. The computational effort is then concentrated on the areas of interest, which avoids storing empty pixels that provide no meaningful information and thus reduces memory consumption. We start with the finest level of the quadtree and perform convolutions on these nodes to compute the features, followed by pooling which downsamples the features and propagates them to the next coarser quadtree level. This operation can be stacked multiple times before finally obtaining the network output with respect to some predefined target and loss function.

Our approach has several advantages in terms of efficiency. First, since we only store non-zero pixels of the image in the bottommost level of the sparse quadtree, the storage and computational requirements are linear in the number of non-zero pixels and completely independent of image resolution. Second, it is well known that modern CPUs and GPUs are highly efficient at processing data that are contiguous in memory. Hence we use a linear quadtree representation where each level of the quadtree is stored as a linear 1D array by indexing the quadtree nodes with space-filling z-order curves. Third, we adapt CNN operations to the quadtree by considering this special data representation. Convolution, which requires neighborhood access for each quadtree node at the same depth, is achieved via an efficient lookup table based scheme using the Moser-de Bruijn sequence, and pooling is as simple as assigning the maximum/average of every four children nodes to their parent node. We demonstrate the efficiency of QCNN in terms of computational effort on several sparse image datasets for classification and sketch simplification.

2 Related Work

2.1 Hierarchical Data Representation

The quadtree [6] and octree [7] are hierarchical representations of 2D and 3D spatial data, respectively, and generalizations of the binary tree. They have been extensively used in various graphics and image processing applications such as collision detection, ray tracing, and level-of-detail [8–10]. It is common to implement quadtrees and octrees using pointers. However, for representing data hierarchically for CNN training purposes this is infeasible, since CPUs and GPUs are efficient at processing contiguous array data. Linear quadtrees or octrees [11], where 2D/3D node indices in each level of the tree are converted to 1D indices using space-filling curves, are more relevant to our application.

2.2 Sparse Convolutional Neural Networks

Exploiting sparsity allows higher resolution inputs to be processed efficiently. However, there are only a few network architectures that exploit sparsity. Initially, CNNs were employed to process 3D data by voxelizing the meshes into dense volumetric tensors [2]. Since this representation has a high computation and memory cost, the input resolution had to be restricted to around 30^3. Graham proposed a sparse version of the CNN for 2D image [12] and 3D voxel [13] data that only performs convolutions on non-zero sites and their neighbors within the receptive field of the kernel. Nevertheless, the approach becomes inefficient when a large number of convolution layers are placed between the pooling layers, since the feature map dilates after each convolution. The feature dilation problem was recently handled by Graham and van der Maaten [14] by restricting convolutions to the non-zero sites only. Their works require additional book-keeping for indexing the non-zero pixels of each layer's output, as well as efficient hash table implementations.

Quadtree/octree structures, on the other hand, can be computed in one shot and clearly define the structure of the data beforehand, independently of the convolution or pooling parameters in the network. Additionally, they can be linearly represented as a simple contiguous array, thanks to their regular structure. Moreover, simply conforming the feature maps to the quadtree/octree structure is sufficient to largely prevent feature dilation. To support high resolution 3D data, Riegler et al. [15] combined the octree with a grid structure, and limited CNN operations to the interior volume of 3D shapes. While this is efficient compared to using dense voxels, storing the interior volume of 3D surface data is still wasteful. Wang et al. [4] only considered the surface voxels of the 3D data in the octree representation and drastically improved the memory and computational costs of CNN operations. Similar to these octree-based networks, our work introduces the quadtree structure for efficiently learning from sparse image data.

3 Quadtree Convolutional Neural Network

3.1 Motivation

Consider a general scenario where a dense n-dimensional tensor is used to represent some input that is to be fed into a convolutional neural network. This tensor could represent grayscale images (n = 2), color images (n = 3), voxels from 3D point clouds or surfaces (n = 3), etc. Sparsity arises in an n-dimensional tensor whenever it is used to represent data that occupies only a lower-dimensional subset of its domain, for example one-dimensional pen strokes embedded in the 2D image plane.

3.2 Representing Images as Linear Quadtrees

We use a linear quadtree to decompose the input image, where the nodes at each level are stored in a contiguous array for convenient and efficient processing on both the CPU and GPU, as opposed to a pointer based quadtree. An image of dimension 2^ℓ × 2^ℓ can be decomposed into an ℓ-level quadtree. Each of the nodes at level l ∈ [1, ..., ℓ] can be represented as a list of 1D indices. A common strategy to linearize indices is the interleaved bit representation. For example, given a 2D index, say (x = 4, y = 5), of a quadtree node from level 3, which is (100₂, 101₂) in binary, the linear quadtree index is given by interleaving the binary digits corresponding to y and x alternately, yielding 110010₂ = 50. This linearization has two advantages. First, it is locality preserving, as opposed to row-column indexing, and ensures higher cache hit rates when looking up neighbors during CNN operations, since they are mapped to nearby locations in 1D. Second, this indexing maps every four sibling quadtree nodes sequentially in 1D memory, which leads to easy and efficient implementations of downsampling/upsampling operations.

 0  1  4  5 16 17 20 21
 2  3  6  7 18 19 22 23
 8  9 12 13 24 25 28 29
10 11 14 15 26 27 30 31
32 33 36 37 48 49 52 53
34 35 38 39 50 51 54 55
40 41 44 45 56 57 60 61
42 43 46 47 58 59 62 63

Fig. 1: Z-order indexing: 2D quadtree nodes (level l = 3 in this example) are linearized into a 1D array using Z-order curves, as shown by the index values in the grid above. The 1D indices along the topmost row are a Moser-de Bruijn sequence (0, 4^0, 4^1, 4^0 + 4^1, ...), and those along the leftmost column are simply the same sequence scaled by 2.

Note that the interleaved coordinate representation is a space-filling z-order curve of order l which maps 2D coordinates to 1D indices. One can observe that the path of the z-order curve follows a regular sequence. The top row of Fig. 1 is a Moser-de Bruijn sequence, in which each number is a sum of distinct powers of 4, and the left column is the same sequence scaled by 2. We generate a 1D lookup table for the sequence in the top row, t : Z≥0 → Z≥0, defined as t(0) = 0 and t(i) = (t(i − 1) + 0xaaaaaaab) & 0x55555555, assuming that the quadtree node indices are represented with 32-bit unsigned integers. From this, the z-order index can be obtained as z(x, y) = t(x) | (t(y) << 1), where | and << are the bitwise or and left shift operators, respectively. For example, z(4, 2) = t(4) | (t(2) << 1) = 16 | (4 << 1) = 24. This lookup table is generated once with size 2^ℓ, the width/height of the quadtree at the maximum depth, and reused for all computations, including those at the coarser levels.
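For concreteness, the construction above can be written in a few lines of C++. The following is a minimal sketch (the function names are ours, chosen for illustration, not taken from the paper's implementation):

#include <cstddef>
#include <cstdint>
#include <vector>

// Build the Moser-de Bruijn lookup table t of size 2^L, where L is the maximum quadtree depth.
std::vector<uint32_t> buildLookupTable(int L) {
    std::vector<uint32_t> t(std::size_t(1) << L);
    t[0] = 0;
    for (std::size_t i = 1; i < t.size(); ++i)
        t[i] = (t[i - 1] + 0xaaaaaaabu) & 0x55555555u;  // next sum of distinct powers of 4
    return t;
}

// Interleave the bits of (x, y) into a 1D z-order index using the lookup table.
uint32_t zOrder(const std::vector<uint32_t>& t, uint32_t x, uint32_t y) {
    return t[x] | (t[y] << 1);
}

With t built for ℓ = 3, zOrder(t, 4, 2) evaluates to 24, matching the example in the text.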

3.3 Quadtree CNN Operations

Fig. 2: Data structure for representing the image as a linear quadtree to support CNN operations. Left: quadtree generated for a U-shape image (numbers represent z-order indices), where gray nodes contain non-zero pixels in their bounds, while white nodes contain only zero pixels. Gray nodes are subdivided in the next level. Middle: the index array contains the corresponding z-order indices; the offset array contains a monotonic sequence starting from 1 for gray nodes, while white nodes are set to 0. Right: the data array holds the feature maps in each quadtree level.

Data structure To facilitate CNN operations such as convolution and pooling on the quadtree, we employ a custom data structure that is different from the tensors commonly employed in convolutional neural networks. This is necessary since, unlike traditional grayscale images which store pixel data in a single 2D array, the linear quadtree stores the indices of quadtree nodes with non-zero pixels as a hierarchy of 1D arrays, and the pixel values themselves as a single 1D array corresponding to the deepest quadtree level. Moreover, since we only subdivide the non-empty quadtree nodes, we store an additional array, similar to O-CNN [4], to record the parent-child relationship (see Fig. 2):

Index array I: stores the z-order indices of all the nodes in the quadtree, level-wise. We denote by Il[i] the index of node i at level l. It is mainly used to look up the indices of non-zero pixels and restrict convolutions to these nodes so as to extract features efficiently.

Offset array O: stores a monotonic sequence of integers starting from 1 to mark nodes that are to be subdivided (i.e., gray nodes in Fig. 2). If a node i is not subdivided (white nodes in Fig. 2), its corresponding value is set to 0. The offset array has the same size as the index array, i.e., one value for each node, and we denote by Ol[i] the offset of node i at level l. We use Ol for pooling features from children nodes at level l + 1 to parent nodes at level l, and for upsampling features from parent nodes at level l to children nodes at level l + 1. For example, O2[5] = 3 means that the 3rd set of quadruples (nodes 20–23 in I3) in level 3 are the children of node 5 in I2 at level 2. The index and offset arrays are generated once from the input image, and remain fixed afterwards throughout training/testing.


Fig. 3: Illustration of CNN operations on quadtrees. (a) Convolutions are restricted to the sparse quadtree nodes, and produce feature maps that fit in the same quadtree level. (b) Pooling downsamples the feature maps so that they fit in the previous (coarser) quadtree level. (c) Upsampling resizes the feature maps so that they fit in the next (finer) quadtree level.

Data array D: contains the feature maps at each level of the quadtree and is initially only available for the deepest quadtree level ℓ after performing convolution. It contains the values corresponding to the nodes indexed by Iℓ in an array of dimensions d × e, where d is the number of channels and e is the number of stored quadtree nodes. The data arrays corresponding to the other quadtree levels are eventually generated by the pooling operation, which downsamples the feature maps.
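The three arrays can be pictured as a small per-level record. The following C++ sketch shows one possible layout (the struct and field names are ours, chosen for illustration, not taken from the paper's code):

#include <cstddef>
#include <cstdint>
#include <vector>

// One level of the linear quadtree.
struct QuadtreeLevel {
    std::vector<uint32_t> index;   // I_l: z-order indices of the stored nodes at this level
    std::vector<uint32_t> offset;  // O_l: 1-based rank among subdivided (gray) nodes, 0 for white nodes
    std::vector<float>    data;    // D_l: d x e feature maps, stored channel-major (data[c * e + i])
};

// Because nodes are linearized in z-order, the four children of a subdivided node i at
// level l occupy four consecutive slots at level l + 1, starting at 4 * (O_l[i] - 1).
inline uint32_t firstChild(const QuadtreeLevel& level, std::size_t i) {
    return 4u * (level.offset[i] - 1u);   // only meaningful when level.offset[i] != 0
}

In the example of Fig. 2, applying firstChild to node 5 at level 2 (offset 3) returns slot 8, i.e. the position of node 20 in I3.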

Minibatch representation When using dense tensors, minibatches are created by concatenating a set of images along one of the axes; this operation requires all axes in the tensor which are smaller than the minibatch axis to agree in dimensions. Since quadtree structures vary among different images, we instead update the index arrays Il of each of the B quadtrees in a minibatch by adding 4^l · b, where b ∈ [0 .. B − 1]. We then concatenate them together level-wise to form a single larger quadtree representing the entire batch. From this, individual quadtrees can be identified by computing b = ⌊Il[i] / 4^l⌋, while the local indices are given by Il[i] mod 4^l. Similarly, the offset arrays Ol of each quadtree are updated to form a monotonic sequence throughout the batch and concatenated level-wise. Having defined the data structure, we now discuss the adaptation of various CNN operations from images to quadtrees.
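A minimal sketch of this index offsetting, under the same hypothetical layout as above (names are ours):

#include <cstddef>
#include <cstdint>
#include <vector>

// Merge the per-image index arrays at level l into one batched index array by
// offsetting image b's indices by b * 4^l, so that indices stay unique across the batch.
std::vector<uint32_t> batchIndices(const std::vector<std::vector<uint32_t>>& perImage, int l) {
    const uint32_t stride = uint32_t(1) << (2 * l);     // 4^l possible indices per image at level l
    std::vector<uint32_t> batched;
    for (std::size_t b = 0; b < perImage.size(); ++b)
        for (uint32_t idx : perImage[b])
            batched.push_back(idx + uint32_t(b) * stride);
    return batched;
}
// Recovering the image id and local index: b = I_l[i] / 4^l, local = I_l[i] % 4^l.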

Convolution The convolution is the most important and expensive operation in convolutional neural networks. We implement convolution on the quadtree as a single matrix multiplication [16], which can be carried out by highly optimized matrix multiplication libraries on the CPU and GPU. The coefficients of the filters are arranged row-wise in a c × df^2 matrix, and the pixels in the receptive field around each of the e quadtree nodes D_l^in[·, ·] in each input channel are arranged column-wise in a df^2 × e matrix, where d is the number of input channels, c is the number of output channels, f is the filter size, and e is the number of quadtree nodes in level l:

D_l^{out} :=
\begin{bmatrix}
w^0_0 & w^0_1 & w^0_2 & \cdots \\
w^1_0 & w^1_1 & w^1_2 & \cdots \\
\vdots & \vdots & \vdots & \\
w^{c-1}_0 & w^{c-1}_1 & w^{c-1}_2 & \cdots
\end{bmatrix}
\begin{bmatrix}
\vdots & \vdots & & \vdots \\
D_l^{in}[0,0] & D_l^{in}[0,1] & \cdots & D_l^{in}[d-1,0] \\
D_l^{in}[1,0] & D_l^{in}[1,1] & \cdots & D_l^{in}[d-1,1] \\
\vdots & \vdots & & \vdots
\end{bmatrix}

Here, the superscript in w runs over [0 .. c − 1], indexing the filters, while the subscript runs over [0 .. df^2 − 1]. This is different from traditional CNNs in two ways: first, only the non-zero quadtree nodes participate in convolutions, and second, the neighborhood lookup is different since the quadtree nodes are linearized into a 1D array in z-order. In detail, a node at index z in 1D can be deinterleaved once to obtain the 2D index (x, y). From this, it is straightforward to compute the neighbors in constant time using the lookup table t(·) introduced in Section 3.2. If the neighbor index is present in the index array Il, then its pixel value from Dl is assigned to the corresponding coefficient in the matrix above; otherwise it is set to 0, assuming a black background. Additionally, since the quadtree structure is fixed for each input sample, we can precompute the neighbors once and reuse them for convolution operations throughout the network for even better efficiency.

We begin applying the convolution operation on the data Dℓ stored in the finest quadtree nodes at level ℓ to obtain c output feature maps (see Fig. 3a), which fit in the same quadtree level if a unit stride is used. It is not possible to use arbitrary strides in QCNN since the output would then not conform to the quadtree structure. However, strides that are powers of 2 can be supported: for example, convolving the input data residing at level l with a stride of 2^s, where s ∈ [0 .. l), will result in an output that fits in level l − s. We only use unit stride in all our experiments and leave downsampling to the pooling layers.
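The following sketch illustrates the neighborhood gathering for one node, i.e. how one column of the df^2 × e patch matrix could be filled. It reuses the lookup table from the earlier sketch; the deinterleaving routine and the hash-map neighbor lookup are our own simplifications of the scheme described above (the paper looks neighbors up in the index array Il), so treat this as illustrative only:

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Extract x (even bits) or y (pass z >> 1) from a z-order index.
static uint32_t compactBits(uint32_t v) {
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0f0f0f0fu;
    v = (v | (v >> 4)) & 0x00ff00ffu;
    v = (v | (v >> 8)) & 0x0000ffffu;
    return v;
}

// Fill one column (d * f * f entries) of the patch matrix for the node with z-order index z.
// `position` maps a z-order index to its slot in I_l / D_l; missing neighbors contribute 0
// (black background). `data` is D_l with layout data[c * e + node]; f is assumed odd.
void gatherPatchColumn(uint32_t z, int f, int d, int gridSize, std::size_t e,
                       const std::vector<uint32_t>& t,
                       const std::unordered_map<uint32_t, std::size_t>& position,
                       const std::vector<float>& data,
                       std::vector<float>& column) {
    const int x = int(compactBits(z)), y = int(compactBits(z >> 1)), r = f / 2;
    std::size_t k = 0;
    for (int c = 0; c < d; ++c)
        for (int dy = -r; dy <= r; ++dy)
            for (int dx = -r; dx <= r; ++dx, ++k) {
                const int nx = x + dx, ny = y + dy;
                float v = 0.f;
                if (nx >= 0 && ny >= 0 && nx < gridSize && ny < gridSize) {
                    const uint32_t nz = t[nx] | (t[ny] << 1);   // z-order index of the neighbor
                    auto it = position.find(nz);
                    if (it != position.end()) v = data[std::size_t(c) * e + it->second];
                }
                column[k] = v;
            }
}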

Pooling A common and important operation in convolutional neural networks is pooling, used to downsample the feature maps and aggregate information as we go deeper into the network. As demonstrated by Springenberg et al. [17], pooling can be achieved by convolving with non-unit strides without any loss of accuracy. However, since the pooling operation is generally more efficient than convolution, we implement it as follows. Pooling in QCNNs is particularly simple: we only need to assign the maximum (or average) of the 4 children nodes at level l + 1 to their corresponding parent node at level l, see Fig. 3b. This is easy to implement since the quadtree nodes are linearized in z-order, and all 4 children nodes corresponding to a parent are stored in succession:

D_l[i] := pool({D_{l+1}[4(O_l[i] − 1) + j]}), j ∈ [0 .. 3],

where the pool(·) function computes the maximum or average of the set.
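A direct implementation of this rule, using the hypothetical QuadtreeLevel layout from the earlier sketch:

#include <algorithm>
#include <cstddef>

// Max-pool features from level l+1 (child) to level l (parent):
// D_l[i] = max over the four children stored at slots 4*(O_l[i]-1) .. 4*(O_l[i]-1)+3.
void maxPoolLevel(const QuadtreeLevel& child, QuadtreeLevel& parent, int channels) {
    const std::size_t eParent = parent.index.size();
    const std::size_t eChild  = child.index.size();
    parent.data.assign(std::size_t(channels) * eParent, 0.f);
    for (std::size_t i = 0; i < eParent; ++i) {
        if (parent.offset[i] == 0) continue;                  // white node: no children below it
        const std::size_t base = 4u * (parent.offset[i] - 1u);
        for (int c = 0; c < channels; ++c) {
            float m = child.data[std::size_t(c) * eChild + base];
            for (std::size_t j = 1; j < 4; ++j)
                m = std::max(m, child.data[std::size_t(c) * eChild + base + j]);
            parent.data[std::size_t(c) * eParent + i] = m;
        }
    }
}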

Upsampling Another common operation in convolutional neural networks is upsampling the feature maps, which can be used for visualization [18] or as part of autoencoder-like, or U-shaped network architectures [19] where features are concisely encoded using pooling and decoded using deconvolution, unpooling or upsampling. In this work, we implement upsampling by resizing the feature maps to a higher resolution, which is followed by a learnable convolution layer. Upsampling in QCNN is performed by traversing the quadtree in the opposite direction compared to pooling, i.e., we assign the value of the parent node in level l − 1 to all 4 children in level l:

D_l[4(O_{l−1}[i] − 1) + j] := D_{l−1}[i], j ∈ [0 .. 3],

which roughly corresponds to nearest neighbor interpolation, see Fig. 3c. With these fundamental operations defined on the quadtree, it is straightforward to compose them to design commonly used CNN architectures.
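For completeness, here is the mirror image of the pooling sketch above (same hypothetical layout; a learnable convolution would follow this copy step, as described in the text):

#include <cstddef>

// Upsample features from level l-1 (parent) to level l (child): every child node
// receives its parent's value, i.e. nearest neighbor interpolation on the quadtree.
void upsampleLevel(const QuadtreeLevel& parent, QuadtreeLevel& child, int channels) {
    const std::size_t eParent = parent.index.size();
    const std::size_t eChild  = child.index.size();
    child.data.assign(std::size_t(channels) * eChild, 0.f);
    for (std::size_t i = 0; i < eParent; ++i) {
        if (parent.offset[i] == 0) continue;                  // no children stored for this node
        const std::size_t base = 4u * (parent.offset[i] - 1u);
        for (int c = 0; c < channels; ++c)
            for (std::size_t j = 0; j < 4; ++j)
                child.data[std::size_t(c) * eChild + base + j] = parent.data[std::size_t(c) * eParent + i];
    }
}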

4 Experiments

We demonstrate the efficiency and versatility of our QCNN by conducting experiments on sparse image datasets for supervised classification and sketch simplification. We study the performance as well as the training behaviour of QCNNs compared to CNNs and show that QCNNs are well-behaved and converge to results achieved with traditional CNNs with much less computation time and memory. Our implementation is in C++ and CUDA (built upon the Caffe [20] framework), and runs on an NVIDIA GTX 1080 GPU with 8GB memory.

For brevity, in the following we denote a convolution unit by Cl(c), which is composed of: (1) a quadtree convolutional layer with 3×3 filters that accepts data D_l^in from a quadtree at level l as input and outputs c feature maps D_l^out fitting in the same level, (2) a batch normalization layer that normalizes the minibatch using the mean and standard deviation computed from the elements of D_l^out, and (3) a rectified linear unit activation function [21]: ReLU(x) = max(0, x). We denote by Pl a quadtree max pooling layer that downsamples the feature maps from level l to l − 1, and by Ul a quadtree upsampling layer that resizes feature maps from level l to l + 1. After completing all the quadtree convolutional and pooling operations in the network, we apply a quadtree padding operation, denoted "pad", to convert the sparse quadtree based feature maps into dense ones by zero padding the empty quadtree nodes and reshaping the quadtree minibatch of size B with d channels into a 4D dense tensor of dimensions B × d × 2^l × 2^l. This is necessary to align the features computed for different images before feeding them to the fully-connected layers, since their quadtree structures are different.
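The pad operation is essentially a scatter from the sparse node list into a dense grid. A minimal sketch, reusing compactBits and the QuadtreeLevel layout from the earlier sketches (again, our own illustrative code, not the paper's implementation):

#include <cstddef>
#include <cstdint>
#include <vector>

// Scatter the sparse level-l feature maps of a batch into a dense B x d x 2^l x 2^l tensor,
// leaving zeros wherever no quadtree node is stored.
std::vector<float> quadtreePad(const QuadtreeLevel& lvl, int channels, int l, std::size_t batchSize) {
    const uint32_t side = uint32_t(1) << l;
    const uint32_t perImage = uint32_t(1) << (2 * l);             // 4^l local indices per image
    const std::size_t e = lvl.index.size();
    std::vector<float> dense(batchSize * channels * side * side, 0.f);
    for (std::size_t i = 0; i < e; ++i) {
        const std::size_t b = lvl.index[i] / perImage;            // image id within the batch
        const uint32_t z = lvl.index[i] % perImage;               // local z-order index
        const uint32_t x = compactBits(z), y = compactBits(z >> 1);
        for (int c = 0; c < channels; ++c)
            dense[((b * channels + c) * side + y) * side + x] = lvl.data[std::size_t(c) * e + i];
    }
    return dense;
}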

4.1 Classification

We train traditional CNNs and our quadtree CNNs with similar network archi- tectures on four sparse image datasets. Note that this experiment is mainly to study the behaviour of QCNN compared to traditional CNN and is not tuned to obtain the best accuracy.

MNIST is a popular dataset of numeric digits [0–9] consisting of grayscale images of size 28 × 28 that are split into 60,000 training and 10,000 test images. We zero pad each image to 32 × 32, and decompose them into quadtrees of level 5. The network structure is defined as: input → C5(32) → C5(32) → P5 → C4(64) → C4(64) → P4 → pad → FC(1024) → ReLU → Dropout(0.5) → FC(K) → σ, where C∗, P∗ are the quadtree based convolution units and pooling operations, respectively, defined earlier, pad is the quadtree padding operation to align features from different inputs, Dropout [22] with a rate of 0.5 is used to prevent overfitting, FC(n) is a fully-connected layer with an n-element vector as output, and σ is the softmax function that normalizes each element of its input into the range [0, 1] such that the result is a discrete probability distribution, defined as σ(x_j) = exp(x_j) / Σ_{k=1}^{n} exp(x_k), j ∈ [1 .. n]. K corresponds to the total number of classes in the dataset (10 in this case).

EMNIST Balanced [23] extends the classic MNIST dataset with more samples, including letters, and contains 112,800 training and 18,800 test images of size 28 × 28. The 26 upper and lower case letters ([A–Z], [a–z]) and 10 digits ([0–9]) are combined into a total of K = 47 balanced classes. We zero pad the images as before and decompose them into quadtrees of level 5. We train them on a network that is defined exactly as in the previous case.

CASIA-HWDB1.1 [24] is a large database of more than a million handwritten Chinese character sample images representing 3,866 classes. We experiment with a 200-class subset of the dataset comprising 48,020 training and 11,947 test images. Since the images are of varying dimensions, we rescale them to 64 × 64 and decompose them into quadtrees of level 6. The network that we use for training is:

input → C6(64) → P6 → C5(128) → P5 → C4(256) → P4 → C3(512) → P3 → pad → FC(1024) → Dropout(0.5) → FC(200) → σ.

Fig. 4: Comparison of mean test accuracy (computed from 5 runs) of QCNN and traditional CNN classifiers progressively during training on various datasets.

TU-Berlin Sketch Dataset [25] contains 20,000 images of freehand sketches drawn by non-experts, with 80 images in each of the K = 250 classes, in raster and vector formats. We split the dataset into 18,000 training and 2,000 test images, and resize each image to dimension 128 × 128. We then decompose the images into quadtrees of level 7. The network for training is defined as:

input → C7(32) → C7(32) → C7(32) → P7 → C6(64) → C6(64) → P6

→ C5(128) → P5 → C4(256) → P4 → C3(512) → P3 → C2(1024) → pad → Dropout(0.5) → FC(250) → σ.

All these datasets are particularly suitable for our QCNN since the images are sparse in nature. We use the standard cross entropy loss for classification, and optimize the first three networks with stochastic gradient descent for 20 epochs with a learning rate of 0.05, decayed by a factor of 10 after every 10 epochs; the last network is optimized with the ADADELTA method [26] for 50 epochs (which provided better results compared to SGD). Weights are initialized using the technique of Glorot and Bengio [27] and regularized with a decay factor of 5 × 10^-4, while biases are initialized to zero. We do not perform any data augmentation for simplicity.

The computational statistics of the classification experiments are summarized in Table 1. We compare our QCNN results with those obtained by a traditional CNN with the same network architecture, initial weights, and hyperparameters. It can be seen that QCNNs are highly efficient in terms of computational effort (measured as the number of multiply-accumulate operations in the convolutional layers, see Table 1) while yielding similar test accuracies. In practice, we observed that the speed advantage of QCNN is larger for deeper networks; for example, the TU-Berlin QCNN took one-third the training time of the corresponding traditional CNN. To study the behaviour of QCNN throughout the training phase, we plot the learning curves in Fig. 4, comparing the test accuracy after each epoch of training. It is apparent that QCNNs closely follow CNNs in terms of accuracy on all the datasets throughout the training phase, while being computationally efficient.

Table 1: Computational statistics for classification comparing QCNNs and CNNs, based on the mean accuracy and standard deviation at the end of training from 5 runs, and the average multiply-accumulate operations per sample (during the forward pass) in the convolutional layers.

Dataset            Non-zero     Accuracy (%)                    #MACC (×10^6)
                   pixels (%)   CNN            QCNN             CNN      QCNN
MNIST              32.82        99.48 (±0.03)  99.39 (±0.02)    18.36    6.1
EMNIST Balanced    44.15        89.53 (±0.09)  88.89 (±0.18)    18.36    9.03
CASIA-HWDB1.1      19.06        97.44 (±0.12)  97.36 (±0.08)    229.34   136.7
TU-Berlin          4.53         61.13 (±2.04)  62.44 (±1.16)    837.53   257.07

4.2 Sketch Simplification

We next study QCNN for the task of sketch simplification [1]. This application is again suitable for our method since deep convolutional neural networks are trained on relatively high resolution sparse images. We adapt the TU-Berlin sketch dataset for this experiment by synthesizing an inverse dataset consisting of sketchy rough line drawings. We utilize the SVG version of the TU-Berlin dataset, where each file represents a clean sketch drawing as a collection of paths in Bézier form. We duplicate each path 3 times and apply random affine transformations where the rotation angle and translation are drawn from Gaussian distributions with zero mean and standard deviations of 1.5° and 2, respectively. We repeat this for all SVG files in the dataset and rasterize them while setting the stroke width of the paths to 1px. Next, we decompose all the rough sketches into quadtrees and represent the corresponding clean sketches using the same quadtree structure so that both are directly comparable during training. We define an encoding-decoding QCNN similar to Simo-Serra et al. [1]:

sketchy → Cℓ(48) → Cℓ(48) → Pℓ → Cℓ−1(128) → Cℓ−1(128) → Pℓ−1
→ Cℓ−2(256) → Cℓ−2(256) → Cℓ−2(256) → Pℓ−2 → Cℓ−3(256)
→ Cℓ−3(512) → Cℓ−3(1024) → Cℓ−3(1024) → Cℓ−3(1024) → Cℓ−3(1024)
→ Cℓ−3(512) → Cℓ−3(256) → Uℓ−3 → Cℓ−2(256) → Cℓ−2(256)
→ Cℓ−2(128) → Uℓ−2 → Cℓ−1(128) → Cℓ−1(128) → Cℓ−1(48)
→ Uℓ−1 → Cℓ(48) → Cℓ(24) → conv(1) → sigmoid → clean image

We introduce skip connections from the input of each pooling layer to the output of the corresponding upsampling layer at the same level to speed up convergence. Note that we did not tune the network architecture or dataset to obtain the best results, but rather study the performance and training behaviour. We train the network for 20 epochs with the mean-squared error loss between the rough and clean sketch data stored in the quadtrees, using the ADADELTA [26] optimization method. As before, we also train a traditional CNN similarly for

this task to compare the results as well as the computational and memory usage. As shown in Fig. 5, QCNN can obtain comparable simplified sketches, while greatly reducing computation and memory use, see Table 2. Training QCNN in our setup took only around one-fourth the time compared to CNN. We used images of dimension 256 × 256 (ℓ = 8) for this experiment. However, since sketch simplification typically involves high-resolution images leading to extended training time, we also compute and provide the computation and memory usage for higher resolutions to illustrate the drastically increasing complexity, see the second and third rows in Table 2. To study the learning behaviour of QCNN, we visualize the evolution of the learning process by retrieving the weights of the models throughout the training phase and visualizing the simplified results, as shown in Fig. 6. It is apparent that the learning behaviour of QCNN is stable and quite similar to traditional CNN for this task, which involves a deep network, while being highly computation and memory efficient.

Fig. 5: Sketch simplification results obtained with traditional CNN and QCNN trained using the same network architecture.

5 Conclusion

We have presented a quadtree convolutional neural network (QCNN) that can efficiently learn from sparse image datasets. Thanks to the quadtree-based representation that decomposes the image only in the presence of non-zero pixels, storing and computing feature maps is of linear complexity in the number of non-zero pixels and independent of image resolution. QCNNs are applicable to a wide range of applications involving sparse images. In particular, we have demonstrated the use of QCNNs for sparse image classification and sketch simplification, where similar classification and simplification results are obtained, but with much lower computational complexity and memory cost than traditional convolutional neural networks.

Fig. 6: Visualization of learning process for sketch simplification evolving throughout training.

Table 2: Computational statistics for sketch simplification comparing QCNNs and CNNs, based on the average multiply-accumulate operations and memory usage per sample (forward pass) in the convolutional layers.

Resolution      #MACC (×10^8)          Memory (MB)
                CNN        QCNN        CNN        QCNN
256 × 256       724.76     137.54      127.14     13.5
512 × 512       2899.05    349.28      508.56     35.35
1024 × 1024     11596.21   917.14      2034.24    97.9

This feature makes QCNN very suitable for applications on mobile devices whose computing power is limited. In the future, we wish to study QCNNs with other network architectures, such as residual networks [28], in more detail, and to learn from extremely large datasets such as Google Quickdraw [29]. We are also interested in extending our approach to recurrent architectures to learn from sparse image sequences, and in improving the learning speed of adversarial networks for training on sketch-like datasets [30].

Acknowledgements

We thank the anonymous reviewers for their constructive comments. This research is supported by the National Research Foundation under Virtual Singapore Award No. NRF2015VSG-AA3DCM001-018, and the BeingTogether Centre, a collaboration between Nanyang Technological University (NTU) Singapore and the University of North Carolina (UNC) at Chapel Hill. The BeingTogether Centre is supported by the National Research Foundation, Prime Minister's Office, Singapore under its International Research Centres in Singapore Funding Initiative. This research is also supported in part by Singapore MoE Tier-2 Grant (MOE2016-T2-2-065).

References

1. Simo-Serra, E., Iizuka, S., Sasaki, K., Ishikawa, H.: Learning to simplify: Fully convolutional networks for rough sketch cleanup. ACM Transactions on Graphics 35(4) (2016) 121:1–121:11
2. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015) 1912–1920
3. Riegler, G., Ulusoy, A.O., Geiger, A.: Octnet: Learning deep 3d representations at high resolutions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017) 6620–6629
4. Wang, P.S., Liu, Y., Guo, Y.X., Sun, C.Y., Tong, X.: O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics 36(4) (2017) 72:1–72:11
5. Tatarchenko, M., Dosovitskiy, A., Brox, T.: Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. CoRR abs/1703.09438 (2017)
6. Hunter, G.M., Steiglitz, K.: Operations on images using quad trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 1(2) (1979) 145–153
7. Jackins, C.L., Tanimoto, S.L.: Oct-trees and their use in representing three-dimensional objects. Computer Graphics and Image Processing 14(3) (1980) 249–270
8. Gervautz, M., Purgathofer, W.: A simple method for color quantization: Octree quantization. In Magnenat-Thalmann, N., Thalmann, D., eds.: New Trends in Computer Graphics, Berlin, Heidelberg, Springer Berlin Heidelberg (1988) 219–231
9. Sullivan, G.J., Baker, R.L.: Efficient quadtree coding of images and video. IEEE Transactions on Image Processing 3(3) (1994) 327–331
10. Agarwala, A.: Efficient gradient-domain compositing using quadtrees. ACM Transactions on Graphics 26(3) (2007)
11. Gargantini, I.: An effective way to represent quadtrees. Communications of the ACM 25(12) (1982) 905–910
12. Graham, B.: Spatially-sparse convolutional neural networks. CoRR abs/1409.6070 (2014)
13. Graham, B.: Sparse 3d convolutional neural networks. In Xie, X., Jones, M.W., Tam, G.K.L., eds.: Proceedings of the British Machine Vision Conference (BMVC), BMVA Press (2015) 150.1–150.9
14. Graham, B., van der Maaten, L.: Submanifold sparse convolutional networks. CoRR abs/1706.01307 (2017)
15. Riegler, G., Ulusoy, A.O., Geiger, A.: Octnet: Learning deep 3d representations at high resolutions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017) 6620–6629
16. Chellapilla, K., Puri, S., Simard, P.: High performance convolutional neural networks for document processing. In Lorette, G., ed.: Proceedings - Tenth International Workshop on Frontiers in Handwriting Recognition, Université de Rennes 1, Suvisoft (2006)
17. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.A.: Striving for simplicity: The all convolutional net. CoRR abs/1412.6806 (2014)
18. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., eds.: Computer Vision – ECCV 2014, Cham, Springer International Publishing (2014) 818–833

19. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., eds.: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Cham, Springer International Publishing (2015) 234–241
20. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
21. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML'10, USA, Omnipress (2010) 807–814
22. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (2014) 1929–1958
23. Cohen, G., Afshar, S., Tapson, J., van Schaik, A.: EMNIST: an extension of MNIST to handwritten letters. CoRR abs/1702.05373 (2017)
24. Liu, C.L., Yin, F., Wang, D.H., Wang, Q.F.: Online and offline handwritten chinese character recognition: Benchmarking on new databases. Pattern Recognition 46(1) (2013) 155–162
25. Eitz, M., Hays, J., Alexa, M.: How do humans sketch objects? ACM Transactions on Graphics 31(4) (2012) 44:1–44:10
26. Zeiler, M.D.: ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701 (2012)
27. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Volume 9 of Proceedings of Machine Learning Research., PMLR (2010) 249–256
28. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016) 770–778
29. Ha, D., Eck, D.: A neural representation of sketch drawings. CoRR abs/1704.03477 (2017)
30. Simo-Serra, E., Iizuka, S., Ishikawa, H.: Mastering sketching: Adversarial augmentation for structured prediction. CoRR abs/1703.08966 (2017)