IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 40, NO. 1, JANUARY 2018

The Manhattan Frame Model—Manhattan World Inference in the Space of Surface Normals

Julian Straub, Member, IEEE, Oren Freifeld, Member, IEEE, Guy Rosman, Member, IEEE, John J. Leonard, Fellow, IEEE, and John W. Fisher III, Member, IEEE

Abstract—Objects and structures within man-made environments typically exhibit a high degree of organization in the form of orthogonal and parallel planes. Traditional approaches utilize these regularities via the restrictive, and rather local, Manhattan World (MW) assumption which posits that every plane is perpendicular to one of the axes of a single coordinate system. The aforementioned regularities are especially evident in the surface-normal distribution of a scene, where they manifest as orthogonally-coupled clusters. This motivates the introduction of the Manhattan-Frame (MF) model, which captures the notion of an MW in the space of surface normals, the unit sphere, and of two probabilistic MF models over this space. First, for a single MF we propose novel real-time MAP inference algorithms and evaluate their performance and their use in drift-free rotation estimation. Second, to capture the complexity of real-world scenes at a global scale, we extend the MF model to a probabilistic mixture of Manhattan Frames (MMF). For MMF inference we propose a simple MAP inference algorithm and an adaptive Markov-Chain Monte-Carlo sampling algorithm with Metropolis-Hastings split/merge moves that let us infer the unknown number of mixture components. We demonstrate the versatility of the MMF model and inference algorithm across several scales of man-made environments.

Index Terms—Manhattan world, Manhattan frame, mixture of Manhattan frames, world representation, surface normal distribution, Bayesian models

J. Straub, G. Rosman, and J.W. Fisher III are with the Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA 02139. E-mail: {jstraub, rosman, fisher}@csail.mit.edu.
O. Freifeld is with the Department of Computer Science, Ben-Gurion University, Be'er Sheva 84105, Israel. E-mail: [email protected].
J.J. Leonard is with the Department of Mechanical Engineering, MIT, Cambridge, MA 02139. E-mail: [email protected].
Manuscript received 12 Jan. 2016; revised 3 Dec. 2016; accepted 9 Jan. 2017. Date of publication 31 Jan. 2017; date of current version 12 Dec. 2017.
Recommended for acceptance by D. Hoiem.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TPAMI.2017.2662686
0162-8828 © 2017 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1 INTRODUCTION

Simplifying assumptions about the structure of the surroundings facilitate reasoning about complex environments. On a wide range of scales, from the layout of a city to structures such as buildings, furniture and many other objects, man-made environments often lend themselves to a description in terms of parallel and orthogonal planes. This intuition is formalized as the Manhattan World (MW) assumption [1] which posits that most man-made structures may be approximated by planar surfaces that are parallel to one of the three principal planes of a common orthogonal coordinate system.

At a local level, this assumption holds for parts of city layouts, most buildings, hallways, offices and other man-made environments. However, the strict MW assumption cannot represent many real-world scenes on a global level: a rotated desk inside a room, more complex room and city layouts (as opposed to planned cities like Manhattan). While local parts of the scene can be modeled as an MW, the entire scene cannot. This suggests a more flexible description of a scene, one that is composed of multiple MWs of different orientations.

When reasoning about the composition of a scene in terms of MWs, we directly utilize the surface-normal distribution of the scene, rather than working with a plane segmentation. This is motivated by the observation that across a wide range of scales, man-made environments exhibit structured, low-entropy surface-normal distributions as displayed in Fig. 1. Additionally, surface-normal distributions are invariant to translation and scale [2], which makes the proposed approach largely independent of the measurement and 3D reconstruction process. Finally, surface normals are straightforward to extract from most 3D scene representations such as depth images [3], [4], point clouds [5] and meshes [6], [7].

We introduce the notion of the Manhattan Frame (MF) which represents the MW structure in the space of surface normals, i.e., the unit sphere, as orthogonally-coupled clusters. Modeling surface-normal noise with two different distributions on the sphere, we formulate two probabilistic generative MF models. Depending on the model, real-time MAP inference is carried out in closed form or via optimization on the manifold of rotation matrices $SO(3)$. Additionally, these models are used to construct a mixture of MFs (MMF) to represent complex scenes composed of multiple MFs. For the MMF model, we derive a simple MAP inference algorithm and a Gibbs-sampling-based algorithm that, using Metropolis-Hastings [12] split/merge proposals [13], adapts the number of MFs to best capture the surface-normal distribution of a scene.

We examine the properties of the proposed models and inference algorithms in a variety of qualitative and

quantitative experiments. These demonstrate the expressiveness and versatility of the MF and MMF models across scales: depth images of a single view of a scene, larger indoor reconstructions and large-scale aerial LiDAR data of an urban center.

Fig. 1. Across scales, surface normals of man-made environments exhibit characteristic patterns. This work establishes the connection between 3D Manhattan-World structures and their surface-normal distributions via the Manhattan-Frame model.

2 RELATED WORK

The different assumptions made in the literature about the environment can be categorized in terms of their expressiveness as depicted in Fig. 2. The assumptions range from mostly unrestricted representations such as point clouds, mesh and level-set, which can in the limit represent any surface exactly, to the rather strict MW assumption as indicated by the inclusion diagram. The proposed MMF assumption subsumes the Atlanta World (AW) which in turn subsumes the MW assumption. The MMF provides a directional segmentation under the orthogonality constraints imposed by the MW assumption. Relaxing the orthogonality constraints completely, we arrive at what we term the Stata Center World (SCW).¹ It captures only the directional composition of a scene. Planar scene representations differ from the SCW in that different planes with the same orientation are separated in space.

¹ See http://web.mit.edu/facilities/construction/completed/stata.html

Fig. 2. Structure assumptions about scenes in terms of expressiveness. The proposed MMF subsumes both the Atlanta and the Manhattan World models.

These different assumptions about scenes can be observed directly in the 3D structure or indirectly in the projection of the 3D structure into a camera [14]. Models and inference algorithms based on the former utilize 3D representations such as meshes, point clouds and derived data such as surface normals. Intersections of planes in 3D are lines which can be observed as lines in the image space. A Vanishing Point (VP) is the intersection of multiple such lines where the 3D lines are all parallel to each other. Models built on VPs usually use image gradient orientations directly or indirectly via line segment extraction. Specifically, the MW is manifested as orthogonally-coupled VPs (OVPs) in the image space and an MF in the surface-normal space. Multiple MWs cause multiple OVPs and MFs. The SCW can be observed via independent VPs in the image or independent surface normal clusters.

2D Image-Space. There is a vast literature on VP estimation from RGB images. The goals for VP estimation range from single-image scene parsing [15] and 3D reconstruction [16], [17], [18], [19], VP direction estimation for rotation estimation with respect to man-made environments [1], [20], [21], [22] to VP direction tracking over time to estimate camera rotation and scene structure [23], [24], [25], [26], [27].

While early VP extraction algorithms relied on image gradients [1], [8], most modern algorithms operate on line segments extracted from the image. This has been found to yield superior direction estimation results over dense image-gradient approaches [28]. Generally, VPs are extracted by intersecting lines in the image. These intersections are often found after mapping lines to the unit sphere [23], [29], [30], or into other accumulator spaces [31]. Introduced in [15], horizon estimation has emerged as a benchmark for VP estimation algorithms [32], [33].

Many VP extraction algorithms rely on the MW assumption [1], [20], [22], [25], [26], [32], [34], [35] which is manifested as three OVPs. Incorporating the MW assumption into the VP estimation algorithms not only increases estimation accuracy (if the MW assumption holds) [31] but also allows estimation of the focal length of the camera [20], [31], [32], [35], [36], [37], and rejection of spurious VP detections. Another avenue of research uses the MW assumption for single-image 3D reconstruction [16], [17], [18], [19]. The inferred MW and associations of lines to MW axes combined with geometric reasoning are used to reconstruct the 3D scene in [16], [17]. Hedau et al. [18] use an MW prior to iteratively infer the 3D room layout and segment out clutter in the room. Liu et al. [19] use a floor plan in conjunction with a set of monocular images to reconstruct whole apartments.

The AW model of Schindler et al. [8] assumes that the world is composed of multiple MWs sharing the same z-axis (which is assumed to be known). This facilitates inference from RGB images as only a single angle per MW has to be estimated as opposed to a full 3D rotation. The approach by Antunes et al. [38] infers the full MMF from RGB images. Relaxing the assumptions about the scene, VPs can be extracted independently [15], [21], [23], [24], [28], [30], [31], [33], [39] akin to the SCW assumption.

3D Representations. There are many approaches that rely purely on 3D representations of surfaces and scenes.

Assumptions, such as the MW or SCW, are used to align scenes into a common frame of reference for scene segmentation and understanding [40], [41], and to regularize 3D reconstruction [42], [43]. The AW and MMF model could be used similarly.

Similar to the work on VPs, the MW assumption has been used most commonly [40], [41] in 3D as well. This is probably due to the fact that man-made environments exhibit strong MW characteristics on a local scale, i.e., on the level of a single RGBD frame of a scene. In the application of Simultaneous Localization and Mapping (SLAM) [44], the MW assumption has been used to impose constraints on the inferred map [43]. Our original idea of the MF [9] has been adapted by Ghanem et al. [45], who propose a robust inference scheme for MF estimation (RMF), and by Joo et al. [46], who use a branch-and-bound scheme to perform real-time globally optimal MF inference (MF BB).

To the best of our knowledge the assumption of multiple MWs in the 3D data setting (as opposed to RGB 2D-images) has not been explored prior to our own work in [9], which was a preliminary and partial version of this manuscript.

Similar to the MF and MMF model, the SCW can be inferred solely from surface-normal distributions [10], [11]. Monszpart et al. [42] couple a local plane-based approach with global directional regularity constraints to regularize 3D reconstructions of man-made environments from point clouds. Gupta et al. [41] assume the only relevant direction for semantic scene segmentation is the direction of gravity to enable alignment of the ground plane across scenes. They propose a simple algorithm to segment the scene into the gravity and all other directions based on surface-normal observations. Triebel et al. [47] extract the main directions of planes in a scene using a hierarchical Expectation-Maximization (EM) approach. Using the Bayesian Information Criterion (BIC) they infer the number of main directions as well. Note that the MF and MMF model could be inferred from the SCW by grouping inferred directions into MFs.

An alternative to the MW, MMF or SCW model describes man-made structures by individual planes with no constraints on their relative normal directions. The orthogonality constraints in the MW or MMF models enable statistical pooling of measurements across different orientations. This means not only that fewer measurements (per plane) are needed to achieve the same accuracy as without those constraints, but also that reliable measurements from one or more directions help in handling cases where there are only few reliable measurements from other directions.

2D & 3D. The connection between VPs in images and 3D MW structures has been used to infer dense 3D structure from sets of images by Furukawa et al. [48]. They employ a greedy algorithm for a single-MF extraction from normal estimates that works on a discretized sphere. Neverova et al. [49] integrate RGB images with associated depth data from a Kinect camera to obtain a 2.5D representation of indoor scenes under the MW assumption. Silberman et al. [40] infer the dominant MW using VPs extracted from the RGB image and surface normals computed from the depth image.

3 THE MANHATTAN FRAME

The Manhattan Frame is the image of a 3D MW structure under the Gauss Map as depicted in Fig. 3. In other words, the MF describes the notion of the MW in the space of surface normals. In a noise-free, perfect MW the surface normals would align with the six directions collected as columns in

$$E = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & -1 \end{bmatrix}, \quad e_j \text{ denotes the } j\text{th col. of } E. \tag{1}$$

Fig. 3. An MW structure maps to an MF in the surface normal space and to three orthogonal VPs in the image plane.

In the camera coordinate system these six directions will appear rotated by $R$, an element of $SO(3)$, the space of orthonormal matrices in 3D with determinant 1

$$M = RE, \quad m_j \text{ denotes the } j\text{th col. of } M. \tag{2}$$

Note that the rotation of the camera, $R$, is unknown and hence a key parameter to be estimated by an inference algorithm. In other words, if a 3D scene consists of only planar surfaces such that the set of their surface normals is contained in the set $\{m_j\}_{j=1}^{6}$, then $M$ captures all possible directions in the scene; that is, the scene follows the MW assumption.

Specifically, let $q_i \in S^2$ denote the $i$th observed surface normal. A latent label, $z_i \in \{1, \ldots, 6\}$, assigns $q_i$ to a specific signed axis within the MF. Hence $m_{z_i}$ is the signed axis associated with $q_i$. In the following we will denote the set of all labels as $z = \{z_i\}_{i=1}^{N}$ and the set of all surface normals as $q = \{q_i\}_{i=1}^{N}$. The unit normals are elements of the unit sphere in $\mathbb{R}^3$, denoted by $S^2$, a 2D manifold whose geometry is outlined in Section 3.2.1. Commonly, in 3D processing pipelines (e.g., in surface fairing or reconstruction [50], [51]), the unit normals are estimated from noisy measurements of the 3D scene structure such as depth images [3], [4], point clouds [5] and meshes [7]. Due to these noise sources and potentially imperfect 3D MW structure, the surface normals $q$ may deviate from their associated MF axis.

In order to fit the parameters of an MF, one would seek to penalize those deviations. While, in principle, this can be formulated directly as a deterministic optimization, we adopt a probabilistic modeling approach. This allows us to derive a real-time algorithm for single MF inference as well as MCMC inference for the more complicated MMF model, all from the same base MF model. Another key advantage over deterministic approaches is that the MCMC algorithm infers all model parameters to facilitate reasoning about uncertainty, which is important for scene understanding. Furthermore, using MCMC inference, the MMF model can be integrated into larger and more complex environment models, as we showed recently in [7]. Another approach would be to utilize a non-parametric directional segmentation

algorithm such as [11] or [10] and to fit MFs to the inferred modes of the surface normal distribution. The advantage of directly inferring an MF model is that data from the different (orthogonal and opposing) directions all jointly contribute to the estimation of the MF orientation. This is especially important in scenes like the urban scene in Fig. 1 where there are only few data points for some of the directions. To this end, we propose two different noise models to describe those random deviations: tangent space Gaussian (TG) noise as well as von-Mises-Fisher (vMF) noise.

Fig. 4. Depictions of the two proposed MF noise models.

3.1 The Probabilistic MF Model

Let $R \in SO(3)$ denote the rotation of the MF. Making no a-priori assumptions about which orientation of the MF is more likely than others, $R$ is distributed uniformly

$$R \sim \mathrm{Unif}(SO(3)). \tag{3}$$

Since $SO(3)$ is a manifold with finite support, we can compute its volume and obtain $8\pi^2$ [52], which implies that all rotations have equal likelihood of $1/(8\pi^2)$.

As is standard in Bayesian mixture modeling, the MF axis assignments $z_i$ of a surface normal $q_i$ to an MF axis are assumed to be distributed according to a categorical distribution $\mathrm{Cat}(w)$ with a Dirichlet distribution prior parameterized by $\gamma$

$$z_i \sim \mathrm{Cat}(w), \quad w \sim \mathrm{Dir}(\gamma). \tag{4}$$

The deviations of the observed normals from their signed axis are modeled by a directional distribution parameterized by $\Theta$. We only require this directional distribution to have the assigned MF axis $m_{z_i}$ as its mode. Following a Bayesian approach we assume a prior $p(\Theta; \lambda)$ for the parameters

$$q_i \sim p(q_i \mid z_i, R, \Theta), \quad \Theta \sim p(\Theta; \lambda) \tag{5}$$

$$\text{s.t. } m_{z_i} = \arg\max_{q \in S^2} p(q \mid z_i, R, \Theta), \tag{6}$$

where $\lambda$ are the so-called hyperparameters of the prior. Many directional distributions exist (e.g., [53], [54], [55]) and most are valid choices for the distribution of surface normals. We focus on the tangent-space Gaussian (Section 3.2) and the von-Mises-Fisher distribution (Section 3.3) as depicted in Fig. 4.

Finally, the joint distribution for the MF model is given as

$$p(z, q, w, R, \Theta; \gamma, \lambda) = \frac{1}{8\pi^2}\, p(w; \gamma)\, p(\Theta; \lambda) \prod_{i=1}^{N} w_{z_i}\, p(q_i \mid z_i, R, \Theta). \tag{7}$$

3.2 Tangent Space Gaussian MF Model

The tangent-space Gaussian MF (TG-MF) model describes the deviations not on $S^2$ directly but in a tangent plane. To explain this concept, we touch upon some differential-geometric notions before describing the TG-MF model.

3.2.1 The Manifold of the Unit Sphere $S^2$

As alluded to earlier, surface normals lie on the unit sphere in 3D, $S^2$. This space is a 2-dimensional Riemannian manifold whose properties we review in the following.

Let $p$ and $q$ be two points on the unit sphere in 3D, $S^2$, and let $T_pS^2$ denote the tangent space to $S^2$ at point $p$. Then

$$p^T p = q^T q = 1 \quad \text{and} \quad T_pS^2 = \{x : x \in \mathbb{R}^3,\ x^T p = 0\}. \tag{8}$$

Note that while $S^2$ is a nonlinear manifold, $T_pS^2 \subset \mathbb{R}^3$ is a 2-dimensional linear space as depicted in Fig. 5. It can be shown ([56], [57]) that the length of the shortest path along the manifold, also called the geodesic, between $p$ and $q$ is given by the angle between $p$ and $q$

$$d_G(p, q) = \arccos(p^T q). \tag{9}$$

Fig. 5. Left: The unit sphere $S^2$ in 3D. The blue plane on the sphere illustrates $T_pS^2$, the tangent space to $S^2$ at point $p$. Middle and right: 2D vMF distributions.

Furthermore, the Riemannian exponential map $\mathrm{Exp}_p : T_pS^2 \to S^2$ maps a point $x$ in the tangent space $T_pS^2$ around $p$ onto the sphere $S^2$

$$x \mapsto p \cos(\|x\|) + \frac{x}{\|x\|} \sin(\|x\|). \tag{10}$$

The inverse of $\mathrm{Exp}_p$, the Riemannian logarithm map $\mathrm{Log}_p : S^2 \setminus \{-p\} \to T_pS^2$, can be computed as

$$q \mapsto (q - p \cos\theta)\, \frac{\theta}{\sin\theta}, \tag{11}$$

where $\theta = d_G(p, q)$. Since $\theta$ is computed as the angle between $p$ and $q$, it is upper bounded by $\pi$ for $q = -p$. In that case the logarithm map becomes singular and is thus defined over the entire sphere except the antipodal point $-p$. The Riemannian logarithm map can be thought of as a linearization of $S^2 \setminus \{-p\}$ to a disk with radius $\pi$ excluding the boundary of the disk.

In other words, the geodesic distance between two unit normals is the angle between them, $\mathrm{Exp}_p$ maps $T_pS^2$ onto $S^2$ and $\mathrm{Log}_p$ performs the inverse mapping. Note that $\mathrm{Exp}_p$ and $\mathrm{Log}_p$ depend on $p$. For further details and an introduction to Riemannian geometry see [57].

Note that in the formulas for the Riemannian exp and log map, points $x \in T_pS^2$ are 3D vectors in the ambient Euclidean space $\mathbb{R}^3$ fulfilling Eq. (8). In the following we express $x$ in local tangent space coordinates as $\tilde{x}$ by parallel transporting all data into the tangent space around the north pole $n = (0, 0, 1)^T$ using a rotation $R_p^n$. Vectors $R_p^n x \in T_nS^2$ all have $x_z = 1$ and hence the 2D $\tilde{x}$ is obtained from $R_p^n x$ by discarding the $z$ coordinate. In the following this whole process is implicitly assumed whenever a point on the sphere is mapped into a tangent space via the log map. Likewise, whenever the exp map is used the process is performed in the inverse direction: first augment the $z$ direction with 1, then rotate the resulting point down to the tangent space via $R_n^p = (R_p^n)^T$.

The rotation $R_p^n$ is computed via the axis-angle formulation with axis $w = \frac{n \times p}{\|n \times p\|_2}$ and angle $\theta = \arccos(n^T p)$ using Rodrigues' formula:

$$R_p^n = I + (\sin\theta)\,[w]_\times + (1 - \cos\theta)\,[w]_\times^2, \tag{12}$$

where $[w]_\times$ denotes the construction of a skew-symmetric matrix from a vector $w$ as [58]

$$[w]_\times = \begin{bmatrix} 0 & -w_3 & w_2 \\ w_3 & 0 & -w_1 \\ -w_2 & w_1 & 0 \end{bmatrix}. \tag{13}$$

Note that since $\theta [w]_\times$ is a skew-symmetric matrix, it is an element of the Lie Algebra $so(3)$ associated with $SO(3)$. The Lie-group exponential map from $so(3)$ to $SO(3)$ is equivalent to Rodrigues' formula in Eq. (12) [59].

3.2.2 The TG-MF Model

Under the TG-MF model the deviations of the observed normals from their assigned axis are modeled by a 2D zero-mean Gaussian distribution with covariance $\Theta = \Sigma \in \mathbb{R}^{2 \times 2}$ in the tangent space around the axis [78]. The conjugate prior distribution for covariance matrices is the inverse Wishart (IW) distribution [60] parameterized by $\lambda = \{\Delta \in \mathbb{R}^{2 \times 2}, \nu \in \mathbb{R}\}$

$$p(q_i \mid z_i, R, \Theta) = \mathcal{N}\big(\mathrm{Log}_{m_{z_i}}(q_i);\ 0, \Sigma_{z_i}\big), \tag{14}$$

$$p(\Theta; \lambda) = \mathrm{IW}(\Sigma_{z_i};\ \Delta, \nu), \tag{15}$$

where $\mathrm{Log}_{m_{z_i}}(q_i) \in T_{m_{z_i}}S^2$. In other words, we evaluate the probability density function (pdf) of $q_i \in S^2$ by first mapping it into $T_{m_{z_i}}S^2$ and then evaluating it under the Gaussian distribution with covariance $\Sigma_{z_i} \in \mathbb{R}^{2 \times 2}$. The pdf of the surface normals over the nonlinear $S^2$ is then induced by the Riemannian exponential map

$$q_i \sim \mathrm{Exp}_{m_{z_i}}\big(\mathcal{N}(0, \Sigma_{z_i})\big). \tag{16}$$

The range of $\mathrm{Log}_p$ is contained within a disk of finite radius ($\pi$) while the Gaussian distribution has infinite support. Consequently, for probabilistic inference, we use the inverse Wishart prior to favor small covariances, resulting in a probability distribution that, except for a negligible fraction, is within the range of $\mathrm{Log}_p$ and concentrated about the respective axis.

3.3 von-Mises-Fisher MF Model

In this section we introduce modeling the surface normals as von-Mises-Fisher [55], [61] distributed. This distribution is natively defined over the manifold of the sphere and commonly used to model directional data [11], [62], [63], [64]. It defines an isotropic distribution for $D$-dimensional directional data $q \in S^{D-1}$ around a mode $m \in S^{D-1}$ with a concentration $\tau \in \mathbb{R}$ and has the following form

$$\mathrm{vMF}(q;\ m, \tau) = Z(\tau) \exp\big(\tau\, q^T m\big), \tag{17}$$

where $Z(\tau)$ is the normalizing constant. See Fig. 5 for a 2D illustrative example. Under the vMF model with concentration $\Theta = \tau$, a surface normal $q_i \in S^2$ is distributed as

$$p(q_i \mid z_i, R, \Theta) = \mathrm{vMF}(q_i;\ m_{z_i}, \tau), \tag{18}$$

$$p(\Theta; \lambda) \propto Z(\tau)^{a} \exp(b\tau), \quad \lambda = \{a, b\}, \tag{19}$$

where the prior $p(\Theta; \lambda)$ on the concentration parameter of the vMF is only known up to proportionality [65].

4 REAL-TIME MF MAP INFERENCE

Based on the probabilistic generative models for the MF set up in the previous sections, we now develop real-time MF (RTMF) maximum-a-posteriori (MAP) inference methods. These algorithms are used to infer the local MF structure of an environment efficiently. Starting from the TG-MF model we first derive the MAP inference algorithm directly before employing an approximation that yields more efficient inference. Lastly, the vMF-MF MAP inference is derived. Those three MF algorithms are instantiations of the hard-assignment expectation maximization (EM) algorithm: we iterate assigning surface normals to the most likely MF axis and updating the MF rotation estimate until convergence.

In this section, for efficiency reasons and in the absence of further knowledge about the scene, the surface normals are assumed to be generated with equal probability from any of the axes, i.e., all $w_j = \frac{1}{6}$. For the same reason we assume identical isotropic covariances $\Sigma_j = \sigma^2 I$ for all TG-MF axes and identical concentration parameters $\tau_j$ for all vMF-MF axes. In Section 5, the TG-MF assumptions will be relaxed.

4.1 Direct MAP MF Estimation for the TG-MF

Starting from the tangent-space Gaussian MF model as set up in Section 3.2, we derive the direct MAP MF rotation estimation algorithm. The posterior over assignments $z_i$ of surface normals $q_i$ to an axis of the MF is given by

$$p(z_i = j \mid R, q_i;\ \Sigma, w) \propto w_j\, \mathcal{N}\big(\mathrm{Log}_{m_j}(q_i);\ 0, \Sigma\big). \tag{20}$$

Therefore the MAP estimate for the label $z_i$ becomes

$$z_i = \arg\min_{j \in \{1 \ldots 6\}} \mathrm{Log}_{m_j}(q_i)^T\, \Sigma^{-1}\, \mathrm{Log}_{m_j}(q_i) = \arg\min_{j \in \{1 \ldots 6\}} \arccos^2\big(q_i^T m_j\big), \tag{21}$$

where we have used $\arccos(q_i^T m_j) = \|\mathrm{Log}_{m_j}(q_i)\|_2$ and assumed that the covariance $\Sigma$ is isotropic. With $p(R) = \mathrm{Unif}(SO(3))$, the posterior over the MF rotation $R$ is

$$p(R \mid q, z;\ \Sigma) \propto p(q \mid z, R;\ \Sigma)\, p(R) \propto p(q \mid z, R;\ \Sigma) = \prod_{i=1}^{N} \mathcal{N}\big(\mathrm{Log}_{m_{z_i}}(q_i);\ 0, \Sigma\big). \tag{22}$$

Working in the log-domain, the MAP estimate for $R$ is

$$R^\star = \arg\min_{R} -\log p(R \mid q, z;\ \Sigma) := \arg\min_{R} f(R). \tag{23}$$

With the posterior in Eq. (22), the cost function $f(R)$ is

$$f(R) = -\log\left[\prod_{i=1}^{N} \mathcal{N}\big(\mathrm{Log}_{m_{z_i}}(q_i);\ 0, \Sigma\big)\right] \propto \sum_{i=1}^{N} \arccos^2\big(q_i^T m_{z_i}\big), \tag{24}$$

where we have used a derivation similar to Eq. (21). We call this method direct since the cost function directly penalizes a normal's deviation from its associated MF axis.

We enforce the constraints on $R$ ($R^T R = I$ and $\det(R) = 1$) by explicitly optimizing the cost function on the $SO(3)$ manifold using gradient descent with backtracking linesearch. More details can be found in [56], [66], [67]. To derive the Jacobian needed for the optimization we use that perturbations of $R$ from $R_0$ can be written as $R(\omega) = R_0 \mathrm{Exp}(\omega)$ where $\omega = \theta [w]_\times \in so(3)$ and $\mathrm{Exp}$ is the exponential map from $so(3)$ to $SO(3)$ as introduced in Eq. (12). Then the Jacobian is $J = \left.\frac{\partial f(R(\omega))}{\partial \omega}\right|_{\omega = 0}$, the derivative of $f$ with respect to the perturbation $\omega \in \mathbb{R}^3$ at $\omega = 0$

$$J = \sum_{i} \frac{\partial \arccos^2(x)}{\partial x} \left.\frac{\partial\, q_i^T R\, \mathrm{Exp}(\omega)\, e_{z_i}}{\partial \omega}\right|_{\omega = 0} = \sum_{i} \frac{2 \arccos\big(q_i^T R e_{z_i}\big)}{\sqrt{1 - \big(q_i^T R e_{z_i}\big)^2}}\; q_i^T R\, [e_{z_i}]_\times. \tag{25}$$

Backtracking line search in the direction of the negative Jacobian until the Armijo conditions are met provides an appropriate step size $\delta$ which allows us to obtain a new rotation estimate using the exponential map

$$R_t = R_{t-1}\, \mathrm{Exp}(-\delta J_t). \tag{26}$$

4.2 Approximate MAP MF Rotation Estimation

The direct approach derived in the previous section is inefficient since the cost function in Eq. (24) and the respective Jacobian involve a sum over all data-points. The Jacobian needs to be re-computed after each update to $R$ and the cost function multiple times during the backtracking linesearch. To address this inefficiency, we derive an approximate estimation algorithm by exploiting the geometry of $S^2$.

The approximation necessitates the computation of the Karcher means $\{\tilde{q}_j\}_{j=1}^{6}$ for each of the sets of normals, $\{q_i\}_{I_j}$, associated with the respective MF axis. The Karcher mean is the generalization of the standard Euclidean sample mean to arbitrary manifolds. It is computed efficiently as described in Appendix A. After this preprocessing step, we approximate $\mathrm{Log}_{m_{z_i}}(q_i)$ using the Karcher mean $\tilde{q}_{z_i}$ as proposed in [10]

$$\mathrm{Log}_{m}(q_i) \approx \mathrm{Log}_{m}(\tilde{q}) + R_{\tilde{q}}^{m}\, \mathrm{Log}_{\tilde{q}}(q_i), \tag{27}$$

where the subscript $z_i$ was omitted for the sake of clarity and $R_{\tilde{q}}^{m}$ rotates vectors in $T_{\tilde{q}}S^2$ to $T_{m}S^2$. Intuitively this approximates the mapping of $q_i$ into $T_{m_{z_i}}S^2$ with the mapping of the Karcher mean into $T_{m_{z_i}}S^2$ plus a correction term that accounts for the deviation of $q_i$ from $\tilde{q}_{z_i}$. See Fig. 6 for an illustration of the underlying geometry.

Fig. 6. The geometry of the approximation of $\mathrm{Log}_{m}(q_i)$.

With this the cost function $f(R)$ from Eq. (24) can be approximated by $\tilde{f}(R)$ as

$$f(R) \approx \tilde{f}(R) \propto \sum_{i=1}^{N} \mathrm{Log}_{m_{z_i}}(q_i)^T\, \Sigma^{-1}\, \mathrm{Log}_{m_{z_i}}(q_i) \propto \sum_{j=1}^{6} |I_j| \arccos^2\big(\tilde{q}_j^T m_j\big), \tag{28}$$

where we have used that the sample mean of the normals in the tangent space of their Karcher mean vanishes, $\sum_{i \in I_j} \mathrm{Log}_{\tilde{q}_j}(q_i) = 0$, by definition. With $m_k = R e_k$ the Jacobian for $\tilde{f}(R)$ is

$$\tilde{J} = \frac{\partial \tilde{f}(R(\omega))}{\partial \omega} = \sum_{k} \frac{2\, |I_k| \arccos\big(\tilde{q}_k^T R e_k\big)}{\sqrt{1 - \big(\tilde{q}_k^T R e_k\big)^2}}\; \tilde{q}_k^T R\, [e_k]_\times. \tag{29}$$

Thus the gradient descent optimization over $R$ only utilizes the Karcher means $\{\tilde{q}_j\}_{j=1}^{6}$, which can be pre-computed since the labels are fixed for the rotation estimation. This eliminates the costly iteration through all data-points at each gradient descent iteration.

4.3 MAP Inference in the vMF-MF Model

In this section, we derive the MAP inference for the vMF-MF model and show that the structure of the vMF distribution allows the MF rotation to be computed in closed form. With the uniform distribution over labels, i.e., $w_j = \frac{1}{6}$, the posterior distribution over label $z_i$ follows the proportionality

$$p(z_i = j \mid q_i, R;\ \tau) \propto \mathrm{vMF}(q_i;\ m_j, \tau) \propto \exp\big(\tau\, q_i^T m_j\big). \tag{30}$$

Since we assume equal concentration parameter $\tau$ for the six vMF distributions, the MAP assignment for $z_i$ is

$$z_i = \arg\max_{j \in \{1, \ldots, 6\}} q_i^T m_j. \tag{31}$$

The posterior distribution over the MF's rotation is

$$p(R \mid q, z;\ \tau) \propto p(q \mid z, R;\ \tau)\, p(R) \propto p(q \mid z, R;\ \tau) = \prod_{i=1}^{N} \mathrm{vMF}(q_i;\ m_{z_i}, \tau). \tag{32}$$

With $N = \sum_{j=1}^{6} e_j \sum_{i \in I_j} q_i^T$ we can find the optimum of the log posterior in closed form by noticing

$$\max_{R \in SO(3)} \sum_{i=1}^{N} \tau\, q_i^T m_{z_i} = \max_{R \in SO(3)} \tau \sum_{j=1}^{6} \sum_{i \in I_j} q_i^T R e_j = \max_{R \in SO(3)} \tau\, \mathrm{tr}\{N R\}. \tag{33}$$

This has the same form as the orthogonal Procrustes problem [68]. Hence, the optimal rotation can be computed in closed form using the SVD $N = U S V^T$ as

$$R^\star = V\, \mathrm{diag}\big(1, 1, \mathrm{sign}(\det(U V^T))\big)\, U^T. \tag{34}$$

This has been used before to align point patterns by Umeyama [69] and applied to the MF rotation estimation in a slightly different way by [45].
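The closed-form rotation update of Eqs. (33) and (34) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' code: the function name `vmf_mf_rotation` is hypothetical, labels are zero-based column indices into $E$, and the labels are taken as fixed (one M-step of the hard-assignment EM iteration).

```python
import numpy as np

# Columns of E are the six signed axes of Eq. (1).
E = np.array([[1, -1, 0, 0, 0, 0],
              [0, 0, 1, -1, 0, 0],
              [0, 0, 0, 0, 1, -1]], dtype=float)


def vmf_mf_rotation(normals, labels):
    """Closed-form MAP rotation of Eq. (34) for fixed zero-based labels.

    normals has shape (N, 3), labels has shape (N,) with values in 0..5.
    """
    N = np.zeros((3, 3))
    for j in range(6):
        # e_j times the transposed sum of all normals assigned to axis j.
        N += np.outer(E[:, j], normals[labels == j].sum(axis=0))
    U, _, Vt = np.linalg.svd(N)                 # N = U diag(s) Vt
    d = np.sign(np.linalg.det(U @ Vt))          # guards against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
```

For noise-free normals that coincide with the rotated axes, the estimate recovers the generating rotation. In practice this step would alternate with the label assignment of Eq. (31) until the labels stop changing.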

4.4 Real-Time MF Inference on Streaming Data

In the case of a stream of batches of surface normals obtained, for example, from an RGB-D camera, we impose a matrix vMF [70] diffusion model with concentration $\tau_R$. The conditional distribution of the current rotation $R$ given the previous rotation $R_-$ is

$$p(R \mid R_-; \tau_R) \propto \exp\left(\tau_R \operatorname{tr}\{R_-^T R\}\right). \tag{35}$$

This distribution is uniform over the rotation space for $\tau_R = 0$ and concentrates on $R_-$ as $\tau_R$ increases. See [71] Ex. 2 for a modern treatment of the matrix vMF distribution. In practice we chose $\tau_R = 1$. This adds the negative log likelihood term

$$f(R) = -\tau_R \operatorname{tr}\{R_-^T R\} \tag{36}$$

to the MAP cost functions derived in the previous sections. For the direct and the approximate RTMF algorithm this means an additional term in the Jacobian

$$\frac{\partial f(R(v))}{\partial v} = -\tau_R \frac{\partial}{\partial v} \operatorname{tr}\{R_-^T R(v)\} = -\tau_R \left[ \operatorname{tr}\{R_-^T R G_1\} \;\; \operatorname{tr}\{R_-^T R G_2\} \;\; \operatorname{tr}\{R_-^T R G_3\} \right]. \tag{37}$$

For the vMF-based algorithm we can still derive a closed-form rotation MAP estimate

$$R^\star = \arg\max_{R \in SO(3)} \log p(R \mid q, z; \tau) + \tau_R \operatorname{tr}\{R_-^T R\} = \arg\max_{R \in SO(3)} \operatorname{tr}\{\tilde{N} R\}, \qquad \tilde{N} = N + \tau_R R_-^T. \tag{38}$$

Note that the additional term stemming from the matrix vMF distribution acts as a regularizer if only one MF axis has associated observations.

5 THE MIXTURE OF MANHATTAN FRAMES

As alluded to in the introduction, the description of man-made environments on a global scale necessitates a more flexible model that can capture Manhattan Worlds with some relative rotation between them. This motivates the extension of the MF framework described in Section 3 to the MMF. In practice, scene representations may be composed of multiple intermediate representations, which may include MMFs, to facilitate higher-level reasoning (e.g., [7]). As such, adopting a probabilistic model allows one to describe and propagate uncertainty in the representation. Prior knowledge and model-inherent measurement noise can be incorporated in a principled way. Conditional independence allows drawing samples in parallel and hence leads to tractable inference.

In the proposed MMF representation, scenes consist of $K$ MFs, $\{M_1, \ldots, M_K\}$, which jointly define $6K$ signed axes. For $K = 1$, the MMF coincides with the MF. Specifically, let $q_i \in S^2$ denote the $i$th observed normal. In the MMF, each $q_i$ has two levels of association. The first, $c_i \in \{1, \ldots, K\}$, assigns $q_i$ to the $c_i$th MF. The second, $z_i \in \{1, \ldots, 6\}$, assigns $q_i$ to a specific signed axis within the MF $c_i$ as described in Section 3. In the following sections it will be convenient to collect all variables of the $k$th MF into $C_k = \{c_k, z_k, w_k, R_k, \Sigma_k\}$, where $\Sigma_k = \{\Sigma_{kj}\}_{j=1}^{6}$, and $c_k = \{c_i\}_{i: c_i = k}$ and $z_k = \{z_i\}_{i: c_i = k}$ denote all labels $c_i$ or $z_i$ which are associated to the $k$th MF via $c$. The MF axes of the $k$th MF are a function of the rotation $R_k$ according to Eq. (2) and will be denoted $\{\mu_{kj}\}_{j=1}^{6}$.

First we define the MMF's probabilistic model before we outline a sampling-based inference scheme. We restrict the analysis and inference method to the TG-MF model because the vMF distributions in the vMF-MF model necessitate more involved inference methods since the prior on the concentration does not have a closed form, as mentioned in Section 3.3. Hence an internal slice sampler would be required to sample posterior concentration parameters for the vMF-MF.

5.1 Probabilistic Model

Fig. 7 depicts a graphical representation of the probabilistic MMF model. It is a Bayesian finite mixture model that takes into account the geometries of both $S^2$ and $SO(3)$. In this probabilistic model, the MMF parameters are regarded as random variables and we avoid assumptions from Section 4 about weights and covariances of the individual MFs.

Fig. 7. Graphical model for a mixture of MFs.

A surface normal $q_i$ is associated with an MF via the assignment variable $c_i$. These MF-level assignments are assumed to be distributed according to a categorical distribution with a Dirichlet distribution prior with parameters $\alpha$

$$c_i \sim \operatorname{Cat}(\pi), \qquad \pi \sim \operatorname{Dir}(\alpha). \tag{39}$$

Each MF follows the TG-MF model outlined in Section 3.2 and hence a surface normal is distributed as

$$q_i \sim p(q_i \mid C_{c_i}; \gamma, \Delta, \nu). \tag{40}$$

We set $\alpha < 1$ to favor models with few MFs, which is typical for man-made scenes. Contemporary buildings, for example, customarily exhibit a small number of MFs. To encourage the association of equal numbers of normals to all MF axes, we place a strong prior $\gamma \gg 1$ on the distribution of axis assignments $z_i$. Intuitively, this encourages an MF to explain several normal directions and not just a single one.

5.2 Metropolis-Hastings MCMC Inference

We perform inference over the probabilistic MMF model described in Section 5.1 using Gibbs sampling with Metropolis-Hastings [12] split/merge proposals [13]. Specifically, the sampler iterates over the latent assignment variables $c$ and $z$, their parameters $\pi$ and $w = \{w_k\}_{k=1}^{K}$, as well as the covariances in the tangent spaces around the MF axes $\Sigma = \{\Sigma_k\}_{k=1}^{K}$ and the MF rotations $R = \{R_k\}_{k=1}^{K}$. We first explain all posterior distributions needed for Gibbs sampling before we outline the algorithm.
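The label part of the generative model in Section 5.1 can be illustrated with a short sketch. It draws MF weights $\pi \sim \operatorname{Dir}(\alpha)$, per-MF axis weights $w_k \sim \operatorname{Dir}(\gamma)$, and then the two levels of assignment $c_i$ and $z_i$. The function names, the use of symmetric scalar hyperparameters, and the zero-based label ranges are assumptions made for illustration; the normals themselves are not sampled.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    if total == 0.0:
        # All draws underflowed (possible for very small alpha): fall back
        # to a one-hot vector, the limiting behavior of a sparse Dirichlet.
        hot = rng.randrange(len(alpha))
        return [1.0 if k == hot else 0.0 for k in range(len(alpha))]
    return [d / total for d in draws]


def sample_mmf_labels(num_normals, K, alpha, gamma, seed=0):
    """Sample MF weights, axis weights, MF labels c and axis labels z."""
    rng = random.Random(seed)
    pi = sample_dirichlet([alpha] * K, rng)                     # MF weights
    w = [sample_dirichlet([gamma] * 6, rng) for _ in range(K)]  # axis weights
    c = rng.choices(range(K), weights=pi, k=num_normals)        # c_i ~ Cat(pi)
    z = [rng.choices(range(6), weights=w[ci], k=1)[0] for ci in c]
    return pi, w, c, z
```

Choosing `alpha` below 1 concentrates `pi` on few MFs, while a large `gamma` keeps each `w[k]` close to uniform over the six axes, which matches the stated intent of the two priors.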

5.2.1 Posterior Distributions for MCMC Sampling

The posterior distributions of both mixture weights are

$$p(\pi \mid c; \alpha) = \operatorname{Dir}(\alpha_1 + N_1, \ldots, \alpha_K + N_K) \tag{41}$$

$$p(w_k \mid c, z; \gamma) = \operatorname{Dir}(\gamma_1 + N_{k1}, \ldots, \gamma_6 + N_{k6}), \tag{42}$$

where $N_k = \sum_{i=1}^{N} \mathbb{1}_{[c_i = k]}$ is the number of normals assigned to the $k$th MF and $N_{kj} = \sum_{i=1}^{N} \mathbb{1}_{[c_i = k]} \mathbb{1}_{[z_i = j]}$ is the number of normals assigned to the $j$th axis of the $k$th MF. The indicator function $\mathbb{1}_{[a = b]}$ is 1 if $a = b$ and 0 otherwise.

Using the likelihood of $q_i$ from Eq. (14), the posterior distributions for labels $c_i$ and $z_i$ are given as

$$p(c_i = k \mid \pi, q_i, \Theta) \propto \pi_k \sum_{j=1}^{6} w_{kj}\, p(q_i; \mu_{kj}, \Sigma_{kj}) \tag{43}$$

$$p(z_i = j \mid c_i, q_i, \Theta) \propto w_{c_i j}\, p(q_i; \mu_{c_i j}, \Sigma_{c_i j}), \tag{44}$$

where $\Theta = \{w, \Sigma, R\}$. We compute $x_i = \operatorname{Log}_{\mu_{c_i z_i}}(q_i)$, the mapping of $q_i$ into $T_{\mu_{c_i z_i}} S^2$, to obtain the scatter matrix $S_{kj} = \sum_{i}^{N} \mathbb{1}_{[c_i = k]} \mathbb{1}_{[z_i = j]}\, x_i x_i^T$ in $T_{\mu_{kj}} S^2$. Using $S_{kj}$, the posterior distribution over covariances $\Sigma_{kj}$ is

$$p(\Sigma_{kj} \mid c, z, q, R; \Delta, \nu) = \operatorname{IW}(\Delta + S_{kj}, \nu + N_{kj}). \tag{45}$$

Since there is no closed-form posterior distribution for an MF rotation given axis-associated normals, we approximate it as a narrow Gaussian distribution on $SO(3)$ around the optimal rotation $R_k^\star$ under normal assignments $z$ and $c$

$$p(R_k \mid z, c, q) \approx \mathcal{N}\left(R_k; R_k^\star(R_k^0, z, c, q), \Sigma_{so(3)}\right), \tag{46}$$

where $\Sigma_{so(3)} \in \mathbb{R}^{3 \times 3}$ and $R_k^0$ is set to $R_k$ from the previous Gibbs iteration. Refer to Appendix B for details on how to evaluate and sample from this distribution.

The (locally) optimal rotation $R_k^\star \in SO(3)$ of MF $M_k$ given the assigned normals $q = \{q_i\}_{i: c_i = k}$ and their associations $z_i$ to one of the six axes $\mu_{k z_i}$ can be found using any of the MAP MF inference algorithms (i.e., Section 4.1 or 4.3).

5.2.2 Metropolis-Hastings MCMC Sampling

The Gibbs sampler with Metropolis-Hastings split/merge proposals is outlined in Algorithm 1. For $K$ MFs and $N$ normals the computational complexity per iteration is $O(K^2 N)$. To let the order of the model adapt to the complexity of the distribution of normals on the sphere, we implement Metropolis-Hastings-based split/merge proposals. The details of the algorithm are described in the following sections.

Algorithm 1. One Iteration of the MMF Inference
1: Draw $\pi \mid c; \alpha$ using Eq. (41)
2: Draw $c \mid \pi, q, R, \Sigma$ in parallel using Eq. (43)
3: for $k \in \{1, \ldots, K\}$ do
4:   Draw $w_k \mid c, z; \gamma$ using Eq. (42)
5:   Draw $z \mid c, w, q, R, \Sigma$ in parallel using Eq. (44)
6:   Draw $R_k \mid z, c, q; \Sigma_{so(3)}$ using Eq. (46)
7:   Draw $\{\Sigma_{kj}\}_{j=1}^{6} \mid c, z, q, R; \Delta, \nu$ using Eq. (45)
8: end for
9: Propose splits for all MFs
10: Propose merges for all MF combinations

5.3 Split/Merge Proposals

Here we derive split and merge proposals for the MMF model as well as their acceptance probability in an approach similar to Richardson and Green [13]. Note that a merge involves moving all points from MFs $l$ and $m$ into a new MF $n$ and then removing MFs $l$ and $m$. Similarly, a split creates two new MFs $l$ and $m$ from a single MF $n$. Hence, both a split and a merge change the number of parameters in the model. Specifically, the parameters that change their dimension are the set of MF rotations, $R$, and the set of covariances on the MF axes, $\Sigma$. The labels $z$ and $c$ remain of the same dimensions; only the range for $c$ changes from $[1, K]$ to $[1, K-1]$ (merge) or from $[1, K]$ to $[1, K+1]$ (split). Therefore, we employ the theory of Reversible Jump Markov Chain Monte Carlo (RJMCMC) [72] to derive a proper acceptance probability. RJMCMC is a generalization of Metropolis-Hastings MCMC [12] and provides a way of computing an acceptance probability when the number of parameters changes between moves. We will see that the split/merge proposals as well as the acceptance probabilities are similar to what one would expect from the Metropolis-Hastings algorithm. For this reason, and because the MH algorithm is better known, we chose to refer to the inference algorithm as Metropolis-Hastings MCMC.

5.3.1 RJMCMC Split/Merge Moves in an MMF

RJMCMC utilizes auxiliary variables to propose deterministic moves to change between model orders. In the following, we will give the RJMCMC algorithm for a merge proposal between two MFs. The inverse proposal of a split of an MF into two MFs follows the same but inverted process.

Let an MF $A$ be parameterized by the random variables $C_A$. An RJMCMC merge proposal between MFs $m$ and $l$ is executed in three steps. First, an auxiliary MF $v$ is sampled from $q(\text{merge})$ to propose a merge of the current MFs $l$ and $m$, as will be described in Section 5.3.2. Second, the deterministic function $f([C_l, C_m, C_v]) = [u_1, u_2, \hat{C}_n]$ is used to obtain the merged MF $n$ parameterized by $\hat{C}_n$. The auxiliary MFs $u_1$ and $u_2$ absorb the MFs $l$ and $m$ from before the merge. The function $f([C_l, C_m, C_v])$ is hence defined as

$$u_1 = C_l, \qquad u_2 = C_m, \qquad \hat{C}_n = C_v. \tag{47}$$

Therefore, the Jacobian $J_f$ of the function $f([C_l, C_m, C_v])$ is

$$J_f = \frac{\partial f([C_l, C_m, C_v])}{\partial [C_l, C_m, C_v]} = I, \tag{48}$$

where $I$ is the identity matrix with determinant 1.

Third, the proposed merge is accepted with probability

$$\min\left\{1, \frac{\prod_{k=1}^{K-1} p(\hat{C}_k; \alpha, \gamma, \Delta, \nu)\, q(\text{split})}{\prod_{k=1}^{K} p(C_k; \alpha, \gamma, \Delta, \nu)\, q(\text{merge})} \left|\det(J_f)\right| \right\}, \tag{49}$$

where parameters after the merge are designated with a hat. The proposal distribution for the reverse move, a split of MF $n$ into MFs $l$ and $m$, is denoted $q(\text{split})$.

The RJMCMC split proposal of an MF $n$ into MFs $l$ and $m$ follows the same process except that a split is proposed according to Section 5.3.3 instead of a merge. The deterministic transformation is the inverse of $f(\cdot)$. This means that the determinant of the Jacobian is 1 and the acceptance probability for the split is Eq. (49) where the ratio has been inverted.

Note that the RJMCMC acceptance probability for split/merge moves in an MMF looks like the Metropolis-Hastings acceptance probability, because $|\det(J_f)| = 1$. However, since the model orders in the numerator and denominator of the fraction are different, it technically is not a Metropolis-Hastings acceptance probability.

5.3.2 Merge Proposal in an MMF

Let the two MFs $l$ and $m$ be parameterized by the random variables $C_l$ and $C_m$. A merged MF $n$ can be sampled from the current MFs $l$ and $m$ as follows. We first assign all normals of MFs $l$ and $m$ to MF $n$: $c_{c \in \{l, m\}} = n$, which corresponds to the proposal distribution

$$q(c_l, c_m \mid c) = \delta(\{c_l, c_m\} - n). \tag{50}$$

Second, we sample the axis assignments $z_n$ according to

$$q(z_i = j \mid w_l, R_l, \Sigma_l, q) \propto w_{lj}\, p(q_i; \mu_{lj}, \Sigma_{lj}). \tag{51}$$

Next, given associations $c_n$ and $z_n$, we find the optimal rotation using the closed-form solution of the vMF-based model derived in Section 4.3. This is justified because the direct and the vMF-based algorithms generally found the same optimum in our experiments. Then we sample $R_n$ from a narrow Gaussian distribution over rotations with mean $R_n^\star$

$$q(R_n \mid z, c, q, R_l) = \mathcal{N}\left(R_n; R_n^\star(z_n, c_n, q), \Sigma_{so(3)}\right) = \mathcal{N}\left(\left(\operatorname{Log}_{R_n^\star}(R_n)\right)^\vee; 0, \Sigma_{so(3)}\right), \tag{52}$$

where $\operatorname{Log}_{R_n^\star}(R): SO(3) \to T_{R_n^\star} SO(3)$ denotes the logarithm map of $R$ into the tangent space $T_{R_n^\star} SO(3)$ around $R_n^\star$. The vee operator $\vee$ [52] extracts the unique elements of a skew-symmetric matrix $W \in \mathbb{R}^{3 \times 3}$ into a vector $w$: $W^\vee = w = [-W_{23}; W_{13}; -W_{12}] \in \mathbb{R}^3$. $\Sigma_{so(3)} \in \mathbb{R}^{3 \times 3}$ is the covariance of the Normal distribution in $T_{R_n^\star} SO(3)$. Refer to Appendix B for an in-depth discussion.

Finally, we obtain samples for the axis covariances $\Sigma_n$ according to the proposal distribution

$$q(\Sigma_n \mid c, z, R_n, q) = \prod_{j=1}^{6} p(\Sigma_{nj} \mid z_n, c_n, q, R_n), \tag{53}$$

where $p(\Sigma_{nj} \mid z_n, c_n, q, R_n; \Delta, \nu)$ is the posterior distribution over covariance $\Sigma_{nj}$ under the IW prior given the assigned normals in the tangent space $T_{\mu_{nj}} S^2$.

The proposal of merging MFs $l$ and $m$ into MF $n$ factors as

$$q(\text{merge}) = q(C_n \mid C_l, C_m, q; \alpha, \gamma, \Delta, \nu) = q(c_l, c_m \mid c)\, q(R_n \mid R_l, z, c, q)\, q(\Sigma_n \mid c, z, q, R_n; \Delta, \nu) \prod_{i: c_i = n} q(z_i \mid w_l, R_l, \Sigma_l, q). \tag{54}$$

5.3.3 Split Proposal in an MMF

First, we randomly assign normals in MF $n$ to MF $l$ or $m$ by drawing MF labels according to the Dirichlet Multinomial (DirMult) distribution

$$q(c_n \mid c; \alpha) = \operatorname{DirMult}(c_n; \alpha_l, \alpha_m) = \int_{\pi} \operatorname{Cat}(c \mid \pi) \operatorname{Dir}(\pi; \alpha_l, \alpha_m)\, d\pi, \tag{55}$$

$$\operatorname{DirMult}(c; \alpha) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\Gamma\left(\sum_{k=1}^{K} (\alpha_k + N_k)\right)} \prod_{k=1}^{K} \frac{\Gamma(\alpha_k + N_k)}{\Gamma(\alpha_k)}, \tag{56}$$

and the counts $N_k$ of labels $c_i = k$ are $N_k = \sum_{i=1}^{N} \mathbb{1}_{[c_i = k]}$.

Within each of the MFs $l$ and $m$ we assign normals $q$ to an axis by drawing the assignments $z_n$ as

$$q(z_i = j \mid w_n, R_n, \Sigma_n, q) \propto w_{nj}\, p(q_i; \mu_{nj}, \Sigma_{nj}). \tag{57}$$

Using these assignments, we find optimal rotations $R_l^\star$ and $R_m^\star$ and draw $R_l$ and $R_m$

$$q(R_l, R_m \mid z, c, q, R_n) = \mathcal{N}\left(R_l; R_l^\star(R_n, z, c, q), \Sigma_{so(3)}\right) \mathcal{N}\left(R_m; R_m^\star(R_n, z, c, q), \Sigma_{so(3)}\right). \tag{58}$$

Given rotations as well as labels, we can draw axis covariances $\Sigma_{\{l,m\}}$ from the respective posterior

$$q(\Sigma_{\{l,m\}} \mid c, z, q, R_{\{l,m\}}; \Delta, \nu) = \prod_{j=1}^{6} p(\Sigma_{lj} \mid z, c, q, R_l)\, p(\Sigma_{mj} \mid z, c, q, R_m). \tag{59}$$

The split proposal distribution factors as

$$q(\text{split}) = q(C_l, C_m \mid C_n, q; \alpha, \gamma, \Delta, \nu) = q(c_n \mid c; \alpha)\, q(R_{\{l,m\}} \mid z, c, q, R_n)\, q(\Sigma_{\{l,m\}} \mid c, z, q, R_{\{l,m\}}) \prod_{i: c_i = n} q(z_i \mid w_n, R_n, \Sigma_n, q). \tag{60}$$

5.3.4 RJMCMC Acceptance Probability

After introducing the RJMCMC merge and split proposals in the previous sections, we will now derive the acceptance probabilities for those two moves by detailing the distributions involved in the computation of Eq. (49).

The joint distribution for the MMF model is defined by the graphical model depicted in Fig. 7. For the evaluation of the acceptance probability, we marginalize over the categorical variables $\pi$ and $w$ as in the split proposal in Eq. (55)

$$p(c; \alpha) = \int_{\pi} p(c \mid \pi)\, p(\pi; \alpha)\, d\pi = \operatorname{DirMult}(c; \alpha) \tag{61}$$

$$p(z_k \mid c; \gamma) = \int_{w_k} p(z_k \mid c, w_k)\, p(w_k; \gamma)\, dw_k = \operatorname{DirMult}(z_k; \gamma). \tag{62}$$

After marginalization of $\pi$ and $w$, the joint distribution is

$$p(q, c, z, \Sigma, R; \alpha, \gamma, \Delta, \nu) = p(c; \alpha) \prod_{i=1}^{N} p(q_i \mid c_i, z_i, R_{c_i}, \Sigma_{c_i z_i}) \prod_{k=1}^{K} p(R_k)\, p(z_k \mid c; \gamma) \prod_{j=1}^{6} p(\Sigma_{kj}; \Delta, \nu), \tag{63}$$

where we have assumed that the prior over rotations factors according to $p(R) = \prod_{k=1}^{K} p(R_k)$. Therefore, the ratio of joint probabilities in the merge move acceptance probability in Eq. (49) becomes

$$\begin{aligned}
\frac{p(q, \hat{c}, \hat{z}, \hat{\Sigma}, \hat{R}; \alpha, \gamma, \Delta, \nu)}{p(q, c, z, \Sigma, R; \alpha, \gamma, \Delta, \nu)}
&= \frac{p(\hat{c}; \alpha)\, p(q \mid \hat{c}, \hat{z}, \hat{R}, \hat{\Sigma})\, p(\hat{z} \mid \hat{c}; \gamma)\, p(\hat{\Sigma}; \Delta, \nu)\, p(\hat{R})}{p(c; \alpha)\, p(q \mid c, z, R, \Sigma)\, p(z \mid c; \gamma)\, p(\Sigma; \Delta, \nu)\, p(R)} \\
&= 8\pi^2\, \frac{p(\hat{c}; \alpha) \prod_{i}^{N} p(q_i \mid \hat{c}_i, \hat{z}_i, \hat{R}, \hat{\Sigma}) \prod_{k=1}^{\hat{K}} p(\hat{z}_k \mid \hat{c}; \gamma) \prod_{j=1}^{6} p(\hat{\Sigma}_{kj}; \Delta, \nu)}{p(c; \alpha) \prod_{i}^{N} p(q_i \mid c_i, z_i, R, \Sigma) \prod_{k=1}^{K} p(z_k \mid c; \gamma) \prod_{j=1}^{6} p(\Sigma_{kj}; \Delta, \nu)},
\end{aligned} \tag{64}$$

where $\hat{K} = K - 1$. For a split proposal this ratio is inverted.

The acceptance probability of splits and merges of MFs can be computed by plugging Eq. (64) into Eq. (49).

5.4 MAP Inference in the vMF-MMF Model

To design a simple inference algorithm for MAP estimation in the vMF-MMF model we add a hierarchy level to the vMF-MF model from Section 4.3 as described in the previous MMF section. For hard-assignment EM-based inference we only additionally require the assignment of data points to an MF. According to the MMF model the labels $c_i$ are distributed as

$$p(c_i = k \mid q_i, R; \tau) \propto w_k \sum_{j} \pi_{kj} \exp\left(\tau\, q_i^T \mu_{kj}\right). \tag{65}$$

Assuming $\pi_{kj} = \frac{1}{6}$ and all $\tau$ equal, as in the derivation of the vMF MAP algorithm, leads to the hard assignment rule

$$c_i = \arg\max_{k}\; w_k \sum_{j} \pi_{kj} \exp\left(\tau\, q_i^T \mu_{kj}\right) \approx \arg\max_{k}\, \max_{j}\; \tau\, q_i^T \mu_{kj} + \log(w_k). \tag{66}$$

Given the MF-level assignment, data is assigned to MF-internal axes as dictated by Eq. (31) and the $k$th MF rotation is updated according to Eq. (34). With the number of normals assigned to the $k$th MF, $N_k = \sum_{i=1}^{N} \mathbb{1}_{[c_i = k]}$, we estimate the MF proportions $w_k$ under the Dirichlet distribution prior as

$$w_k = \frac{N_k + \alpha}{N + K\alpha}. \tag{67}$$

6 EVALUATION

We evaluate the properties and performance of the real-time MF inference algorithms before turning to the MMF inference algorithms. All evaluations were run on an Intel Core i7-3940XM CPU at 3.00 GHz with an NVIDIA Quadro K2000M GPU.

6.1 Evaluation of Real-Time MAP Inference

We show run-times and rotation estimation accuracy of all three derived real-time MF inference algorithms on a dataset with groundtruth (GT) camera rotations from a Vicon motion-capture system. The dataset was obtained by waving the camera randomly in full 3D motion up-down as well as left-right in front of a simple MW scene for 90 s, as can be seen in the GT yaw-pitch-roll angles in the first row of Fig. 9.

The approximate RTMF algorithm was run for at most 25 iterations while the direct RTMF algorithm was run for at most ten to keep computation time low. Any fewer iterations rendered the direct MF rotation estimation unusable.

Timings. We split the computation times into the following stages: (1) applying a guided filter to the raw depth image, (2) computing surface normals from the smoothed depth image, (3) pre-computing of data and (4) optimization for the MF rotation. The timings shown in Fig. 8 were computed over all frames of the dataset. At 111 ms per frame the direct method cannot be run in real-time. While the approximate method improves the runtime, it is 15 ms slower than the vMF-based approach, which runs in 18 ms. The approximate algorithm is slower since the Karcher mean pre-computation is iterative whereas the vMF pre-computation is single pass. As intended by shifting all computations over the full data into a pre-processing step, the latter two RTMF algorithms can be run at a camera frame-rate of 30 Hz. In the following we omit the direct method from the evaluation due to its slow runtime.

Fig. 8. Timing breakdown for the three different RTMF algorithms. The error bars show the one-$\sigma$ range.

Accuracy. Over the whole dataset the algorithms obtain an angular RMSE from the Vicon groundtruth rotation of 2.5 degrees for the vMF-based approach and 4.56 degrees for the approximate method, showing the higher precision of the vMF-based closed-form rotation estimates. The percentages of surface normals associated with the MF axes displayed in the second row of Fig. 9 support the intuition that a less uniform distribution of normals across the MF axes results in a worse rotation estimate: large angular deviations occur when there are surface normals on only one or two MF axes for several frames. Note that the rotation estimates of the RTMF algorithms are drift free, an important property for visual odometry systems.

Fig. 9. Rotation estimation accuracy. Percentages of points assigned to each MF axis are color-coded in the second row.

6.2 Evaluation of MMF Inference

We now evaluate MMF inference on various datasets across scales and compare against MF and VP estimation algorithms.

With the RJMCMC-based approach, we infer an MMF in a coarse-to-fine approach. First, we down-sample to 120 k normals and run the algorithm for $T = 150$ iterations, proposing splits and merges throughout as described in Section 5.2. We use the following parameters: $\Sigma_{so(3)} = (2.5^\circ)^2 I_{3 \times 3}$, $\alpha = 0.01$, $\gamma = 120$, $\nu = 12\text{k}$, and $\Delta = (11^\circ)^2 \nu I_{2 \times 2}$. For the purpose of displaying results, we obtain MAP estimates from samples from the posterior distribution of the MMF. First, we find the most likely number of MFs $K^\star$ from all samples after a burn-in of 100 RJMCMC iterations. We then run MCMC starting from the latest sample that has $K^\star$ MFs using all data without proposing splits and merges. All MMF results displayed herein show the last MMF sample of that chain.

The vMF-MMF MAP inference algorithm is sensitive to the initial MF rotations. Hence, we run it 11 times, each time starting from 6 randomly rotated MFs, and choose one of the models with the most likely number of MFs after discarding MFs with less than 10 percent of surface normals.

6.2.1 MMF Inference from Depth Images

We first highlight different aspects and properties of the inference using the 3-box scene depicted in Fig. 10. For this scene, we initialized the number of MFs to $K = 6$. The algorithm correctly infers $K = 3$ MFs corresponding to the three differently rotated boxes, as displayed in Fig. 10 on the sphere and in the point cloud. While the blue MF consists only of the single box standing on one corner, the green and red MFs contain planes of the surrounding room in addition to their respective boxes. This highlights the ability of our model to pool normal measurements from the whole scene.

Fig. 10. Inferred MMF in scene and surface normal space.

In Fig. 11 we show several typical indoor scenes of varying complexity and the inferred MF using the vMF-based RTMF algorithm, the MMF inferred by the MAP-MMF algorithm and the MMF inferred by the RJMCMC algorithm. The MMF inference algorithms were started with six MFs in all cases. For the single-MW scenes, all these algorithms infer the same MF. For the multiple-MW scenes, the MMF and the MAP-MMF algorithms infer the same reasonable MFs while the vMF RTMF algorithm seems to pick the most prominent MF.

Besides poor depth measurements due to reflections, strong ambient light, black surfaces, or range limitations of the sensor, the inference converged to the wrong number of MFs mainly because of violations of the MW assumption such as round objects or significant clutter in the scene. We observe that the algorithm fails gracefully, approximating round objects with several MFs or adding a “noise MF” to capture clutter, as can be seen in Fig. 12. Hence, to eliminate “noise MFs”, we consider only MFs with more than 10 percent of all normals for the following quantitative evaluation.

Fig. 12. Common failure cases of the MMF inference.

To evaluate the performance of the MMF inference algorithms, we ran them on the NYU V2 dataset [40], which contains 1,449 RGB-D images of various indoor scenes.

MF Count Inference. For each scene, we compare the number of MFs the algorithm infers to the number of MFs a human annotator perceives. The confusion matrices for the two MMF algorithms are

$$C_{\text{MMF}} = \begin{pmatrix} 557 & 467 & 108 & 3 & 0 \\ 130 & 152 & 28 & 1 & 1 \end{pmatrix}, \qquad C_{\text{vMF-MMF}} = \begin{pmatrix} 528 & 283 & 186 & 138 \\ 37 & 118 & 83 & 74 \end{pmatrix}. \tag{68}$$

The row sums agree between the two matrices (1,135 and 312 scenes), consistent with rows indexing the annotated number of MFs and columns the inferred number. The MMF algorithm infers the human-perceived number of MFs in 49.0 percent of the scenes while vMF-MMF is slightly worse with 44.6 percent and a tendency to overestimate the number of MFs.

MW Orientation Accuracy. We use the groundtruth MW orientation of the most prominent MW provided by [45] to

Fig. 11. Segmentation and inferred (M)MF of various indoor scenes partly taken from the NYU V2 depth dataset [40]. For single-MF scenes we color-code the assignment to MF axes and for MMF scenes the assignments to MFs.

TABLE 1 Algorithm Timings on NYU V2 Dataset

Method     RTMF    MMF vMF   MMF    RMF    MF BB   Lez    Lee    Kos    Barn
Time [s]   0.037   0.18      3312   23.6   0.061   3.99   5.76   0.21   0.015

directly evaluate MW orientation estimation accuracy. We take into account that the same MF axes defined according to Eq. (2) can be described by 24 rotations $\{R_{\text{MF},i}\}_{i=1}^{24}$. These are constructed as follows: for all six permutations of choosing two columns $r$ and $r'$ from $R_{\text{MF}}$, construct four rotation matrices

$$[r, r', r''], \quad [-r, r', -r''], \quad [r, -r', -r''], \quad [-r, -r', r''], \tag{69}$$

where $r'' = r \times r'$. To compute the angular deviation $\theta$ of an estimated MF to the ground truth MF rotation, we construct the set $\{R_{\text{MF},i}\}_{i=1}^{24}$ from the inferred MF rotation, compute all angular rotation deviations to the groundtruth $R_{\text{GT}}$ and choose the smallest deviation

$$\theta = \min_{i} \arccos\left(\tfrac{1}{2} \operatorname{trace}\left(R_{\text{GT}}^T R_{\text{MF},i}\right) - \tfrac{1}{2}\right). \tag{70}$$

In case of MMFs we choose the smallest deviation across MFs.

Fig. 13. CDFs of gravity and MW orientation estimation.

Fig. 13 (right) depicts cumulative distribution functions (CDF) for the angular deviation of the different MF and MMF algorithms, and two VP extraction algorithms, [31] (VP Lez) and [17] (VP Lee), which extract three OVPs. Evidently, MF algorithms estimate the MW rotation more accurately than the OVP algorithms. The MMF algorithms show higher accuracy on the whole dataset than single MF algorithms.

Gravity Direction Estimation. Like the algorithms by Silberman et al. [40] and Gupta et al. [41], the proposed MF and MMF inference algorithms can be used to estimate the gravity direction to facilitate rotating scenes into a canonical frame for scene understanding. VPs are also indicative of the gravity direction and we show the performance of two additional VP algorithms, [20] (VP Kos) and [29] (VP Barn). The mean direction of surface normals in the scene parts labeled as “floor” serves as a proxy for the true gravity direction in the evaluation.

The CDFs of angular deviation from the gravity direction in Fig. 13 (left) demonstrate that all MF inference algorithms match the performance of Gupta et al. and clearly outperform all VP-based estimates. The inferred MMF models outperform all other methods, most likely because of the higher flexibility of the model. The MF algorithms all show similar performance.

Timing. Table 1 gives an overview of run-times for the different algorithms averaged over the 1,449 scenes from the NYU V2 dataset. RTMF-vMF is the fastest of the MF and MMF algorithms while the sampling-based algorithm is, unsurprisingly, the slowest. It could, however, be sped up, e.g., by employing a sub-cluster approach for split-merge proposals [10], [74].

6.2.2 Additional Qualitative MMF Inference Results

Triangulated Meshes. Algorithms such as Kintinuous [75] and Elastic Fusion [76] allow dense larger-scale indoor reconstructions from a stream of RGBD frames, as depicted in the middle of Fig. 1 for a couch area with some boxes, shelves and a Lego house. Using the triangles’ surface normals, the MMF can be inferred. The associations to one of the inferred MFs are shown in the middle of Fig. 14 while the associations to the MF axes within each of the MFs are shown to the right.

LiDAR Data. The large-scale LiDAR scan of Cambridge (Fig. 1 right) has few measurements on the sides of buildings due to reflections off the glass facades and inhomogeneous point density because of overlapping scan-paths. To handle these properties, we implement a variant of robust moving-least-squares normal estimation [77]. Fig. 15 shows the point cloud colored according to MF assignment of the normals overlaid on a gray street-map. The inferred MFs share the upward direction without imposing any constraints. The inferred MFs capture large-scale organizational structure in this man-made environment: blue and green are the directions of Boston and Harvard respectively, and red is aligned with the Charles river. The locations belonging to the MFs are spatially separated, supporting that the MW assumption is best treated as a local property as argued in the introduction.
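The orientation-error metric of Eqs. (69) and (70) can be sketched as follows. The sketch enumerates the 24 rotations that describe the same six signed MF axes and returns the smallest rotation angle to a groundtruth rotation. Rotations are plain nested lists with MF axes as columns; all function names are illustrative, and the sign pattern of the third column is the one implied by requiring determinant +1.

```python
import itertools
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def columns(R):
    """Columns of a 3x3 matrix given as nested lists."""
    return [tuple(R[i][j] for i in range(3)) for j in range(3)]


def from_columns(c0, c1, c2):
    """Assemble a 3x3 matrix from three column vectors."""
    return [[c0[i], c1[i], c2[i]] for i in range(3)]


def equivalent_rotations(R_mf):
    """Enumerate the 24 rotations sharing the signed axes of R_mf."""
    rotations = []
    # Six ordered pairs of distinct columns, four sign choices each.
    for r, rp in itertools.permutations(columns(R_mf), 2):
        for s, sp in ((1, 1), (-1, 1), (1, -1), (-1, -1)):
            a = tuple(s * x for x in r)
            b = tuple(sp * x for x in rp)
            # The cross product as third column keeps the determinant at +1.
            rotations.append(from_columns(a, b, cross(a, b)))
    return rotations


def angle_between(R_a, R_b):
    """Rotation angle of R_a^T R_b in radians."""
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    return math.acos(max(-1.0, min(1.0, 0.5 * (trace - 1.0))))


def mf_angular_deviation(R_gt, R_mf):
    """Smallest angular deviation over all equivalent rotations."""
    return min(angle_between(R_gt, R) for R in equivalent_rotations(R_mf))
```

For an MMF, the same function would be evaluated for each inferred MF rotation and the minimum taken, as stated in the text.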

Fig. 14. MMF extracted from a mesh of an indoor couch area (see middle of Fig. 1) obtained using Kintinuous [75].

Fig. 15. Inferred MMFs (top left) and MMF labels overlaid on top of a street map of Kendall Square in Cambridge, MA. Normals associated with upward axes are hidden for clarity.

7 CONCLUSION

Guided by the observation that regularities of the orientations of planes composing man-made environments manifest in structured surface normal distributions, we have proposed the MF model, which captures the MW assumption in the space of surface normals. We have formalized the notion of the MF and explored two different probabilistic models and resulting MAP inference algorithms. These real-time capable inference algorithms are useful for extracting the local MW orientation and segmentation of a scene from a stream of data. Motivated by the observation that on a larger scale the commonly-made MW assumption is easily broken, we have extended the MF model to a mixture of MFs. Besides a simple MAP inference algorithm, a manifold-aware Gibbs sampling algorithm with Metropolis-Hastings split/merge proposals allows adaptive and robust inference of MMFs. This enables the proposed model to describe both complex small-scale indoor and large-scale urban scenes. We have demonstrated the versatility of our model by extracting MMFs from 1.5 k indoor scenes, larger indoor reconstructions and from aerial LiDAR data of Cambridge, MA. Since VP estimation algorithms showed lower accuracy in both gravity and MF orientation estimation, we conclude that if surface normal data is available MF inference algorithms should be used. Beyond the herein demonstrated applications of gravity direction estimation, drift-free rotation estimation and scene segmentation, we envision that the MF and MMF models can enable robust and flexible modeling of real-world scenes for a variety of tasks, such as scene understanding, 3D reconstruction and 3D perception.

APPENDIX A
THE KARCHER MEAN

The Karcher mean $\tilde{q}$ of a set of points on a manifold $\{q_i\}_{i=1}^{N}$ is a generalization of the sample mean in Euclidean space [78]. It is a local minimizer of the following weighted cost function

$$\tilde{q} = \arg\min_{p \in M} \sum_{i=1}^{N} w_i\, d^2(p, q_i). \tag{71}$$

Here, $w_i = 1$, $M = S^2$, and $d(\cdot, \cdot) = d_G(\cdot, \cdot)$. In this case, excepting degenerate sets, it has a single minimum. It may be computed by the following iterative algorithm:

1) project all $\{q_i\}_{i=1}^{N}$ into $T_{\tilde{q}_t} S^2$ and compute their sample mean $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} \operatorname{Log}_{\tilde{q}_t}(q_i)$.
2) project $\bar{x}$ from $T_{\tilde{q}_t} S^2$ back onto the sphere to obtain $\tilde{q}_{t+1} = \operatorname{Exp}_{\tilde{q}_t}(\bar{x})$.
3) iterate until $\|\bar{x}\|_2$ is sufficiently close to 0.

APPENDIX B
NORMAL DISTRIBUTION OVER SO(3)

A matrix $R \in \mathbb{R}^{3 \times 3}$ is called a rotation matrix if it is an element of $SO(3)$, the Special Orthogonal group; namely, $R^T R = I$ and $\det(R) = 1$. Probability distributions over rotation matrices can be defined by exploiting the manifold structure of $SO(3)$ (e.g., [52], [78]). In particular, a way to construct the analog of a Gaussian distribution utilizes the linearity of the tangent spaces. Let $\operatorname{Log}_{R_\mu}(R): SO(3) \to so(3)$ denote the logarithm map of $R$ into the associated Lie Algebra $so(3)$. With $\theta = \arccos\left(\frac{1}{2}\left(\operatorname{trace}(R_\mu^T R) - 1\right)\right)$

$$\operatorname{Log}_{R_\mu}(R) = \frac{\theta}{2 \sin(\theta)} \left(R_\mu^T R - R^T R_\mu\right). \tag{72}$$

As a member of $so(3)$, the matrix $W = \operatorname{Log}_{R_\mu}(R)$ is skew-symmetric. The vee operator $\vee$ [52] inverts $[\cdot]_\times$ from Eq. (13) by extracting the unique elements of $W$ into a vector $w$: $W^\vee = w = [-W_{23}; W_{13}; -W_{12}] \in \mathbb{R}^3$. With this we can define a normal distribution with mean rotation $R_\mu$ and covariance $\Sigma_{so(3)} \in \mathbb{R}^{3 \times 3}$ in $so(3)$

$$p(R; R_\mu, \Sigma_{so(3)}) = \mathcal{N}\left(\left(\operatorname{Log}_{R_\mu}(R)\right)^\vee; 0, \Sigma_{so(3)}\right). \tag{73}$$

In order to sample from the distribution in Eq. (73), we sample $w = W^\vee \sim \mathcal{N}(0, \Sigma_{so(3)})$ and map from $so(3)$ to $SO(3)$ using the exponential map $\operatorname{Exp}_{R_\mu}: so(3) \to SO(3)$ from Eq. (12) and rotating by $R_\mu$: $R = \operatorname{Exp}_{R_\mu}(W) = R_\mu R(w, \|w\|_2)$. For further details on $SO(3)$, the log and exp maps, and the relation to the Lie Algebra $so(3)$, refer to [52], [57].

ACKNOWLEDGMENTS

We thank J. Chang for his help with the split/merge proposals, R. Cabezas for his help with the Cambridge dataset, T. Whelan for giving access to Kintinuous and R. Finman for helping to scan indoor environments. J.S., O.F., J.L. and J.F. were partially supported by the ONR MURI program, award N00014-11-1-0688. J.S., O.F. and J.F. were also partially supported by DARPA, award FA8650-11-1-7154. G.R. was partially funded by an MIT-Technion Postdoctoral Fellowship.

REFERENCES

[1] J. M. Coughlan and A. L. Yuille, “Manhattan world: Compass direction from a single image by Bayesian inference,” in Proc. 7th IEEE Int. Conf. Comput. Vis., 1999, pp. 941–947.
[2] B. K. P. Horn, “Extended Gaussian images,” Proc. IEEE, vol. 72, no. 12, pp. 1671–1686, Dec. 1984.
[3] D. Holz, S. Holzer, and R. B. Rusu, “Real-time plane segmentation using RGB-D cameras,” in Proc. RoboCup Symp., 2011, pp. 306–317.
[4] J. Straub, N. Bhandari, J. J. Leonard, and J. W. Fisher III, “Real-time Manhattan world rotation estimation in 3D,” in IEEE/RSJ Int. Conf. Intell. Robots Syst., 2015, pp. 1913–1920.
[5] N. J. Mitra, A. Nguyen, and L. Guibas, “Estimating surface normals in noisy point cloud data,” Int. J. Comput. Geometry Appl., vol. 14, no. 04n05, pp. 261–276, 2004.
[6] M. Botsch, L. Kobbelt, M. Pauly, P. Alliez, and B. Levy, Polygon Mesh Processing. Boca Raton, FL, USA: CRC Press, 2010.
[7] R. Cabezas, J. Straub, and J. W. Fisher III, “Semantically-aware aerial reconstruction from multi-modal data,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 2156–2164.
[8] G. Schindler and F. Dellaert, “Atlanta world: An expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., 2004, pp. I-203–I-209.
[9] J. Straub, G. Rosman, O. Freifeld, J. J. Leonard, and J. W. Fisher III, “A mixture of Manhattan frames: Beyond the Manhattan world,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 3770–3777.
[10] J. Straub, J. Chang, O. Freifeld, and J. W. Fisher III, “A Dirichlet process mixture model for spherical data,” in Proc. 18th Int. Conf. Artif. Intell. Statist., 2015, pp. 930–938.

[11] J. Straub, T. Campbell, J. P. How, and J. W. Fisher III, “Small-variance nonparametric clustering on the hypersphere,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 334–342.
[12] W. K. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, vol. 57, pp. 97–109, 1970.
[13] S. Richardson and P. J. Green, “On Bayesian analysis of mixtures with an unknown number of components (with discussion),” J. Roy. Statist. Soc., vol. 59, no. 4, pp. 731–792, 1997.
[14] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, Apr. 2004.
[15] O. Barinova, V. Lempitsky, E. Tretiak, and P. Kohli, “Geometric image parsing in man-made environments,” in Proc. 11th Eur. Conf. Comput. Vis., 2010, pp. 57–70.
[16] E. Delage, H. Lee, and A. Y. Ng, “Automatic single-image 3D reconstructions of indoor Manhattan world scenes,” in Robotics Research. Berlin, Germany: Springer, 2007, pp. 305–321.
[17] D. C. Lee, M. Hebert, and T. Kanade, “Geometric reasoning for single image structure recovery,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 2136–2143.
[18] V. Hedau, D. Hoiem, and D. Forsyth, “Recovering the spatial layout of cluttered rooms,” in Proc. IEEE 12th Int. Conf. Comput. Vis., 2009, pp. 1849–1856.
[19] C. Liu, A. G. Schwing, K. Kundu, R. Urtasun, and S. Fidler, “Rent3D: Floor-plan priors for monocular layout estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3413–3421.
[20] J. Kosecka and W. Zhang, “Video compass,” in Proc. 7th Eur. Conf. Comput. Vis., 2002, pp. 476–490.
[21] M. E. Antone and S. Teller, “Automatic recovery of relative camera rotations for urban scenes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2000, pp. 282–289.
[22] J.-C. Bazin and M. Pollefeys, “3-line RANSAC for orthogonal vanishing point detection,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012, pp. 4282–4287.
[23] T. Kroeger, D. Dai, and L. Van Gool, “Joint vanishing point extraction and tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 2449–2457.
[24] P. Moghadam and J. F. Dong, “Road direction detection based on vanishing-point tracking,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012, pp. 1553–1560.
[25] M. Bosse, R. Rikoski, J. Leonard, and S. Teller, “Vanishing points and three-dimensional lines from omni-directional video,” Visual Comput., vol. 19, no. 6, pp. 417–430, 2003.
[26] A. Flint, D. Murray, and I. Reid, “Manhattan scene understanding using monocular, stereo, and 3D features,” in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 2228–2235.
[27] O. Saurer, F. Fraundorfer, and M. Pollefeys, “Homography based visual odometry with known vertical direction and weak Manhattan world assumption,” in Proc. IEEE/IROS Workshop Visual Control Mobile Robots, 2012.
[28] P. Denis, J. H. Elder, and F. J. Estrada, “Efficient edge-based methods for estimating Manhattan frames in urban imagery,” in Proc. 10th Eur. Conf. Comput. Vis., 2008, pp. 197–210.
[29] S. T. Barnard, “Interpreting perspective images,” Artif. Intell., vol. 21, pp. 435–462, 1983.
[30] R. T. Collins and R. S. Weiss, “Vanishing point calculation as a statistical inference on the unit sphere,” in Proc. 3rd Int. Conf. Comput. Vis., 1990, pp. 400–403.
[31] J. Lezama, R. Gioi, G. Randall, and J.-M. Morel, “Finding vanishing points via point alignments in image primal and dual domains,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 509–515.
[32] H. Wildenauer and A. Hanbury, “Robust camera self-calibration from monocular images of Manhattan worlds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2012, pp. 2831–2838.
[33] Y. Xu, S. Oh, and A. Hoogs, “A minimum error vanishing point detection approach for uncalibrated monocular images of man-made environments,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1376–1383.
[34] E. Lutton, H. Maitre, and J. Lopez-Krahe, “Contribution to the determination of vanishing points using Hough transform,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 4, pp. 430–438, Apr. 1994.
[35] C. Rother, “A new approach to vanishing point detection in architectural environments,” Image Vis. Comput., vol. 20, pp. 647–655, 2002.
[36] B. Caprile and V. Torre, “Using vanishing points for camera calibration,” Int. J. Comput. Vis., vol. 4, pp. 127–139, 1990.
[37] R. Cipolla, T. Drummond, and D. P. Robertson, “Camera calibration from vanishing points in image of architectural scenes,” in Proc. British Mach. Vis. Conf., 1999, pp. 38.1–38.10.
[38] M. Antunes and J. P. Barreto, “A global approach for the detection of vanishing points and mutually orthogonal vanishing directions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1336–1343.
[39] J.-P. Tardif, “Non-iterative approach for fast and accurate vanishing point detection,” in Proc. IEEE 12th Int. Conf. Comput. Vis., 2009, pp. 1250–1257.
[40] P. Kohli, N. Silberman, D. Hoiem, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 746–760.
[41] S. Gupta, P. Arbelaez, and J. Malik, “Perceptual organization and recognition of indoor scenes from RGB-D images,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 564–571.
[42] A. Monszpart, N. Mellado, G. J. Brostow, and N. J. Mitra, “Rapter: Rebuilding man-made scenes with regular arrangements of planes,” ACM Trans. Graph., vol. 34, no. 4, pp. 103:1–103:12, 2015.
[43] B. Peasley, S. Birchfield, A. Cunningham, and F. Dellaert, “Accurate on-line 3D occupancy grids using Manhattan world constraints,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012, pp. 5283–5290.
[44] J. J. Leonard and H. F. Durrant-Whyte, “Simultaneous map building and localization for an autonomous mobile robot,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 1991, pp. 1442–1447.
[45] B. Ghanem, A. Thabet, J. C. Niebles, and F. Caba Heilbron, “Robust Manhattan frame estimation from a single RGB-D image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3772–3780.
[46] K. Joo, T.-H. Oh, J. Kim, and I. S. Kweon, “Globally optimal Manhattan frame estimation in real-time,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 1763–1771.
[47] R. Triebel, W. Burgard, and F. Dellaert, “Using hierarchical EM to extract planes from 3D range scans,” in Proc. IEEE Int. Conf. Robot. Autom., 2005, pp. 4437–4442.
[48] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski, “Reconstructing building interiors from images,” in Proc. IEEE Int. Conf. Comput. Vis., 2009, pp. 80–87.
[49] N. Neverova, D. Muselet, and A. Tremeau, “2 1/2 D scene reconstruction of indoor scenes from single RGB-D images,” in Proc. 4th Int. Workshop Comput. Color Imag., 2013, pp. 281–295.
[50] A. E. Johnson and M. Hebert, “Surface registration by matching oriented points,” in Proc. Int. Conf. Recent Advances 3-D Digit. Imag. Model., 1997, pp. 121–128.
[51] M. Kazhdan, “Reconstruction of solid models from oriented point sets,” in Proc. 3rd Eurographics Symp. Geometry Process., 2005, Art. no. 73.
[52] G. S. Chirikjian, Stochastic Models, Information Theory, and Lie Groups, Volume 2: Analytic Methods and Modern Applications, vol. 2. Berlin, Germany: Springer, 2011.
[53] C. Bingham, “An antipodally symmetric distribution on the sphere,” Ann. Statist., vol. 2, no. 6, pp. 1201–1225, 1974.
[54] J. T. Kent, “The Fisher-Bingham distribution on the sphere,” J. Roy. Statist. Soc., vol. 44, pp. 71–80, 1982.
[55] K. V. Mardia and P. E. Jupp, , vol. 494. Hoboken, NJ, USA: Wiley, 2009.
[56] P. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton, NJ, USA: Princeton Univ. Press, 2009.
[57] M. P. do Carmo, Riemannian Geometry. Basel, Switzerland: Birkhäuser Verlag, 1992.
[58] O. Rodrigues, Des lois geometriques qui regissent les deplacements d’un systeme solide dans l’espace: Et de la variation des cordonnees provenant de ces deplacements consideres independamment des causes qui peuvent les produire, 1840.
[59] S. L. Altmann, Rotations, Quaternions, and Double Groups. North Chelmsford, MA, USA: Courier Corporation, 2005.
[60] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis. Boca Raton, FL, USA: CRC Press, 2013.
[61] N. I. Fisher, Statistical Analysis of Circular Data. Cambridge, U.K.: Cambridge Univ. Press, 1995.
[62] A. Banerjee, I. S. Dhillon, J. Ghosh, S. Sra, and G. Ridgeway, “Clustering on the unit hypersphere using von Mises-Fisher distributions,” J. Mach. Learn. Res., vol. 6, no. 9, pp. 1345–1382, 2005.

[63] M. Bangert, P. Hennig, and U. Oelfke, “Using an infinite von Mises-Fisher mixture model to cluster treatment beam directions in external radiation therapy,” in Proc. 9th Int. Conf. Mach. Learn. Appl., 2010, pp. 746–751.
[64] S. Gopal and Y. Yang, “Von Mises-Fisher clustering models,” in Proc. Int. Conf. Mach. Learn., 2014, pp. 154–162.
[65] G. Nunez-Antonio and E. Gutierrez-Pena, “A Bayesian analysis of directional data using the von Mises-Fisher distribution,” Commun. Statist. Simul. Comput., vol. 34, pp. 989–999, 2005.
[66] J. Gallier, “Basics of classical Lie groups: The exponential map, Lie groups, and Lie algebras,” in Geometric Methods and Applications. Berlin, Germany: Springer, 2001, pp. 367–414.
[67] E. Eade, “Monocular simultaneous localisation and mapping,” PhD dissertation, Dept. Eng., Univ. Cambridge, Cambridge, U.K., 2008.
[68] P. H. Schönemann, “A generalized solution of the orthogonal procrustes problem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
[69] S. Umeyama, “Least-squares estimation of transformation parameters between two point patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 4, pp. 376–380, Apr. 1991.
[70] C. Khatri and K. Mardia, “The von Mises-Fisher matrix distribution in orientation statistics,” J. Roy. Statist. Soc., vol. 39, pp. 95–106, 1977.
[71] N. Boumal, A. Singer, P.-A. Absil, and V. D. Blondel, “Cramer–Rao bounds for synchronization of rotations,” Inf. Inference, vol. 3, no. 1, pp. 1–39, 2014.
[72] P. J. Green, “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination,” Biometrika, vol. 82, pp. 711–732, 1995.
[73] B. K. Horn, “Closed-form solution of absolute orientation using unit quaternions,” J. Opt. Soc. America A, vol. 4, no. 4, pp. 629–642, 1987.
[74] J. Chang and J. W. Fisher III, “Parallel sampling of DP mixture models using sub-clusters splits,” in Proc. 26th Int. Conf. Neural Inf. Process. Syst., 2013, pp. 620–628.
[75] T. Whelan, M. Kaess, M. Fallon, H. Johannsson, J. Leonard, and J. McDonald, “Kintinuous: Spatially extended KinectFusion,” in Proc. RSS Workshop RGB-D Adv. Reasoning Depth Cameras, vol. 35, no. 14, 2015, pp. 1697–1716.
[76] T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker, and A. Davison, “Elasticfusion: Dense SLAM without a pose graph,” in Proc. Robot. Sci. Syst., 2015.
[77] S. Fleishman, D. Cohen-Or, and C. T. Silva, “Robust moving least-squares fitting with sharp features,” in Proc. SIGGRAPH, 2005, pp. 544–552.
[78] X. Pennec, “Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements,” in Proc. IEEE Workshop Nonlinear Signal Image Process., 1999, pp. 194–198.

Julian Straub received the graduate diplom from the Technische Universität München, in 2012 and the MSc degree from the Georgia Institute of Technology, in 2012, both in electrical engineering, within the ATLAS double degree program of the European Union. Currently, he is working toward the PhD degree in computer science specializing in artificial intelligence at MIT. His research is centered on computer vision, Bayesian inference, and 3D environment perception for robots. He is a member of the IEEE.

Oren Freifeld received the BSc and MSc degrees in biomedical engineering from Tel-Aviv University, in 2005 and 2007, respectively, and the ScM and PhD degrees in applied mathematics from Brown University, in 2009 and 2013, respectively. Later, he was a postdoc with MIT Computer Science & Artificial Intelligence Laboratory. He is currently an assistant professor in the Department of Computer Science, Ben-Gurion University. His main research interests include computer vision, statistical inference, and machine learning. He is a member of the IEEE.

Guy Rosman received the BSc, MSc, and PhD degrees from the Technion (with the Jacobs-Qualcomm fellowship), Computer Science Department, in 2004, 2008, and 2013, respectively. He is a post-doctoral fellow with MIT/CSAIL, where he received the Technion-MIT post-doctoral fellowship and is working in the Distributed Robotics Lab and the Sensing, Learning and Inference Group. He has worked at several companies/labs, including IBM/HRL, RAFAEL, Medicvision, and Invision Biometrics. He is a member of the IEEE.

John J. Leonard received the BSEE degree in electrical engineering and science from the University of Pennsylvania, in 1987 and the DPhil degree in engineering science from the University of Oxford, in 1994. He is Samuel C. Collins professor of mechanical and ocean engineering. He is also a member of the MIT Computer Science and Artificial Intelligence Laboratory. His research addresses the problems of navigation and mapping for autonomous mobile robots. He has been a fellow of the IEEE since 2014.

John W. Fisher III (M’98) received the PhD degree in electrical and computer engineering from the University of Florida (UF), Gainesville, in 1997. He is currently a principal research scientist in the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), Cambridge, and affiliated with the Laboratory for Information and Decision Systems, MIT. Prior to joining MIT, he had been affiliated with UF as both a faculty member and graduate student since 1987. His current research focuses on information theoretic approaches to signal processing, multimodal data fusion, machine learning, and computer vision. He is a member of the IEEE.