Robust SfM with Little Image Overlap

Yohann Salaün 1,2, Renaud Marlet 1, and Pascal Monasse 1
1 LIGM, UMR 8049, École des Ponts, UPE, Champs-sur-Marne, France
2 CentraleSupélec, Châtenay-Malabry, France
{yohann.salaun,renaud.marlet,pascal.monasse}@imagine.enpc.fr

Abstract

Usual Structure-from-Motion (SfM) techniques require at least trifocal overlaps to calibrate cameras and reconstruct a scene. We consider here scenarios of reduced image sets with little overlap, possibly as low as two images at most seeing the same part of the scene. We propose a new method, based on line coplanarity hypotheses, for estimating the relative scale of two independent bifocal calibrations sharing a camera, without the need of any trifocal information or Manhattan-world assumption. We use it to compute SfM in a chain of up-to-scale relative motions. For accuracy, we however also make use of trifocal information for line and/or point features, when present, relaxing usual trifocal constraints. For robustness to wrong assumptions and mismatches, we embed all constraints in a parameterless RANSAC-like approach. Experiments show that we can calibrate datasets that could not be calibrated previously, and that this wider applicability does not come at the cost of accuracy.

arXiv:1703.07957v2 [cs.CV] 28 Mar 2017

1. Introduction

Structure-from-Motion (SfM) has made spectacular improvements concerning scalability [9, 43, 15], accuracy [26, 6, 3, 40] and robustness [42, 28]. However, most approaches assume a significant amount of overlap between images. In this work, we consider the case where images have only little overlap, possibly as low as two images at most seeing the same part of the scene.

This situation commonly occurs when a scene is photographed by people with little or no knowledge in photogrammetry. Even when informed, they can make occasional mistakes, widening a baseline too much, or shooting a low-quality image that cannot be exploited and has to be skipped. Another example is when exploiting street views, e.g., taken from a vehicle, if the viewpoints are too distant from one another, or if the camera is too close to the facades because of narrow sidewalks. In such cases, SfM methods may yield partial or fragmented calibrations.

It also applies to situations where only a small number of pictures can be shot, because of physical or time constraints. It may be the case when digitizing a building which is still in use, not to disturb occupants. Reducing the number of images, to only a few per room for indoor scenes, is also a way to reduce the cost and time for acquiring and processing information, as long as a minimum level of accuracy can still be reached for the targeted application.

Another issue concerns the lack of texture in environments such as building interiors, as it greatly reduces the amount of feature points detected in images, also leading to uneven feature distributions. Besides, if the number of images is small, it is likely that the baselines are wide, as well as the view angles between overlapping images. Consequently, because of perspective distortion, point descriptors are harder to match; relaxing matching thresholds does not fix the problem, as it introduces outliers. Moreover, detected points may unknowingly lie on a single plane, possibly giving rise to degenerate configurations for camera registration. On the contrary, lines do not suffer from lack of texture and are prevalent in interiors, where they often occur at plane intersections and object boundaries, see Fig. 1. Furthermore, lines are robust to significant changes of viewpoints, although their matching eventually also degrades.

Our approach thus leverages lines for pose estimation and reconstruction, although it also exploits points when available. Our main contributions are as follows:

• Given two successive bifocal calibrations, i.e., sharing a middle camera, we propose a novel method for computing their relative scale without the need for trifocal features. It is based on line coplanarity hypotheses, which is relevant in man-made scenes (cf. Sect. 4).
• To however exploit trifocality when present, we show how to relax usual trifocal constraints, and integrate them with coplanarity constraints in a unified framework. These relaxed constraints only need one feature triplet and two of the bifocal calibrations (cf. Sect. 5).
• Robustness to outliers in this context (cf. Sect. 6) requires parameters that may be hard to set given the variety of scenes and constraint types. We propose a parameterless method that integrates all constraints and

Figure 1. Top row: 3 consecutive views of the Office-P19 dataset, with detected line segments. Bottom row: views of this area with reconstructed structures: points (left), lines (middle), bounding boxes of planes associated to clusters of coplanar lines after BA (right).

adapts automatically to the scene diversity (cf. Sect. 7).
• Using both standard SfM datasets with ground truth and difficult interior scenes with little overlap and little texture, we validate empirically our technical choices and we show (1) that we can calibrate images that cannot be calibrated with existing point- or line-based SfM methods, and (2) that our accuracy is on par with state-of-the-art SfM methods on less hard datasets.

Our method does not assume any Manhattan configuration.

2. Related Work

Incremental SfM methods [37, 25, 43] register a new image to a partial model already constructed, using 3D-2D correspondences. A number of Perspective-n-Points (PnP) algorithms [20, 11] have been proposed for solving this resection problem. Whatever the method, at least 3 correspondences common to 3 images are required: three 3D points visible in the same 3 images, and up to a minimum of 6 [49].

Hierarchical SfM methods, that additionally merge partial models, also have similar constraints. In [14], the two models to merge are overlapping in the sense that they share one or several images, and pairs of 3D points projecting to the same 2D features in both models are used to relate the models, which implies tracks of length 3 or more. In [41], models are merged via feature matches between images separately associated to each model, requiring that the same four 3D points are reconstructed in both models, which implies in turn tracks of length 4 (connections between 2 tracks of length at least 2). Even with relaxed requirements where merging uses 4 points that are seen but not necessarily reconstructed in the other model [10], feature tracks of length 3 are required.

As for global SfM methods, their main objective is merging relative motions between two cameras into a consistent graph of all cameras. Besides robustness concerns, to get rid of outlier edges, and various approaches to average rotations, one of the main issues is that the relative translations are only given up to an unknown scale factor; only their directions are known. Most methods to infer global translations rely on information redundancy assuming a densely-connected graph [12, 6], or on additional information from trifocal tensors [36, 26] (hence requiring 4 tracked points across 3 views). A number of other methods [18, 31, 2] compute the global translations, possibly along with the 3D points, by solving equations relating points visible in two images; however, they implicitly assume that enough points are visible in at least 3 images to cancel the degrees of freedom of the relative scale factors. Besides, they do not all address point match outliers.

The situation is similar with line-based SfM. In [47], an initial image triplet with common line matches is required. Then, given a partial model, Perspective-n-Line (PnL) methods estimate the pose of a camera in which three 3D lines reproject [23, 48], which implies that at least 3 lines are visible in 3 views. A minimum of 6 lines is sometimes even desirable to prevent noise sensitivity [23], if not 9 for applicability and 25 for accuracy [29].

More generally, when associating both points and lines in a “Perspective-n-Features” framework, a minimum of 3 features visible in 3 views is still required [30, 44].

Our approach for relating the scale of two bifocal calibrations is based on coplanar line pairs. To our knowledge, coplanar lines have been used for pose estimation, but only in a two-view context and with a Manhattan-world assumption, to identify planar structures [19]. A related topic is plane-based SfM, but it has mostly been studied assuming a prior knowledge (user-given) about the scene planes [39, 4] or in tracking scenarios with videos [50]. In [32, 18], a reference plane is used to estimate both the global translations and 3D points, but it must be visible in all images.

A related work regarding the estimation of scale factors and the identification of planar structures concerns direct structure estimation (DES) via homography estimations, with the computation of coplanar point clusters, but it does not estimate poses and it also relies on trifocal points [17].

Line triangulation and line bundle adjustment have been well studied given an initial global pose estimation [5], but not associated to coplanarity issues.

3. From Relative to Global Pose Estimation

We consider the case where the epipolar graph mainly contains long chains or long cycles with little or no trifocal relations. For every edge in the graph between camera $i$ and camera $j$, we assume the relative pose $(R_{ij}, t_{ij})$ known (estimated), where $R_{ij}$ is the relative rotation and $t_{ij}$ is the unit-norm translation direction. We are interested in estimating the scale factors $\lambda_{ij}$ relating the translation directions $t_{ij}$ to the global relative translations $T_{ij} = \lambda_{ij} t_{ij}$, that is, up to a single global scale factor.

In the following, we only consider the case of a single chain or cycle of cameras, for which we estimate global poses $(R_j, T_j)$, where rotations $R_j$, translations $T_j$ as well as camera centers $C_j$ are defined in the same reference frame. Yet, our pose estimation method can be integrated in a general global SfM framework for arbitrary graphs, e.g., as described in [35] to evenly distribute errors over the whole graph in trying to satisfy the coherency constraints:

$$R_j = R_{ij} R_i, \qquad T_j = R_{ij} T_i + T_{ij}, \qquad C_j = -R_j^\top T_j. \tag{1}$$

This could be associated to a method to remove outlier edges and enforce cycle consistency [13, 45, 8].

When considering a single chain or cycle of cameras, global motions are recursively defined as:

$$R_1 = I \quad \text{and} \quad R_{j+1} = R_{j,j+1} R_j \tag{2}$$
$$T_1 = 0 \quad \text{and} \quad T_{j+1} = R_{j,j+1} T_j + \lambda_{j,j+1} t_{j,j+1} \tag{3}$$

As the global pose remains defined up to an unknown scale factor, we additionally set $\lambda_{12} = 1$: distances are thus defined with unit length $\lambda_{12}$. (In case of a cycle, we could also include epipolar constraints to close the loop and distribute errors as in [35], but we do not in our implementation.) Finally, a bundle adjustment refines the initial pose estimation.

In the following, to simplify notations, we assume that features are normalized, i.e., with camera intrinsic parameter matrices $K_j = I$. We also denote an arbitrary triplet of successive cameras 1, 2, 3 rather than $j$, $j+1$, $j+2$.

Figure 2. If $L_a$ and $L_b$, without trifocal overlap, are known to be coplanar, it is enough to place $C_1, C_2, C_3$ in the same reference frame knowing only bifocal calibrations for cameras 1-2 and 2-3.

4. Coplanarity Constraint

Let $L_a$ and $L_b$ be two non-parallel 3D line segments in a plane $\mathcal{P}$. (Coplanarity here is actually just a hypothesis, to be validated in a RANSAC-like manner, see Sect. 6.) Suppose $L_a$ only appears on cameras 1 and 2, whereas $L_b$ only appears on cameras 2 and 3. Let $l_i^j$ be the projection of $L_i$ on camera $j$. (See Fig. 2.)

The coordinates of a 3D point $P$ in the global reference frame relate to its projection $p^j$ on camera $j$ and depth $z_p^j$:

$$P = R_j^\top (z_p^j p^j - T_j) \tag{4}$$

Assuming $P \in L_i$, we also have for any camera $k$:

$$l_i^k \cdot (R_k P + T_k) = 0, \tag{5}$$

where “$\cdot$” denotes scalar product. Combining (4) and (5), and relating global to relative motions using (1), we can express the depth $z_p^j$ in terms of the relative pose information:

$$l_i^k \cdot (R_k R_j^\top (z_p^j p^j - T_j) + T_k) = l_i^k \cdot (R_{jk} z_p^j p^j + T_{jk}) = 0 \;\Leftrightarrow\; z_p^j = -\frac{l_i^k \cdot T_{jk}}{l_i^k \cdot (R_{jk} p^j)}. \tag{6}$$

We also know that the normal to plane $\mathcal{P}$ is given by:

$$n_{\mathcal{P}} = d_{L_a} \times d_{L_b} = (R_1^\top l_a^1 \times R_2^\top l_a^2) \times (R_2^\top l_b^2 \times R_3^\top l_b^3) \tag{7}$$

where $d_{L_i}$ is the 3D direction of line $L_i$. Let $M$ be a point in $\mathcal{P}$. As any line in $\mathcal{P}$ is orthogonal to $n_{\mathcal{P}}$, we have:

$$P \in \mathcal{P} \;\Leftrightarrow\; n_{\mathcal{P}} \cdot (P - M) = 0 \tag{8}$$

Since $P \in L_i \subset \mathcal{P}$, using (4) and (6) yields:

$$\begin{aligned}
& n_{\mathcal{P}} \cdot (R_j^\top (z_p^j p^j - T_j) - M) = 0 \\
\Leftrightarrow\; & n_{\mathcal{P}} \cdot \Big( -\frac{l_i^k \cdot T_{jk}}{l_i^k \cdot (R_{jk} p^j)} R_j^\top p^j + C_j - M \Big) = 0 \\
\Leftrightarrow\; & n_{\mathcal{P}} \cdot (C_j - M) = \frac{(l_i^k \cdot T_{jk})\,(n_{\mathcal{P}} \cdot (R_j^\top p^j))}{l_i^k \cdot (R_{jk} p^j)}
\end{aligned} \tag{9}$$

Now using (9) both for a point $P_a \in L_a$ with $j = 2$ and $k = 1$, and a point $P_b \in L_b$ with $j = 2$ and $k = 3$, we get:

$$\begin{aligned}
& \frac{(l_b^3 \cdot T_{23})\,(n_{\mathcal{P}} \cdot (R_2^\top p_b^2))}{l_b^3 \cdot (R_{23} p_b^2)} = \frac{(l_a^1 \cdot T_{21})\,(n_{\mathcal{P}} \cdot (R_2^\top p_a^2))}{l_a^1 \cdot (R_{21} p_a^2)} \\
\Leftrightarrow\; & \frac{l_b^3 \cdot T_{23}}{l_a^1 \cdot T_{21}} = \frac{(l_b^3 \cdot (R_{23} p_b^2))\,(n_{\mathcal{P}} \cdot (R_2^\top p_a^2))}{(l_a^1 \cdot (R_{21} p_a^2))\,(n_{\mathcal{P}} \cdot (R_2^\top p_b^2))} \\
\Leftrightarrow\; & \frac{\lambda_{23}}{\lambda_{21}} = \frac{(l_b^3 \cdot (R_{23} p_b^2))\,(n_{\mathcal{P}} \cdot (R_2^\top p_a^2))\,(l_a^1 \cdot t_{21})}{(l_a^1 \cdot (R_{21} p_a^2))\,(n_{\mathcal{P}} \cdot (R_2^\top p_b^2))\,(l_b^3 \cdot t_{23})}
\end{aligned} \tag{10}$$

Eq. (10) expresses the signed ratio of the distances between 3 successive camera centers in the same reference frame, in terms of only 2-view relative pose information, plus weak information about 2 coplanar lines that do not have to be visible in all 3 views. It has 3 degenerate configurations that can be assessed and controlled with an angular threshold.

Note that this criterion does not depend on the endpoints of line segments, which are notoriously inaccurate. Line segment matches can actually be grossly wrong because of over-segmentation, which is a common weakness of line segment detectors. On the contrary, our criterion only relies on (infinite) lines, which are much more stable. It can cope with line segment matches that are wrong w.r.t. endpoints but correct w.r.t. the supporting 3D line.

5. Simplified Trifocal Constraints, When Any

In case a line or point is visible in three successive views 1-2-3, we consider a corresponding simplified trifocal constraint, that also involves the scale factors $\lambda_{12}$, $\lambda_{23}$. For both kinds of features, we do as follows:

1. We set arbitrarily the first two camera centers given their relative pose: $C_1 = 0$, $\lambda_{12} = 1$, $C_2 = -R_2^\top T_{12}$.
2. We compute the approximate 3D reconstruction of the feature from the first 2 camera poses, as well as its projection on the third camera.
3. We compute the scale factor $\lambda_{23}$ such that the projected feature on camera 3 corresponds best to the detection.

As in step 1 we set $\lambda_{12} = 1$, the computed scale factor $\lambda_{23}$ is actually also the ratio $\tau_{123} = \lambda_{23}/\lambda_{12}$. For a more robust and accurate estimation, we can symmetrize the setting: permuting the role of cameras and averaging the resulting scale factors. As we just consider here a chain of successive views, it makes sense to actually only take into account the case where cameras 1 and 3 are swapped, keeping camera 2 as central. It prevents camera configurations 1-3 or 3-1, which are likely to feature wider changes of viewpoints, i.e., wider baselines and view angles. Bifocal calibration 1-3 may even fail and not be available. (In our implementation, we thus only compute the ratios $\tau_{123}$ and $\tau_{321}$, retaining the average $\tau = (\tau_{123} + 1/\tau_{321})/2$.) Note that, as opposed to usual trifocal constraints that require at least three trifocal features, one is enough here.

In step 3, as detailed below, the scale factor $\lambda_{23}$ is expressed in the form:

$$\lambda_{23} = \arg\min_{\lambda \in \mathbb{R}} \frac{\|u \times (v + \lambda w)\|}{\|u\| \, \|v + \lambda w\|}, \tag{11}$$

where $u, v, w$ are known vectors in $\mathbb{R}^3$. To find the minimum, we look for the values of $\lambda$ that make the derivative vanish. It leads to a third-degree polynomial, that simplifies into a second-degree polynomial, whose roots can be tested to find the minimum of the original expression. As above, this formula has a degenerate configuration that can be checked and discarded using an angular threshold.

Trifocal point constraint. Given a triplet of matched points $\hat{p} = (p_1, p_2, p_3)$ in cameras 1-2-3, we can estimate the corresponding 3D point $\tilde{P}$ from $p_1, p_2$, assuming $\lambda_{12} = 1$, and reproject $\tilde{P}$ as $\tilde{p}_3$ on camera 3, for some given $\lambda_{23}$:

$$\tilde{p}_3 = R_3(\tilde{P} - C_3) = R_3(\tilde{P} - C_2) - \lambda_{23} t_{23}$$

We look for the scale factor $\lambda_{23}^*$ that makes $\tilde{p}_3$ the closest to the observation $p_3$. Rather than minimizing the distance between the two 2D points, we minimize the angle $\widehat{p_3 C_3 \tilde{p}_3}$:

$$\lambda_{23}^* = \arg\min_{\lambda_{23} \in \mathbb{R}} \frac{\|p_3 \times \tilde{p}_3\|}{\|p_3\| \, \|\tilde{p}_3\|} = \arg\min_{\lambda_{23} \in \mathbb{R}} \frac{\|p_3 \times (R_3(\tilde{P} - C_2) - \lambda_{23} t_{23})\|}{\|p_3\| \, \|R_3(\tilde{P} - C_2) - \lambda_{23} t_{23}\|}, \tag{12}$$

which is in the form (11).

Trifocal line constraint. Given a triplet of matched line segments $\hat{l} = (l_1, l_2, l_3)$ in cameras 1-2-3, we can estimate the corresponding 3D line $\tilde{L}$ from $l_1, l_2$, assuming $\lambda_{12} = 1$, and reproject $\tilde{L}$ as $\tilde{l}_3$ on camera 3, for some given $\lambda_{23}$. As $\tilde{l}_3$ also is the normal of the plane defined by $C_3$ and $\tilde{L}$, we have for any point $\tilde{P}$ on $\tilde{L}$:

$$\tilde{l}_3 \propto R_3[d_{\tilde{L}} \times (\tilde{P} - C_3)]$$

We look for the scale factor $\lambda_{23}^*$ that makes $\tilde{l}_3$ the most collinear with the observation $l_3$, i.e., that minimizes the relative angle between $l_3$ and $\tilde{l}_3$:

$$\begin{aligned}
\lambda_{23}^* &= \arg\min_{\lambda_{23} \in \mathbb{R}} \frac{\|l_3 \times \tilde{l}_3\|}{\|l_3\| \, \|\tilde{l}_3\|}
= \arg\min_{\lambda_{23} \in \mathbb{R}} \frac{\|l_3 \times (R_3[d_{\tilde{L}} \times (\tilde{P} - C_3)])\|}{\|l_3\| \, \|R_3[d_{\tilde{L}} \times (\tilde{P} - C_3)]\|} \\
&= \arg\min_{\lambda_{23} \in \mathbb{R}} \frac{\|l_3 \times (R_3[d_{\tilde{L}} \times (\tilde{P} - C_2)] - \lambda_{23} R_3[d_{\tilde{L}} \times t_{23}])\|}{\|l_3\| \, \|R_3[d_{\tilde{L}} \times (\tilde{P} - C_2)] - \lambda_{23} R_3[d_{\tilde{L}} \times t_{23}]\|}
\end{aligned} \tag{13}$$

which is in the form of (11). As for coplanarity (Sect. 4), this constraint does not depend on line segment endpoints.

6. Robust Estimation

As there is no oracle to safely pick non-parallel coplanar lines (cf. Sect. 4), we adopt a RANSAC-like approach to sample candidate line pairs and select the associated scale $\lambda_{23}$ with which the largest number of pairs agree. More generally, we also sample and check agreement w.r.t. trifocal points and lines when any (cf. Sect. 5), which provides robustness to wrong detections and matching too. For this, we have to define a measure of residual error for the three different kinds of features (hypothesized coplanar line pairs, trifocal points, trifocal lines), given a presumed $\lambda_{23} = \lambda$ obtained from the sample (discarding degenerate cases).

Coplanar lines. For any quadruplet of line segments $\check{l} = (l_a^1, l_a^2, l_b^2, l_b^3)$ s.t. $l_a^1, l_a^2$ match in cameras 1-2 and $l_b^2, l_b^3$ match in cameras 2-3, we estimate the associated 3D lines $L_a, L_b$ and consider the 3D point $P_{ab} \in L_a$ (resp. $P_{ba} \in L_b$) that is the closest to $L_b$ (resp. $L_a$) in 3D, with $P_{ab} = P_{ba}$ if $L_a, L_b$ are coplanar. We then consider their reprojection $\tilde{p}_{ab}^2, \tilde{p}_{ba}^2$ on the shared camera 2. The residual error is their pixel distance: $d_{co}(\check{l}) = d_{co}(L_a, L_b) = d(\tilde{p}_{ab}^2, \tilde{p}_{ba}^2)$.

Considering $\tilde{p}_{ab}^2, \tilde{p}_{ba}^2$ rather than $P_{ab}, P_{ba}$ removes one dimension of error, along the direction of $C_2$, possibly constraining $\lambda$ less. But positioning by triangulation along this direction is generally less accurate and thus less meaningful, leading people to rather use reprojected distances and rely on other views to capture any error along this direction.

To avoid degenerate cases with mostly parallel lines, we also discard line pairs whose 3D directions are similar. (In our experiments, we use a threshold of 15°.) Note that these directions can be computed from the global rotations only, before global translations are estimated. As there can be many candidate pairs for coplanarity, we only consider, for each segment in camera 2 having a match in camera $j$, its $N$ closest neighbors having a match in the other camera, with a distance defined as the minimum distance between segment endpoints. (In our experiments, $N = 10$.)

Trifocal point error. For each triplet of matched points $\hat{p} = (p_1, p_2, p_3)$ in cameras 1-2-3, we estimate the corresponding 3D point $P$ from $p_1, p_2$, and reproject it as $\tilde{p}_3$ on camera 3 (assuming $\lambda_{23}$), as in Sect. 5. The residual error is the pixel distance of reprojection $\tilde{p}_3$ to the observed detection $p_3$. For robustness, we actually symmetrize this measure by swapping cameras 1 and 3, and averaging the distances: $d_{pt}(\hat{p}) = (d(\tilde{p}_1, p_1) + d(\tilde{p}_3, p_3))/2$.

Trifocal line error. For each triplet of matched line segments $\hat{l} = (l_1, l_2, l_3)$ in cameras 1-2-3, we estimate the corresponding 3D line $L$ from $l_1, l_2$, and reproject it as $\tilde{l}_3$ on camera 3 (assuming $\lambda_{23}$), as in Sect. 5. The residual error is the average of the pixel distances between the reprojected (infinite) line $\tilde{l}_3$ and the two endpoints of the detected segment $l_3$, as in [5, 47]. For robustness, we also swap cameras 1 and 3, and define $d_{seg}(\hat{l})$ as the average of both errors.

Sampling. In practice, the total number of features (hypothesized coplanar line pairs, trifocal points, trifocal lines) is generally low enough (usually less than 10,000) for all features to be tried rather than sampled.

7. Parameter-Free Robust Estimation

Rather than depend on fixed arbitrary error thresholds to define feature agreement with a model, we actually resort to an a contrario (AC) approach [24, 25]: we compute the expectation of the number of false alarms (NFA), that measures the statistical meaningfulness (actually the converse, i.e., the insignificance) of $\lambda_{23}$ w.r.t. the features to test, and select the scale $\lambda_{23}$ with the lowest NFA. It allows an automatic optimization of the inlier-outlier threshold, and thus more accurate results too. This is also consistent with the method we use for two-view pose estimation [34].

Coplanar line NFA. For coplanarity, we follow [34] and define a line error from the error for line pairs:

$$d_{co}(L_i, \lambda) = \min_{L_j \text{ coplanar with } L_i} d_{co}(L_i, L_j, \lambda) \tag{14}$$

where $d_{co}(L_i, L_j, \lambda)$ is the coplanarity distance defined in Sect. 6. The NFA for our sampling is then, following [25]:

$$\mathrm{NFA}_{co}(\lambda) = (n_2 - 2) \min_{k \in [3, n_{co}]} n_2 N \binom{n_2}{k-2} \left[ \frac{\pi \, d_{co}(L_k, \lambda)^2}{A} \right]^{k-2} \tag{15}$$

where $n_2$ is the number of lines in camera 2 that have at least a match in camera 1 or 3, $A$ is the image area, and $L_k$ is the $k$-th best inlier (with lowest error).

Trifocal point NFA. For triplets of matched points, following [25], we have:

$$\mathrm{NFA}_{pt}(\lambda) = (n_{pt} - 1) \min_{k \in [2, n_{pt}]} k \binom{n_{pt}}{k} \left[ \frac{\pi \, d_{pt}(\hat{p}_k, \lambda)^2}{A} \right]^{k-1} \tag{16}$$

where $n_{pt}$ is the total number of matched point triplets in cameras 1-2-3, $A$ is the image area, $d_{pt}(\hat{p}, \lambda)$ is the residual error of triplet $\hat{p}$ assuming scale $\lambda$ (cf. Sect. 6), and $\hat{p}_k$ is the triplet with the $k$-th lowest error.

Trifocal line NFA. For triplets of matched line segments, also following [25], we have:

$$\mathrm{NFA}_{seg}(\lambda) = (n_{seg} - 1) \min_{k \in [2, n_{seg}]} k \binom{n_{seg}}{k} \left[ \frac{2 D \, d_{seg}(\hat{l}_k, \lambda)}{A} \right]^{k-1} \tag{17}$$

where $n_{seg}$ is the total number of matched line triplets, $D$ is the diagonal length of the picture, $d_{seg}(\hat{l}, \lambda)$ is the residual error of triplet $\hat{l}$ assuming scale $\lambda$ (cf. Sect. 6), and $\hat{l}_k$ is the triplet with the $k$-th lowest error.

Global NFA. As these NFAs correspond to expectations of independent events, the global NFA is their product:

$$\mathrm{NFA}(\lambda) = \mathrm{NFA}_{co}(\lambda) \cdot \mathrm{NFA}_{pt}(\lambda) \cdot \mathrm{NFA}_{seg}(\lambda). \tag{18}$$

It estimates the insignificance of a candidate ratio $\lambda$ based on coplanarity and trifocal constraints, when any. For robust estimation, we retain the $\lambda$ with the overall lowest NFA. It defines a parameterless AC-RANSAC variant to Sect. 6.

Note that this pose estimation method can be heavily parallelized, not only to evaluate the different constraints but also to estimate the global motion for any three consecutive views, to be later combined in a chain as defined in Sect. 3.

8. Bundle Adjustment

The last step of our SfM method is a bundle adjustment (BA) that refines simultaneously the structure and the poses.

Reconstructed structure. In our context, we could consider as structures not only points and lines, but also planes corresponding to coplanar lines. Indeed, just as tracks of points or lines across images represent single structures, single planes could be associated to sets of lines sharing a coplanarity constraint. More precisely, only coplanar line pairs with similar plane orientations should be clustered together, as a line can belong to two different planes at edges, e.g., where a wall meets another wall, floor or ceiling.

Yet, we observed on real scenes that such a clustering of coplanar lines into single planes tends to degrade the accuracy of pose estimation. Our interpretation is that individual pairs of (real 3D) lines can be coplanar enough to robustly estimate sensible scale ratios but, when grouped in a single plane, their global coplanarity deteriorates.

In fact, contrary to 3D points, that are well determined although possibly misdetected or mismatched in images, there is no such thing in the real world as perfect 3D lines, perfect 3D planes, nor exact line parallelism or coplanarity. (Optical distortion and detection noise just come on top of it.) This is especially true of edges and surfaces in a building, as tolerances of straightness and flatness in the construction industry range typically from 0.2 to 1%. Line-based calibration is thus prone to be less accurate in practice than point-based calibration, and even less when two lines are involved in a feature, as in line coplanarity.

Buildings also present many cases of near collinearity, hence near line coplanarity, due to close edges at the boundary of thin surfaces, e.g., baseboards, conduits, moldings, picture frames, whiteboards, window and door frames, etc. Furniture edges also tend to be almost but not exactly coplanar with wall edges. Due to small errors, such nearly coplanar lines are considered as inliers, but degrade accuracy.

This consideration is consistent with [34] where the authors observe that, in real data, lines that should “logically” be parallel turn out not to be as much as expected, leading to a number of close but different-enough vanishing points (VPs). They found however that treating parallel line pairs independently leads to more accurate calibrations than merging them into a single VP. Similarly, in our bundle adjustment, we do not consider planes as the support of many coplanar lines; line pairs that were determined as coplanar, i.e., RANSAC inliers, are treated individually.

Residuals and optimization. The parameters of our BA are tracked points and lines, as well as camera positions and orientations. The error to minimize is the sum of the squared (pixel) distances of the reprojected points and lines to their detection (cf. Sect. 6), in all the cameras that see (i.e., detect and match) them, plus the sum of the squared coplanarity residuals $d_{co}(L_a, L_b)$ (cf. Sect. 6), for all line pairs $L_a, L_b$ found as inliers, in all images that see both lines.

Concretely, we initialize the bundle adjustment with the pose found by composing the scaled relative motions (cf. Sect. 3) and with triangulated features. As BA can be sensitive to initialization and given that rotations are often better estimated than translations, we first refine the structure and motion with fixed rotations, then refine all parameters (as in [27, 26]). We use the Ceres solver [1] for minimization.

9. Experiments

Feature detection and matching. Points are detected and matched with SIFT [21]. Lines are detected with MLSD [33] and matched with LBD [46]. Both kinds of features are tracked across consecutive pictures to identify match triplets (hence trifocal overlaps), when any. The code is available on GitHub (https://github.com/ySalaun/LineSfM).

Bifocal calibration. We implemented our SfM approach on top of the two-view relative pose estimation of [34]. Besides being parameterless, it has the advantage of robustly combining both line and point features, when any, providing state-of-the-art accuracy even in textureless environments and with wide baselines. When points are not available, it only assumes that both images contain at least two pairs of matched lines that are parallel in 3D; this constraint is most often met in indoor scenes. If not met, our assumption (seeing lines in views 1-2 coplanar with lines seen in views 2-3) is likely not to be met either.

Using method [34] is not intrinsic to our approach. We could have used just as well any other bifocal calibration method, e.g., based on points assuming enough (5 or 7) are available on all image pairs, or based on lines as in [7], although it assumes a Manhattan scene, contrary to [34].

| Scene | RANSAC 0.5 px | RANSAC 1 px | RANSAC 3 px | RANSAC 6 px | RANSAC 9 px | AC-RANSAC |
|---|---|---|---|---|---|---|
| Castle P19 | 146 | 90.5 | 107 | 90 | 196 | 101 |
| Castle P30 | 91 | 83 | 73 | 76 | 144 | 72 |
| Entry P10 | 7.6 | 7.9 | 8.4 | 9.3 | 10.6 | 7.8 |
| Fountain P11 | 2.2 | 2.1 | 2.4 | 2.5 | 3.8 | 2.3 |
| Herz-Jesu P8 | 4.0 | 4.1 | 3.9 | 7.1 | 6.4 | 4.2 |
| Herz-Jesu P25 | 8.5 | 9.0 | 8.6 | 13.3 | 16.9 | 7.9 |
| Average | 43.2 | 32.8 | 33.9 | 33.0 | 63.0 | 32.5 |

| Scene | Points | Lines | Points+lines | Copl. | All | All+BA |
|---|---|---|---|---|---|---|
| Castle P19 | 97.9 | 80 | 98 | 129 | 101 | 22.2 |
| Castle P30 | 80.3 | 111.2 | 78 | 134 | 72 | 23.8 |
| Entry P10 | 7.9 | 12.1 | 8.0 | 8.9 | 7.8 | 5.6 |
| Fountain P11 | 2.3 | 2.5 | 2.3 | 52 | 2.3 | 2.4 |
| Herz-Jesu P8 | 4.1 | 4.3 | 4.1 | 5.4 | 4.2 | 3.8 |
| Herz-Jesu P25 | 8.4 | 15.4 | 8.4 | 36 | 7.9 | 5.4 |
| Average | 33.5 | 37.6 | 33.1 | 60.9 | 32.5 | 10.5 |

Table 1. Left (first table): comparison of RANSAC with a single fixed threshold (in pixels) to AC-RANSAC. Right (second table): contribution of the different kinds of features. The numbers measure the average position error of cameras (in mm) before bundle adjustment. Best results for given scene in bold.

Datasets. We consider both difficult interior scenes and a standard SfM dataset of outdoor scenes with ground truth.

• The indoor dataset pictures various office rooms with little texture and little image overlap. Office-P19 (cf. Fig. 1), Meeting-P31 and Trapezoid-P17 (cf. Fig. 3) consist of cycles of images. Pn means n pictures. Trapezoid-P17 does not belong to a Manhattan world. Resolution is 5184 × 3456. As can be seen on Fig. 1, points are dense on some textured objects, like the

| Scene | Our method | OpenMVG [26] | Olsson [27] | Cui [6] | Arie [2] | Jiang [16] | Bundler [37] | VSfM [43] |
|---|---|---|---|---|---|---|---|---|
| Castle P19 | 22.2 | 25.6 | 76.2 | - | - | - | 344 | 258 |
| Castle P30 | 23.8 | 21.9 | 66.8 | 21.2 | - | 235 | 300 | 522 |
| Entry P10 | 5.6 | 5.9 | 6.9 | - | - | - | 55.1 | 63.0 |
| Fountain P11 | 2.4 | 2.5 | 2.2 | 2.5 | 4.8 | 14 | 7.0 | 7.6 |
| Herz-Jesu P8 | 3.8 | 3.5 | 3.9 | - | - | - | 16.4 | 19.3 |
| Herz-Jesu P25 | 5.4 | 5.3 | 5.7 | 5.0 | 7.8 | 64 | 21.5 | 22.4 |

Table 2. Average position error: comparison with global [26, 27, 6, 2, 16] and incremental [37, 43] SfM methods. Best results in bold.

door, but scarce in other large parts of the scene, e.g., white walls.
• Strecha et al.’s dataset [38] is a de facto standard for assessing the accuracy of SfM. It consists of 6 outdoor scenes with ground truth for camera poses. We consider both the full dataset as well as subsets of images to reduce image overlap. Resolution is 3072 × 2048.

All images have been corrected for radial distortion.

Figure 3. Left: Meeting-P31. Right: Trapezoid-P17 (non-Manhattanness can be seen on the ceiling at room corner).

| Scene | Our method | OpenMVG [26] |
|---|---|---|
| Castle P19 | 0.30 | 0.21 |
| Castle P30 | 0.29 | 0.21 |
| Entry P10 | 0.006 | 0.90 |
| Fountain P11 | 0.26 | 0.19 |
| Herz-Jesu P8 | 0.28 | 0.20 |
| Herz-Jesu P25 | 0.29 | 0.20 |
| Office P19 | 1.07 | 6/20 |
| Meeting P31 | 1.79 | 9/31 |
| Trapezoid P17 | 3.60 | 3/17 |

Table 3. Mean reprojection error of all 3D points of all cameras (in pixels). Entries given as fractions (in red in the original) are the fraction of calibrated cameras in case of failure. Our residual error in indoor scenes is about 4 to 10 times larger than on Strecha’s dataset, but picture resolution is 3 times as large.

RANSAC with and without parameters. Table 1 (left) compares RANSAC with a single fixed threshold for all three kinds of features (cf. Sect. 6) to the parameterless AC-RANSAC (cf. Sect. 7). Although AC-RANSAC does not always yield the best results, it works better on average. For a given scene, a better accuracy could be achieved by setting the different thresholds for each kind of features, but it would not be practical, hence the interest of AC-RANSAC. In the following, all experiments rely on AC-RANSAC.

Contribution of the different kinds of features. Table 1 (right) reports the accuracy of the different kinds of features alone, or when used jointly. When studying coplanarity features alone, the residual includes the reprojection error of coplanar lines (which may or may not be trifocal).

Trifocal line features provide just a little less accuracy than trifocal points, which are prevalent in this textured dataset. Used together, they lead to a slightly better accuracy. Coplanarity features alone yield a little less accuracy than trifocal lines; only Fountain-P11 has a poor calibration with coplanarity, probably because it contains fewer planes than other scenes. Yet, after being merged with other features, coplanarity does not degrade accuracy in general, and often even improves it.

State-of-the-art accuracy. Tab. 2 shows our general approach outperforms incremental point-based SfM methods [37, 43] regarding accuracy, and is on par with state-of-the-art global SfM methods [26, 27, 6, 2, 16]. The wider applicability thus was not traded for accuracy, even in scenarios with dense features and much overlap.

It also shows that our relaxed trifocal constraints retain the most relevant part of trifocality contribution to calibration. Comparing to Tab. 1 (right), we can see that coplanarity constraints alone provide in general comparable or better accuracy than incremental SfM methods [37, 43], except again on Fountain-P11. Residuals are shown in Tab. 3.

We could not compare with line-based SfM methods [47] as their code is unavailable and as they do not measure their accuracy against standard datasets with ground truth.

Figure 4. Reconstructed structure of Castle-P30: points (left), lines (middle), bounding boxes of coplanarity planes after BA (right).

Figure 5. Castle-P(19−3): OpenMVG (left), our method (right).

Figure 6. Only our method can calibrate the very wide baseline and viewpoint change of Herz-Jesu-P3 (furthest 3 images in P8).

Succeeding when others fail.
Our method can calibrate difficult datasets that other methods fail to calibrate:

In Office-P19, some triplets do not have overlapping features (neither points nor lines), which makes it impossible to calibrate with existing SfM methods. Our method succeeds. Fig. 1 illustrates some (consecutive) views of the dataset with line detections, and point, line and plane reconstructions. Apart from a few line outliers due to mismatches, line and plane reconstructions are qualitatively good. We can also calibrate Meeting-P31 and Trapezoid-P17, whereas OpenMVG fails; the residuals and number of calibrated cameras are given in Tab. 3.

Removing 3/19 images from Castle-P19 can be enough to cause other methods to fail to calibrate all cameras, thus also reducing the reconstructed structures and the ability to model the whole scene (cf. missing wall parts in Fig. 5, left). Our method calibrates all cameras, yielding an average location error of 13.2 cm. It is not as good as the state-of-the-art error of 2.3 cm when all 19 images are available (cf. Tab. 2), but still better than Bundler (34.4 cm) and VSfM (25.8 cm). We can actually remove up to 11/19 images and still estimate all camera poses with average error 18.4 cm, which remains better than incremental methods with all 19 images.

Likewise, keeping only 3 images (left-most, middle, right-most) from Herz-Jesu-P8 (Fig. 6) leads other methods to totally fail, while we calibrate these very-wide-baseline images with 6 mm accuracy, the same as full P8.

More generally, disconnected components in the trifocal graph make other SfM methods fail to calibrate all cameras (in the same reference frame), even if the bifocal graph is connected. (For other methods to work, connections in the trifocal graph actually even have to be supported by at least 3 features.) However, provided there are 3D lines in the configuration described in Fig. 2, we can hope to calibrate all cameras in the dataset.

10. Conclusion

We have presented a novel SfM method that exploits lines and coplanarity constraints to estimate camera poses in difficult settings with little image overlap and possibly little or no texture, which cause other methods to fail.

Our experiments show that our method combines the accuracy of usual trifocal constraints, even though we relax them, with the robustness (in the sense of wider applicability) of coplanar constraints. It can thus be blindly used to calibrate scenes that other methods fail to calibrate, while still providing state-of-the-art results on less difficult scenes, with smaller baselines, more overlap or more texture.

The coplanarity constraint we have defined is actually not intrinsically specific to lines. It could apply to points as well, as long as they can be associated with planes, e.g., using a method to fit multiple homographies in two views [22]. Future work also includes the use of coplanarity constraints for dense surface reconstruction, even in textureless scenes.

References

[1] S. Agarwal, K. Mierle, and Others. Ceres solver. http://ceres-solver.org.
[2] M. Arie-Nachimson, S. Z. Kovalsky, I. Kemelmacher-Shlizerman, A. Singer, and R. Basri. Global motion estimation from point matches. In 3DIMPVT, 2012.
[3] F. Arrigoni, B. Rossi, and A. Fusiello. On computing the translations norm in the epipolar graph. In International Conference on 3D Vision (3DV 2015), 2015.
[4] A. Bartoli and P. Sturm. Constrained Structure and Motion from multiple uncalibrated views of a piecewise planar scene. International Journal of Computer Vision (IJCV 2003), 52(1):45–64, 2003.
[5] A. Bartoli and P. F. Sturm. Structure-from-motion using lines: Representation, triangulation, and bundle adjustment. Computer Vision and Image Understanding, 100(3):416–441, 2005.
[6] Z. Cui, N. Jiang, C. Tang, and P. Tan. Linear global translation estimation with feature tracks. In British Machine Vision Conference (BMVC 2015), 2015.
[7] A. Elqursh and A. Elgammal. Line-based relative pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3049–3056, June 2011.
[8] O. Enqvist, F. Kahl, and C. Olsson. Non-sequential structure from motion. In ICCV Workshops, pages 264–271, 2011.
[9] J.-M. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y.-H. Jen, E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys. Building Rome on a cloudless day. In K. Daniilidis, P. Maragos, and N. Paragios, editors, 11th European Conference on Computer Vision (ECCV 2010), pages 368–381. Springer Berlin Heidelberg, 2010.
[10] A. Fusiello, F. Crosilla, and F. Malapelle. Procrustean point-line registration and the NPnP problem. In International Conference on 3D Vision (3DV 2015), pages 250–255, Oct. 2015.
[11] V. Garro, F. Crosilla, and A. Fusiello. Solving the PnP problem with anisotropic orthogonal Procrustes analysis. In International Conference on 3D Imaging, Modeling, Processing, Visualization Transmission (3DIMPVT 2012), pages 262–269, Oct. 2012.
[12] V. M. Govindu. Combining two-view constraints for motion estimation. In Conference on Computer Vision and Pattern Recognition (CVPR 2001), 2001.
[13] V. M. Govindu. Robustness in motion averaging. In ACCV, 2006.
[14] M. Havlena, A. Torii, and T. Pajdla. Efficient Structure from Motion by graph optimization. In 11th European Conference on Computer Vision (ECCV 2010), pages 100–113, Berlin, Heidelberg, 2010.
[15] J. Heinly, J. L. Schönberger, E. Dunn, and J.-M. Frahm. Reconstructing the world* in six days *(as captured by the Yahoo 100 million image dataset). In Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015.
[16] N. Jiang, Z. Cui, and P. Tan. A global linear method for camera pose registration. In IEEE International Conference on Computer Vision (ICCV 2013), pages 481–488, 2013.
[17] N. Jiang, W. Lin, M. N. Do, and J. Lu. Direct structure estimation for 3D reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pages 2655–2663, 2015.
[18] F. Kahl and R. I. Hartley. Multiple-view geometry under the l∞-norm. IEEE Trans. PAMI, 30(9):1603–1617, 2008.
[19] C. Kim and R. Manduchi. Planar structures from line correspondences in a Manhattan world. In 12th Asian Conference on Computer Vision (ACCV 2014), pages 509–524, 2014.
[20] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision (IJCV 2008), 81(2):155–166, 2008.
[21] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, Nov. 2004.
[22] L. Magri and A. Fusiello. T-linkage: A continuous relaxation of J-linkage for multi-model fitting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), pages 3954–3961, 2014.
[23] F. M. Mirzaei and S. I. Roumeliotis. Globally optimal pose estimation from line correspondences. In IEEE International Conference on Robotics and Automation (ICRA 2011), pages 5581–5588. IEEE, 2011.
[24] L. Moisan and B. Stival. A probabilistic criterion to detect rigid point matches between two images and estimate the fundamental matrix. Int. J. Comput. Vision, 57(3):201–218, May 2004.
[25] P. Moulon, P. Monasse, and R. Marlet. Adaptive Structure from Motion with a contrario model estimation. In 11th Asian Conference on Computer Vision (ACCV 2012) - Volume Part IV, ACCV'12, pages 257–270, Berlin, Heidelberg, 2013. Springer-Verlag.
[26] P. Moulon, P. Monasse, and R. Marlet. Global fusion of relative motions for robust, accurate and scalable structure from motion. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3248–3255, 2013.
[27] C. Olsson and O. Enqvist. Stable structure from motion for unordered image collections. In SCIA, pages 524–535, 2011. LNCS 6688.
[28] O. Özyeşil and A. Singer. Robust camera location estimation by convex programming. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), June 2015.
[29] B. Přibyl, P. Zemčík, and M. Čadík. Camera pose estimation from lines using Plücker coordinates. In X. Xie, M. W. Jones, and G. K. L. Tam, editors, British Machine Vision Conference (BMVC 2015), pages 45.1–45.12. BMVA Press, Sept. 2015.
[30] S. Ramalingam, S. Bouaziz, and P. F. Sturm. Pose estimation using both points and lines for geo-localization. In IEEE International Conference on Robotics and Automation (ICRA 2011), pages 4716–4723, 2011.
[31] A. L. Rodríguez, P. E. López-de-Teruel, and A. Ruiz. Reduced epipolar cost for accelerated incremental SfM. In Conference on Computer Vision and Pattern Recognition (CVPR 2011), 2011.
[32] C. Rother. Linear multi-view reconstruction of points, lines, planes and cameras using a reference plane. In 9th IEEE International Conference on Computer Vision (ICCV 2003) - Volume 2, ICCV '03, pages 1210–1217, Washington, DC, USA, 2003. IEEE Computer Society.
[33] Y. Salaün, R. Marlet, and P. Monasse. Multiscale line segment detector for robust and accurate SfM. In 23rd International Conference on Pattern Recognition (ICPR), 2016.
[34] Y. Salaün, R. Marlet, and P. Monasse. Robust and accurate line- and/or point-based pose estimation without Manhattan assumptions. In European Conference on Computer Vision (ECCV), 2016.
[35] G. C. Sharp, S. W. Lee, and D. K. Wehe. Toward multiview registration in frame space. In IEEE International Conference on Robotics and Automation (ICRA 2001), 2001.
[36] K. Sim and R. Hartley. Recovering camera motion using l∞ minimization. In Conference on Computer Vision and Pattern Recognition (CVPR 2006), volume 1, 2006.
[37] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. ACM Transactions on Graphics (TOG 2006), 25(3):835–846, July 2006.
[38] C. Strecha, W. von Hansen, L. Van Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pages 1–8, June 2008.
[39] P. F. Sturm. Algorithms for plane-based pose estimation. In Conference on Computer Vision and Pattern Recognition (CVPR 2000), pages 1706–1711, 2000.
[40] C. Sweeney, T. Sattler, T. Hollerer, M. Turk, and M. Pollefeys. Optimizing the viewing graph for Structure-from-Motion. In IEEE International Conference on Computer Vision (ICCV 2015), pages 801–809, Washington, DC, USA, 2015.
[41] R. Toldo, R. Gherardi, M. Farenzena, and A. Fusiello. Hierarchical structure-and-motion recovery from uncalibrated images. Computer Vision and Image Understanding (CVIU 2015), 140(C):127–143, Nov. 2015.
[42] K. Wilson and N. Snavely. Robust global translations with 1DSfM. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, 13th European Conference on Computer Vision (ECCV 2014), pages 61–75, 2014.
[43] C. Wu. Towards linear-time incremental structure from motion. In International Conference on 3D Vision (3DV 2013), pages 127–134, 2013.
[44] C. Xu, L. Zhang, L. Cheng, and R. Koch. Pose estimation from line correspondences: A complete analysis and a series of solutions. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI 2016), June 2016.
[45] C. Zach, M. Klopschitz, and M. Pollefeys. Disambiguating visual relations using loop constraints. In Conference on Computer Vision and Pattern Recognition (CVPR 2010), 2010.
[46] L. Zhang and R. Koch. An efficient and robust line segment matching approach based on LBD descriptor and pairwise geometric consistency. J. Vis. Comun. Image Represent., 24(7):794–805, Oct. 2013.
[47] L. Zhang and R. Koch. Structure and motion from line correspondences: Representation, projection, initialization and sparse bundle adjustment. Journal of Visual Communication and Image Representation, 25(5):904–915, 2014.
[48] L. Zhang, C. Xu, K.-M. Lee, and R. Koch. Robust and efficient pose estimation from line correspondences. In K. M. Lee, Y. Matsushita, J. M. Rehg, and Z. Hu, editors, 11th Asian Conference on Computer Vision (ACCV 2012), pages 217–230. Springer, 2012.
[49] Y. Zheng, Y. Kuang, S. Sugimoto, K. Åström, and M. Okutomi. Revisiting the PnP problem: A fast, general and optimal solution. In IEEE International Conference on Computer Vision (ICCV 2013), pages 2344–2351, Dec. 2013.
[50] Z. Zhou, H. Jin, and Y. Ma. Robust plane-based Structure from Motion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), pages 1482–1489, Washington, DC, USA, 2012.