
Multiple Instance Learning with Manifold Bags

Boris Babenko¹ [email protected]
Nakul Verma¹ [email protected]
Piotr Dollár² [email protected]
Serge Belongie¹ [email protected]

¹University of California, San Diego, CA
²California Institute of Technology, Pasadena, CA

Abstract

In many machine learning applications, labeling every instance of data is burdensome. Multiple Instance Learning (MIL), in which training data is provided in the form of labeled bags rather than labeled instances, is one approach for a more relaxed form of supervised learning. Though much progress has been made in analyzing MIL problems, existing work considers bags that have a finite number of instances. In this paper we argue that in many applications of MIL (e.g. image, audio, etc.) the bags are better modeled as low dimensional manifolds in high dimensional feature space. We show that the geometric structure of such manifold bags affects PAC-learnability. We discuss how a learning algorithm that is designed for finite sized bags can be adapted to learn from manifold bags. Furthermore, we propose a simple heuristic that reduces the memory requirements of such algorithms. Our experiments on real-world data validate our analysis and show that our approach works well.

1. Introduction

Traditional supervised learning requires example/label pairs during training. However, in many domains labeling every single instance of data is either tedious or impossible. The Multiple Instance Learning framework (MIL), introduced by Dietterich et al. (1997), provides a general paradigm for a more relaxed form of supervised learning: instead of receiving example/label pairs, the learner gets unordered sets of instances, or bags, and labels are provided for each bag rather than for each instance. A bag is labeled positive if it contains at least one positive instance. In recent years MIL has received significant attention in terms of both algorithm design and applications (Maron & Lozano-Perez, 1998; Andrews et al., 2002; Zhang & Goldman, 2002; Viola et al., 2005).

Theoretical PAC-style analysis of MIL problems has also seen progress in the last decade (Auer et al., 1997; Blum & Kalai, 1998; Long & Tan, 1998; Sabato & Tishby, 2009; Sabato et al., 2010). Typical analysis formulates the MIL problem as follows: a fixed number of instances, r, is drawn from an instance space I to form a bag. The sample complexity for bag classification is then analyzed in terms of the bag size (r). Most of the work has focused on reducing the dependence on r under various settings. For example, Blum & Kalai (1998) showed that if one has access to a noise tolerant learner and the bags are formed by drawing r independent samples from a fixed distribution over I, then the sample complexity grows linearly with r. Recently, Sabato & Tishby (2009) showed that if one can minimize the empirical error on bags, then even if the instances in a bag have arbitrary statistical dependence, sample complexity grows only logarithmically with r.

The above line of work is rather restrictive. Any dependence on r makes it impossible to apply these generalization bounds to problems where bags have infinitely many instances – a typical case in practice. Consider the following motivating example: we would like to predict whether an image contains a face (as in Viola et al., 2005). Putting this in the MIL framework, a bag is an entire image, which is labeled positive if and only if there is a face in that image. The individual instances are image patches. Notice that in this scenario the instances collectively form (a discrete approximation to) a low-dimensional manifold; see Figure 1. Here we expect the sample complexity to scale with the geometric properties of the underlying manifold bag rather than the number of instances per bag.

Figure 1. Manifold Bags: In this example the task is to predict whether an image contains a face. Each bag is an image, and individual instances are image patches of a fixed size. Examples of two positive bags b_1 and b_2 (left), and a visualization of the instance space I (right, with negative and positive regions) are shown. The two bags trace out low-dimensional manifolds in I; in this case each manifold is two dimensional since there are two degrees of freedom (the x and y location of the image patch). The green regions on the manifolds indicate the portion of the bags that is positive.

This situation arises in many other MIL applications where some type of sliding window is used to break up an object into many overlapping pieces: images (Andrews et al., 2002; Viola et al., 2005), video (Ali & Shah, 2008; Buehler et al., 2009), audio (Saul et al., 2001; Mandel & Ellis, 2008), and sensor data (Stikic & Schiele, 2009). Consider also the original molecule classification task that motivated Dietterich et al. (1997) to develop MIL, where a bag corresponds to a molecule, and instances are different shapes that molecule can assume. Even in this application, "as the molecule changes its shape, it traces out a manifold through [feature] space" (Maron & Lozano-Perez, 1998). Thus, manifold structure is an integral aspect of these problems that needs to be taken into account in MIL analysis and algorithm design.

In this work we analyze the MIL framework for bags containing potentially infinitely many instances. In this setting a bag is drawn from a bag distribution, and is labeled positive if it contains at least one positive instance. In order to have a tractable analysis, we impose a structural constraint on the bags: we assume that bags are low dimensional manifolds in the instance space, as discussed above. We show that the geometric structure of such bags is intimately related to the PAC-learnability of MIL problems. We investigate how learning is affected if we have access to only a limited number of instances per manifold bag. We then discuss how existing MIL algorithms, that are designed for finite sized bags, can be adapted to learn from manifold bags efficiently using an iterative querying heuristic. Our experiments on real-world data (image and audio) validate the intuition of our analysis and show that our querying heuristic works well in practice.

2. Problem Formulation and Analysis

Let I be the domain of instances (for the purposes of our discussion we assume it to be R^N for some large N), and let B be the domain of bags. Here we impose a structural constraint on B: each bag from B is a low dimensional manifold over the instances of I. More formally, each bag b ∈ B is a smooth bijection¹ from [0, 1]^n to some subset of I (n ≪ N). The geometric properties of such bags are integral to our analysis. We will thus do a quick review of the various properties of manifolds that will be useful in our discussion.

¹Here we are only considering a restricted class of manifolds – those that are globally diffeomorphic to [0, 1]^n. This is only done for convenience. The results here are generalizable to arbitrary (compact) manifolds.

2.1. Differential Geometry Basics

Let f be a smooth bijective mapping from [0, 1]^n to M ⊂ R^N. We call the image of f (i.e. M) a manifold (note that M is compact and has a boundary). The dimension of the domain (n in our case) corresponds to the latent degrees of freedom and is typically referred to as the intrinsic dimension of the manifold M (denoted by dim(M)).

Since one of the key quantities in classic analysis of MIL is the bag size, we require a similar quantity to characterize the "size" of M. One natural way to characterize this is in terms of the volume of M. The volume (denoted by vol(M)) is given by the quantity ∫_{u_1,...,u_n} sqrt(det(JᵀJ)) du_1 · · · du_n, where J is the N × n Jacobian matrix of the mapping f, with individual entries defined as J_ij := ∂f_i/∂u_j.
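To make the volume formula above concrete, the following sketch (not from the original paper) numerically approximates vol(M) for a one-dimensional bag by averaging sqrt(det(JᵀJ)) over a grid of latent parameters; the quarter-circle bag is a made-up example and the finite-difference Jacobian is a simplification.

```python
import numpy as np

def numerical_jacobian(f, u, eps=1e-6):
    """Finite-difference N x n Jacobian of f : [0,1]^n -> R^N at parameter u."""
    u = np.asarray(u, dtype=float)
    f0 = np.asarray(f(u))
    J = np.zeros((f0.size, u.size))
    for j in range(u.size):
        du = np.zeros_like(u)
        du[j] = eps
        J[:, j] = (np.asarray(f(u + du)) - f0) / eps
    return J

def curve_volume(f, grid_points=500):
    """Approximate vol(M) = integral over [0,1] of sqrt(det(J^T J)) du for a 1-D bag."""
    us = (np.arange(grid_points) + 0.5) / grid_points
    total = 0.0
    for u in us:
        J = numerical_jacobian(f, np.array([u]))
        total += np.sqrt(np.linalg.det(J.T @ J))
    return total / grid_points

if __name__ == "__main__":
    # Made-up example bag: a quarter circle of radius 2 embedded in R^3; its volume (length) is pi.
    bag = lambda u: np.array([2 * np.cos(0.5 * np.pi * u[0]),
                              2 * np.sin(0.5 * np.pi * u[0]),
                              0.0])
    print(curve_volume(bag))  # ~3.1416
```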

Unlike a finite size bag, a finite volume manifold bag M ⊂ R^N can be arbitrarily "complex" – it can twist and turn in all sorts of ways in the surrounding space. We therefore need to also get a handle on its curviness. Borrowing the notation from the computational geometry literature, we can characterize the complexity of M via its condition number (see Niyogi et al., 2006). We say that the condition number of M (denoted by cond(M)) is 1/τ, if τ is the largest number such that the normals of length r < τ at any two distinct points in M don't intersect. One can bound the sectional curvature of M at any point by 1/τ. Hence, when τ is large, the manifold is relatively flat and vice versa.

With these definitions, we can define a structured family of bag spaces.

Definition 1 We say that a bag space B belongs to class (V, n, τ), if for every b ∈ B, we have² that dim(b) = n, vol(b) ≤ V, and cond(b) ≤ 1/τ.

²Technically b is a function and not a manifold. For readability, we will occasionally abuse the notation and use b to mean the manifold produced by the image of b in the instance space I.

In what follows, we will assume that B belongs to class (V, n, τ). We now provide our main results, with all the supporting proofs in the Appendix.

2.2. Learning with Manifold Bags

Since we are interested in PAC-style analysis, we will be working with a fixed hypothesis class H over the instance space I (that is, each h ∈ H is of the form h : I → {0, 1}). The corresponding bag hypothesis class H̄ over the bag space B (where each h̄ ∈ H̄ is of the form h̄ : B → {0, 1}) is defined as the set of classifiers {h̄ : h ∈ H} where, for any b ∈ B, h̄(b) := max_{α∈[0,1]^n} h(b(α)). We assume that there is some unknown instance classification rule h* : I → {0, 1} that gives the true labels for all instances.

The learner gets access to m bag/label pairs (b_i, y_i), i = 1, . . . , m, where each bag b_i is drawn independently from an unknown but fixed distribution D_B over B, and is labeled according to the MIL rule y_i := max_{α∈[0,1]^n} h*(b_i(α)). We denote a sample of size m as S_m.

Our learner should ideally return the hypothesis h̄ that achieves the lowest bag generalization³ error: err(h̄) := E_{b∼D_B}[h̄(b) ≠ y]. This, of course, is not possible as the learner typically does not have access to the underlying data distribution D_B. Instead, the learner has access to the sample S_m, and can minimize the empirical error: êrr(h̄, S_m) := (1/m) Σ_{i=1}^m 1{h̄(b_i) ≠ y_i}. Various PAC results relate these two quantities in terms of the properties of H̄.

³One can also talk about the generalization error over instances. As noted in previous work (e.g. Sabato & Tishby, 2009), PAC analysis of the instance error typically requires stronger assumptions.
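As a concrete (if crude) illustration of the bag-labeling rule h̄(b) = max_{α∈[0,1]^n} h(b(α)) defined above, the following sketch approximates the max over a finite grid of the latent parameter; the halfspace classifier and the curve are toy examples of ours, not part of the original work.

```python
import numpy as np

def bag_classify(bag, instance_classifier, grid_points=100):
    """Approximate h_bar(b) = max over alpha in [0,1] of h(b(alpha)).

    `bag` maps a latent parameter alpha in [0,1] to an instance in R^N; here the
    bag can only be evaluated on a finite grid, so the max is approximate.
    """
    alphas = np.linspace(0.0, 1.0, grid_points)
    return max(instance_classifier(bag(a)) for a in alphas)

if __name__ == "__main__":
    # Illustrative instance hypothesis: a halfspace h(x) = 1{w.x + c > 0}.
    w, c = np.array([1.0, -1.0]), -0.5
    h = lambda x: int(np.dot(w, x) + c > 0)

    # A toy one-dimensional bag tracing a curve in R^2.
    b = lambda alpha: np.array([alpha, np.sin(2 * np.pi * alpha)])

    print(bag_classify(b, h))  # 1 if any point of the curve falls in the positive halfspace
```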
Perhaps the most obvious way to bound err(h̄) in terms of êrr(h̄, S_m) is by analyzing the VC-dimension of the bag hypotheses, VC(H̄), and applying the standard VC-bounds (see e.g. Vapnik & Chervonenkis, 1971). While finding the VC-dimension of the bag hypothesis class is non-trivial, the VC-dimension of the corresponding instance hypotheses, VC(H), is well known for many popular choices of H. Sabato & Tishby (2009) showed that for finite sized bags the VC-dimension of bag hypotheses (and thus the generalization error) can be bounded in terms of the VC-dimension of the underlying instance hypotheses. Although one might hope that this analysis could be carried over to bags of infinite size that are well structured, this turns out to not be the case.

2.2.1. VC(H̄) is Unbounded for Arbitrarily Smooth Manifold Bags

We begin with a surprising result which goes against our intuition that requiring bag smoothness should suffice in bounding VC(H̄). We demonstrate that requiring the bags to be low-dimensional, arbitrarily flat manifolds with fixed volume is not enough to get a handle on generalization error even for one of the simplest instance hypothesis classes (the set of hyperplanes in R^N). In particular,

Theorem 2 For any V > 0, n ≥ 1, τ < ∞, let B contain all manifolds M such that dim(M) = n, vol(M) ≤ V, and cond(M) ≤ 1/τ (i.e. B is the largest member of class (V, n, τ)). Let H be the set of hyperplanes in R^N (N > n). Then for any m ≥ 1, there exists a set of m bags b_1, . . . , b_m ∈ B, such that the corresponding bag hypothesis class H̄ (over the bag space B) realizes all possible 2^m labelings.

Thus, VC(H̄) is unbounded, making PAC-learnability seemingly impossible. To build intuition for this apparent richness of H̄, and possible alternatives to bound the generalization error, let us take a quick look at the case of one-dimensional manifolds in R² with halfspaces as our H. For any m, we can place a set of m manifold bags in such a way that all labelings are realizable by H̄ (see Fig. 2 for an example where m = 3; see Appendix A.1 for a detailed construction). The key observation is that in order to label a bag positive, the instance hypothesis needs to label just a single instance in that bag positive. Considering that our bags have an infinite number of points, the positive region can occupy an arbitrarily small fraction of a positively labeled bag. This gives our bag hypotheses immense flexibility even when the underlying instance hypotheses are quite simple.

Figure 2. Bag hypotheses over manifold bags have unbounded VC-dimension. Three bags (colored blue, green and red) go around the eight anchor points (shown as black dots) that are arranged along a section of a circle. Notice that the hyperplanes tangent to the anchor points achieve all possible bag labelings. The hypothesis h shown above, for example, labels the red and blue bags positive, and the green bag negative.

It seems that to bound err(h̄) we must ensure that a non-negligible portion of a positive bag be labeled positive. A natural way of accomplishing this is to use a real-valued version of the instance hypothesis class (i.e., classifiers of the form h_r : I → [0, 1], with labels determined by thresholding), and requiring that functions in this class (a) be smooth, and (b) label a positive bag with a certain margin. To understand why these properties are needed, consider three ways that h_r can label the instances of a positive bag b as one varies the latent parameter α (the x-axis corresponds to instances, the y-axis to classifier output):

[Three-panel illustration: (left) smooth but with only a small margin around the threshold of 1/2; (center) a large margin but rapidly changing output; (right) both smooth and with a margin.]

In both the left and center panels, h_r labels only a tiny portion of the bag positive: in the first case h_r barely labels any instance above the threshold of 1/2, resulting in a small margin; in the second case, although the margin is large, h_r changes rapidly along the bag. Finally, in the right panel, when both the margin and smoothness conditions are met, a non-negligible portion of b is labeled positive. We shall thus study how to bound the generalization error in this setting.

2.2.2. Learning with a Margin

Let H_r be the real-valued relaxation of H (i.e. each h_r ∈ H_r is now of the form h_r : I → [0, 1]). In order to ensure smoothness we impose a λ-Lipschitz constraint on the instance hypotheses: ∀h_r ∈ H_r, x, x' ∈ I, |h_r(x) − h_r(x')| ≤ λ‖x − x'‖_2. We denote the corresponding bag hypothesis class as H̄_r. Note that the true bag labels are still binary in this setting (i.e. determined by h*).

Similar to the VC-dimension, the "fat-shattering dimension" of a real-valued bag hypothesis class, fat_γ(H̄_r), relates the generalization error to the empirical error at margin γ (see for example Anthony & Bartlett, 1999):

êrr_γ(h̄_r, S_m) := (1/m) Σ_{i=1}^m 1{margin(h̄_r(b_i), y_i) < γ},   (1)

where margin(x, y) := x − 1/2 if y = 1, and 1/2 − x otherwise.
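As a concrete reading of Eq. (1), here is a minimal sketch that computes the empirical margin error from real-valued bag scores; the scores, labels, and function names are toy values of ours.

```python
import numpy as np

def margin(score, label):
    """margin(x, y) from Eq. (1): x - 1/2 for positive bags, 1/2 - x otherwise."""
    return score - 0.5 if label == 1 else 0.5 - score

def empirical_margin_error(bag_scores, bag_labels, gamma):
    """err_gamma_hat: fraction of bags whose margin falls below gamma (Eq. (1))."""
    violations = [margin(s, y) < gamma for s, y in zip(bag_scores, bag_labels)]
    return float(np.mean(violations))

if __name__ == "__main__":
    scores = [0.9, 0.55, 0.2, 0.45]   # h_bar_r(b_i) for four bags
    labels = [1, 1, 0, 0]             # ground-truth bag labels y_i
    print(empirical_margin_error(scores, labels, gamma=0.1))  # 0.5 in this toy example
```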

Recall that it was not possible to bound generalization error in terms of the instance hypotheses using VC dimension. However, analogous to Sabato & Tishby's analysis of finite size bags (2009), we can bound generalization error for manifold bags in terms of the fat-shattering dimension of instance hypotheses, fat_γ(H_r). In particular, we have the following:

Theorem 3 Let B belong to class (V, n, τ). Let H_r be λ-Lipschitz smooth (w.r.t. ℓ2-norm), and H̄_r be the corresponding bag hypotheses over B. Pick any 0 < γ < 1 and m ≥ fat_{γ/16}(H_r) ≥ 1. For any 0 < δ < 1, we have with probability at least 1 − δ over an i.i.d. sample S_m (of size m), for every h̄_r ∈ H̄_r:

err(h̄_r) ≤ êrr_γ(h̄_r, S_m) + O( sqrt( (n fat_{γ/16}(H_r)/m) log²(Vm/(γ² τ_0^n)) + (1/m) ln(1/δ) ) ),   (2)

where τ_0 = min{τ/2, γ/8, γ/(8λ)}.
Observe that the complexity term in Eq. (2) is independent of the "bag size"; it has instead been replaced by the volume and other geometric properties of the manifold bags. The other term captures the sample error for individual hypotheses at margin γ. Thus a natural strategy for a learner is to return a hypothesis that minimizes the empirical error while maximizing the margin.

2.3. Learning from Queried Instances

So far we have analyzed the MIL learner as a black box entity, which can minimize the empirical bag error by somehow accessing the bags. Since the individual bags in our case are low-dimensional manifolds (with an infinite number of instances), we must also consider how these bags are accessed by the learner. Perhaps the simplest approach is to query ρ instances uniformly from each bag, thereby "reducing" the problem to standard MIL (with finite size bags) for which there are algorithms readily available (e.g. Maron & Lozano-Perez, 1998; Andrews et al., 2002; Zhang & Goldman, 2002; Viola et al., 2005).
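The uniform querying strategy just described can be sketched as follows; representing a bag by its parameterization b : [0,1]^n → R^N is an assumption of this sketch, and drawing α uniformly from the latent cube is only approximately uniform over the manifold unless the parameterization has constant speed.

```python
import numpy as np

def query_instances(bag, n_latent, rho, rng=None):
    """Query rho instances from a manifold bag by sampling its latent parameter."""
    rng = np.random.default_rng() if rng is None else rng
    alphas = rng.random((rho, n_latent))
    return np.array([bag(a) for a in alphas])

if __name__ == "__main__":
    b = lambda alpha: np.array([alpha[0], np.sin(2 * np.pi * alpha[0])])  # toy 1-D bag in R^2
    instances = query_instances(b, n_latent=1, rho=5, rng=np.random.default_rng(0))
    print(instances.shape)  # (5, 2): a finite-size bag ready for a standard MIL learner
```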
More formally, for a bag sample S_m, let p_1^i, . . . , p_ρ^i be ρ independent instance samples drawn uniformly from the (image of) bag b_i ∈ S_m, and let S_{m,ρ} := ∪_{i,j} p_j^i be the set of all instances. Assuming that our manifold bags have well-conditioned boundaries, the following theorem relates the empirical error of sampled bags, êrr_γ(h̄_r, S_{m,ρ}) := (1/m) Σ_{i=1}^m 1{margin(max_{j∈[ρ]} h_r(p_j^i), y_i) < γ}, to the generalization error.

Theorem 4 Let B belong to class (V, n, τ). Let H_r be λ-Lipschitz smooth (w.r.t. ℓ2-norm), and H̄_r be the corresponding bag hypotheses over B. Pick any 0 < δ_1, δ_2 < 1; then with probability at least 1 − δ_1 − δ_2 over the draw of m bags (S_m) and ρ instances per bag (S_{m,ρ}), for all h̄_r ∈ H̄_r we have the following. Let 1/κ := max_{b_i∈S_m}{cond(∂b_i)} (where ∂b_i is the boundary of the manifold bag b_i) and set τ_1 = min{τ/32, κ/8, γ/(9λ), γ/9}. If

ρ ≥ Ω( (V/τ_1^{c_0 n}) ( n + ln( mV/(δ_2 τ_1^n) ) ) ),

then

err(h̄_r) ≤ êrr_{2γ}(h̄_r, S_{m,ρ}) + O( sqrt( (n fat_{γ/16}(H_r)/m) log²(Vm/(γ² τ_0^n)) + (1/m) ln(1/δ_1) ) ),

where τ_0 = min{τ/2, γ/8, γ/(8λ)} and c_0 is an absolute constant.

Notice the effect of the two key parameters in the above theorem: the number of training bags, m, and the number of queried instances per bag, ρ. Increasing either quantity improves generalization – increasing m drives down the error (via the complexity term), while increasing ρ helps improve the confidence (via δ_2). While ideally we would like both quantities to be large, increasing these parameters is, of course, computationally burdensome for a standard MIL learner. Note, however, the difference between m and ρ: increasing m comes at an additional cost of obtaining extra labels, whereas increasing ρ does not. We would therefore like an algorithm that can take advantage of using a large ρ while avoiding computational costs.

2.3.1. Iterative Querying Heuristic

As we saw in the previous section, we would ideally like to train with a large number of queried instances, ρ, per training bag. However, this may be impractical in terms of both speed and memory constraints. Suppose we have access to a black box MIL algorithm A that can only train with ρ̂ < ρ instances per bag at once. We propose a procedure called Iterative Querying Heuristic (IQH), described in detail in Algorithm 1.

Algorithm 1 Iterative Querying Heuristic (IQH)
Input: Training bags (b_1, . . . , b_m), labels (y_1, . . . , y_m), parameters T, ω and ρ̂
1: Initialize I_i^0 = ∅, h_r^0 as any classifier in H_r.
2: for t = 1, . . . , T do
3:   Query ω new candidate instances per bag: Z_i^t := I_i^{t−1} ∪ {p_1^i, . . . , p_ω^i}, where p_j^i ∼ b_i, ∀i.
4:   Keep the ρ̂ highest scoring instances using h_r^{t−1}: I_i^t ⊂ Z_i^t s.t. |I_i^t| = ρ̂ and h_r^{t−1}(p) ≥ h_r^{t−1}(p') for all p ∈ I_i^t, p' ∈ Z_i^t \ I_i^t.
5:   Train h̄_r^t with the selected instances: h̄_r^t ← A({I_1^t, . . . , I_m^t}, {y_1, . . . , y_m}).
6: end for
7: Return h̄_r^T and the corresponding h_r^T

Notice that IQH uses a total of T ρ̂ instances per bag for training (T iterations times ρ̂ instances per iteration). Thus, setting T ≈ ρ/ρ̂ should achieve performance comparable to using ρ instances at once. The free parameter ω controls how many new instances are considered in each iteration.

The intuition behind IQH is as follows. For positive bags, we want to ensure that at least one of the queried instances is positive; hence we use the current estimate of the classifier to select the most positive instances. For negative bags, we know all instances are negative. In this case we select the instances that are closest to the decision boundary of our current classifier (corresponding to the most difficult negative instances); the motivation for this is similar to bootstrapping negative examples (Felzenszwalb et al., 2009) and some active learning techniques (Cohn et al., 1994). We then use these selected instances to find a better classifier. Thus one expects IQH to take advantage of a large number of instances per bag, without actually having to train with all of them at one time.
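Below is a compact Python sketch of Algorithm 1. The black-box trainer train_mil and its score method stand in for whatever MIL algorithm A is plugged in (e.g. MILBoost or MI-SVM); their interface is an assumption of this sketch, not part of the original algorithm description.

```python
import numpy as np

def iqh(bags, labels, train_mil, T, omega, rho_hat, n_latent, rng=None):
    """Iterative Querying Heuristic (Algorithm 1), sketched.

    `bags`      : list of callables, each mapping alpha in [0,1]^n to an instance
    `train_mil` : black-box MIL trainer A; takes (list of instance arrays, labels)
                  and returns a model with a `score(instance)` method (assumed API)
    """
    rng = np.random.default_rng() if rng is None else rng
    selected = [np.empty((0,)) for _ in bags]   # I_i^0 = empty set
    model = None                                # h_r^0: an arbitrary initial classifier

    for t in range(T):
        candidates = []
        for i, bag in enumerate(bags):
            # Step 3: query omega new candidate instances per bag.
            new = np.array([bag(rng.random(n_latent)) for _ in range(omega)])
            Z = new if selected[i].size == 0 else np.vstack([selected[i], new])
            # Step 4: keep the rho_hat highest-scoring instances under h_r^{t-1}.
            if model is None:
                scores = rng.random(len(Z))     # scores from the arbitrary initial classifier
            else:
                scores = np.array([model.score(z) for z in Z])
            keep = np.argsort(scores)[::-1][:rho_hat]
            candidates.append(Z[keep])
        # Step 5: train h_r^t on the selected instances.
        model = train_mil(candidates, labels)
        selected = candidates

    return model  # h_r^T (and, implicitly, the corresponding bag classifier)
```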

3. Experiments

Recall that we have shown that the generalization error is bounded in terms of key geometric properties of the manifold bags, such as curvature (1/τ) and volume (V). Here we will experimentally validate that generalization error does indeed scale with these quantities, providing an empirical lower bound. Additionally, we study how the choice of ρ affects the error, and show that our Iterative Querying Heuristic (IQH) is effective in reducing the number of instances needed to train in each iteration. In all our experiments we use a boosting algorithm for MIL called MILBoost (Viola et al., 2005) as the black box A; additional experiments with the MI-SVM algorithm (Andrews et al., 2002) are available in Appendix C. Both algorithms show similar trends and we expect the same for any other choice of A. Note that we use IQH only where specified.

Figure 3. Synthetic Data Results: Examples of four synthetically generated bags in R² with (A) low curvature and (B) high curvature. (C) and (D): Test error scales with the manifold parameters: volume (V), curvature (1/τ), and dimension (n).

3.1. Synthetic Data

We begin with a carefully designed synthetic dataset, where we have complete control over the manifold curvature, volume and dimension, and study its effects on the generalization. The details on how we generate the dataset are provided in Appendix B; see Figure 3 (A) and (B) for examples of the generated manifolds.

For the first set of experiments, we study the interplay between the volume and curvature while keeping the manifold dimension fixed. Here we generated one-dimensional manifolds of specified volume (V) and curvature (1/τ) in R². We set h* to be a vertical hyperplane and labeled the samples accordingly (see Appendix B). For training, we used 10 positive and 10 negative bags with 500 queried instances per bag (forming a good cover); for testing we used 100 bags. Figure 3 (C) shows the test error, averaged over 50 trials, as we vary these parameters. Observe that for a fixed V, as we increase 1/τ (making the manifolds more curvy) generalization error goes up.

For the next set of experiments, we want to understand how manifold dimensionality affects the error. Here we set the ambient dimension to 10 and varied the manifold dimension (with all other experiment settings as before). Figure 3 (D) shows how the test error scales for different dimensional bags as we vary the volume (1/τ set to 1). These results corroborate the general intuition of our analysis, and give an empirical verification that the error indeed scales with the geometric properties of a manifold bag.

3.2. Real Data

In this section we present results on image and audio datasets. We will see that the generalization behavior is consistent with our analysis across these different domains. We also study the effects of varying ρ on generalization error, and see how using IQH helps achieve similar error rates with fewer instances per iteration.

Figure 4. INRIA Heads: for our experiments we have labeled the heads in the INRIA Pedestrian Dataset (Dalal & Triggs, 2005). We can construct bags of different volume by padding the head region. The figure shows positive bags for two different amounts of padding (pad = 16 and pad = 32).

INRIA Heads. For these experiments we chose the task of head detection (e.g. positive bags are images which contain at least one head). We used the INRIA Pedestrian Dataset (Dalal & Triggs, 2005), which contains both pedestrian and non-pedestrian images, to create an INRIA Heads dataset as follows. We manually labeled the location of the head in the pedestrian images. The images were resized such that the size of the head is roughly 24 × 24 pixels; therefore, instances in this experiment are image patches of that size. For each image patch we computed Haar-like features on various channels as in (Dollár et al., 2009), which corresponds to our instance space I. Using the ground truth labels, we generated 2472 positive bags by cropping out the head region with different amounts of padding (see Figure 4), which corresponds to changing the volume of the manifold bags. For example, padding by 6 pixels results in a bag that is a 30 × 30 pixel image. To generate negative bags we cropped 2000 random patches from the non-pedestrian images, as well as non-head regions from the pedestrian images. Unless otherwise specified, padding was set to 16.
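The following sketch illustrates how a padded positive bag and its 24 × 24 sliding-window instances can be formed from an annotated image, as described above; the feature extraction step (Haar-like channel features) is left abstract, and all names here are illustrative, not from any released code.

```python
import numpy as np

PATCH = 24  # instance size in pixels, as described above

def make_bag(image, head_x, head_y, pad):
    """Crop a (24 + 2*pad)-pixel square bag around an annotated head location."""
    size = PATCH + 2 * pad
    x0, y0 = head_x - pad, head_y - pad
    return image[y0:y0 + size, x0:x0 + size]

def bag_instances(bag_img, step=1):
    """Enumerate all 24 x 24 sliding-window patches of a bag (its instances)."""
    h, w = bag_img.shape[:2]
    return [bag_img[y:y + PATCH, x:x + PATCH]
            for y in range(0, h - PATCH + 1, step)
            for x in range(0, w - PATCH + 1, step)]

if __name__ == "__main__":
    img = np.zeros((200, 200), dtype=np.uint8)     # stand-in for a resized pedestrian image
    bag = make_bag(img, head_x=90, head_y=40, pad=16)
    print(bag.shape, len(bag_instances(bag)))      # (56, 56) and 33*33 = 1089 instances
```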

Figure 5. Image and Audio Results: three different experiments (columns) – varying padding (volume), number of queried instances, and number of IQH iterations – on two different datasets (rows: INRIA Heads and TIMIT Phonemes); see text for details. Note that x-axes are in logarithmic scale. All reported results are averages over 5 trials.

TIMIT Phonemes. Our other application is in the audio domain, and is analogous to the image data described above. The task here was to detect whether a particular phoneme is spoken in an audio clip (we arbitrarily chose the phoneme "s" to be the positive class). We used the TIMIT dataset (Garofolo et al., 1993), which contains recordings of over 600 speakers reading text; the dataset also contains phoneme annotations. Bags in this experiment are audio clips, and instances are audio pieces of length 0.2 seconds (i.e. this is the size of our sliding window). As in the image experiments, we had ground truth annotation for instances, and generated bags of various volumes by padding. We computed features as follows: we split each sliding window into 25 millisecond pieces, computed Mel-frequency cepstral coefficients (MFCC) (Davis & Mermelstein, 1980; Ellis, 2005) for each piece, and concatenated them to form a 104 dimensional feature vector for each instance. The reported padding amounts are in terms of a 5 millisecond step size (e.g., padding of 8 corresponds to 40 milliseconds of concatenation). Unless otherwise specified, padding was set to 64.
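A sketch of the instance feature pipeline described above: each 0.2 s window is split into 25 ms pieces whose MFCC vectors are concatenated. The helper mfcc_of_piece is a placeholder for any MFCC implementation (e.g. the rastamat code of Ellis (2005)), and the choice of 13 coefficients per piece is our assumption that reproduces the stated 104-dimensional feature (8 pieces × 13 coefficients).

```python
import numpy as np

SR = 16000            # TIMIT sampling rate (Hz)
WINDOW_SEC = 0.2      # instance length: 0.2 s sliding window
PIECE_SEC = 0.025     # 25 ms pieces
N_MFCC = 13           # assumed; 8 pieces x 13 coefficients = 104 dims

def instance_features(window, mfcc_of_piece):
    """Concatenate per-piece MFCC vectors into one instance feature vector."""
    piece_len = int(PIECE_SEC * SR)
    n_pieces = round(WINDOW_SEC / PIECE_SEC)        # 8 pieces per window
    feats = []
    for k in range(n_pieces):
        piece = window[k * piece_len:(k + 1) * piece_len]
        feats.append(mfcc_of_piece(piece, SR, N_MFCC))
    return np.concatenate(feats)                    # 104-dimensional vector

if __name__ == "__main__":
    fake_mfcc = lambda piece, sr, n: np.zeros(n)    # placeholder for a real MFCC routine
    window = np.random.randn(int(WINDOW_SEC * SR))
    print(instance_features(window, fake_mfcc).shape)  # (104,)
```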
Results. Our first set of experiments involved sweeping over the amount of padding (corresponding to varying the volume of bags). We train with a fixed number of instances per bag, ρ = 4. Results for different training set sizes (m) are shown in the first column of Figure 5. As observed in the synthetic experiments, we see that increasing the padding (volume) leads to poorer generalization for both datasets. This corroborates our basic intuition that learning becomes more difficult with manifolds of larger volume.

In our second set of experiments, the goal was to see how generalization error is affected by varying the number of queried instances per bag, which complements Theorem 4. Results are shown in the middle column of Figure 5. Observe the interplay between m and ρ: increasing either, while keeping the other fixed, drives the error down. Recall, however, that increasing m also requires additional labels while querying more instances per bag does not. The number of instances indeed has a significant impact on generalization – for example, in the audio domain, querying more instances per bag can improve the error by up to 15%. As per our analysis, these results suggest that to fully leverage the training data, we must query many instances per bag. Since training with a large number of instances can become computationally prohibitive, this further justifies the Iterative Querying Heuristic (IQH) described in Section 2.3.1.

Our final set of experiments evaluates the proposed IQH method (see Algorithm 1). The number of training bags, m, was fixed to 1024, and the number of candidate instances per iteration, ω, was fixed to 32 for both datasets. Note that T = 1 corresponds to querying instances and training MILBoost once (i.e. no iterative querying). Results are shown in the right column of Figure 5. These results show that our heuristic works quite well. Consider the highlighted points in both plots: using IQH with T = 4 and just 2 instances per bag during training we are able to achieve comparable test error to the naive method (i.e. T = 1) with 8 instances per bag. Thus, using IQH, we can obtain a good classifier while needing to use less memory and computational resources per iteration.

4. Conclusion

We have presented a new formulation of MIL where bags are manifolds in the instance space, rather than finite sets of instances. This scenario often appears in practice, but has thus far been overlooked in theoretical analysis and algorithm design. We showed that manifold geometry is intimately related to PAC-learnability for this formulation. Our experimental results corroborate the basic intuition of our analysis. Our iterative querying technique enables us to achieve good generalization error while needing to use less memory and computational resources, and should thus be of immediate practical value. We hope that our work encourages further research into leveraging manifold structure in designing MIL algorithms.

Acknowledgments

B.B. was supported by a 2010 Google Fellowship. N.V. was supported in part by NSF Grant CNS-0932403. S.B. was supported by ONR MURI Grant #N00014-08-1-0638 and NSF Grant AGS-0941760.

References

Ali, S. and Shah, M. Human action recognition in videos using kinematic features and multiple instance learning. PAMI, 2008.

Andrews, S., Hofmann, T., and Tsochantaridis, I. Multiple instance learning with generalized support vector machines. A.I., pp. 943–944, 2002.

Anthony, M. and Bartlett, P.L. Neural network learning: Theoretical foundations. Cambridge Univ. Pr., 1999.

Auer, P., Long, P.M., and Srinivasan, A. Approximating hyper-rectangles: Learning and pseudo-random sets. In Proc. of ACM Symposium on Theory of Comp., 1997.

Blum, A. and Kalai, A. A Note on Learning from Multiple-Instance Examples. Mach. Learning, 30(1):23–29, 1998.

Buehler, P., Zisserman, A., and Everingham, M. Learning sign language by watching TV (using weakly aligned subtitles). In CVPR, 2009.

Clarkson, K. Tighter bounds for random projections of manifolds. Comp. Geometry, 2007.

Cohn, D., Atlas, L., and Ladner, R. Improving generalization with active learning. Machine Learning, 1994.

Dalal, N. and Triggs, B. Histograms of oriented gradients for human detection. In CVPR, 2005.

Davis, S. and Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE TASSP, 1980.

Dietterich, T.G., Lathrop, R.H., and Lozano-Perez, T. Solving the multiple-instance problem with axis parallel rectangles. A.I., 1997.

Dollár, P., Tu, Z., Perona, P., and Belongie, S. Integral channel features. In BMVC, 2009.

Ellis, D.P.W. PLP and RASTA (and MFCC, and inversion) in Matlab. http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/, 2005.

Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models. PAMI, 2009.

Garofolo, J.S. et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1, 1993.

Long, P.M. and Tan, L. PAC Learning Axis-aligned Rectangles with Respect to Product Distributions from Multiple-Instance Examples. Machine Learning, 1998.

Mandel, M.I. and Ellis, D.P.W. Multiple-instance learning for music information retrieval. In ISMIR, 2008.

Maron, O. and Lozano-Perez, T. A framework for multiple-instance learning. In NIPS, 1998.

Niyogi, P., Smale, S., and Weinberger, S. Finding the homology of submanifolds with high confidence from random samples. Disc. Computational Geometry, 2006.

Sabato, S. and Tishby, N. Homogeneous multi-instance learning with arbitrary dependence. In COLT, 2009.

Sabato, S., Srebro, N., and Tishby, N. Reducing Label Complexity by Learning From Bags. In AISTATS, 2010.

Saul, L.K., Rahim, M.G., and Allen, J.B. A statistical model for robust integration of narrowband cues in speech. Comp. Speech and Language, 15:175–194, 2001.

Stikic, M. and Schiele, B. Activity recognition from sparsely labeled data using multi-instance learning. In Location and Context Awareness, 2009.

Vapnik, V.N. and Chervonenkis, A.Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Prob. and Its App., 1971.

Viola, P., Platt, J.C., and Zhang, C. Multiple instance boosting for object detection. In NIPS, 2005.

Zhang, Q. and Goldman, S.A. EM-DD: An improved multiple-instance learning technique. In NIPS, 2002.

A. Appendix: Proofs

A.1. Proof of Theorem 2

We will show this for n = 1 and N = 2 (the generalization to N > n ≥ 1 is immediate). We first construct 2^m anchor points p_0, . . . , p_{2^m−1} on a section of a circle in R² that will serve as a guide on how to place m bags b_1, . . . , b_m of dimension n = 1, volume⁴ ≤ V, and condition number ≤ 1/τ in R². We will then show that the class of hyperplanes in R² can realize all possible 2^m labelings of these m bags.

⁴Volume of a 1-dimensional manifold is its length.

Let V_0 := min(V/2, π). Define anchor points p_i := (2τ cos(V_0 i/(2τ 2^m)), 2τ sin(V_0 i/(2τ 2^m))) for 0 ≤ i ≤ 2^m − 1. Observe that the points p_i are on a circle centered at the origin of radius 2τ in R².

We use points p_0, . . . , p_{2^m−1} as guides to place m bags b_1, . . . , b_m in R² that are contained entirely in the disc of radius 2τ centered at the origin and pass through the anchor points as follows. Let k_m^i . . . k_1^i represent the binary representation of the number i (0 ≤ i ≤ 2^m − 1). Place bag b_j such that b_j passes through the anchor point p_i if and only if k_j^i = 1 (see Figure 6 for a visual example with 3 bags and 8 anchor points). Note that since, by construction, the arc (the dotted line in the figure) containing the anchor points has condition number at most 1/(2τ) and volume strictly less than V, the bags b_j can be made to have condition number at most 1/τ with volume at most V.

Figure 6. Placement of arbitrarily smooth bags along a section of a circle. Three bags (colored blue, green and red) go around the eight anchor points p_0, . . . , p_7 (marked by the bit patterns (000) through (111)) in such a way that the hypothesis class of hyperplanes can realize all possible bag labelings.

It is clear that hyperplanes in R² can realize any possible labeling of these m bags. Say we want some arbitrary labeling (+1, +1, 0, . . . , +1). We look at the number i with the same bit representation. Then a hyperplane that is tangent to the circle (centered at the origin and radius 2τ) at the anchor point p_i labels p_i positive, and all other p_k's negative. Note that this hypothesis will also label exactly those bags b_j positive that are passing through the point p_i, and the rest of the bags negative, thus realizing the arbitrary labeling.

A.2. Proof of Theorem 3

Before stating the proof, we give the following useful fact about manifolds with bounded volume and curvature.

Fact 5 [manifold covers – see Section 2.4 of (Clarkson, 2007)] Let M ⊂ R^N be a compact n-dimensional manifold with vol(M) ≤ V and cond(M) ≤ 1/τ. Pick any 0 < ε ≤ τ/2. There exists an ε-covering of M of size at most 2^{c_0 n}(V/ε^n), where c_0 is an absolute constant. That is, there exists C ⊂ M such that |C| ≤ 2^{c_0 n}(V/ε^n) with the property: for all p ∈ M, ∃ q ∈ C such that ‖p − q‖ ≤ ε.

Now, for any domain X, real-valued hypothesis class H ⊂ [0, 1]^X, margin γ > 0 and a sample S ⊂ X, define

cov_γ(H, S) := {C ⊂ H | ∀h ∈ H, ∃h' ∈ C, max_{s∈S} |h(s) − h'(s)| ≤ γ}

as the set of γ-covers of S by H. Let the γ-covering number of H for any integer m > 0 be defined as

N_∞(γ, H, m) := max_{S⊂X : |S|=m} min_{C∈cov_γ(H,S)} |C|.

We will first relate the covering numbers of H_r and H̄_r with the fat-shattering dimension in the following two lemmas.

Lemma 6 [relating hypothesis cover to the fat-shattering dimension – see Theorem 12.8 of (Anthony & Bartlett, 1999)] Let H be a set of real functions from a domain X to the interval [0, 1]. Let γ > 0. Then for m ≥ fat_{γ/4}(H),

N_∞(γ, H, m) < 2 (4m/γ²)^{fat_{γ/4}(H) log(4em/(fat_{γ/4}(H) γ))}.

Lemma 7 [adapted from Lemma 17 of (Sabato & Tishby, 2009)] Let H_r be an instance hypothesis class such that each h_r ∈ H_r is λ-Lipschitz (w.r.t. ℓ2-norm), and let H̄_r be the corresponding bag hypothesis class over B that belongs to the class (V, n, τ). For any γ > 0 and m ≥ 1, we have

N_∞(2γ, H̄_r, m) ≤ N_∞(γ, H_r, m 2^{c_0 n}(V/ε^n)),

where ε = min{τ/2, γ/2, γ/(2λ)}, and c_0 is an absolute constant.
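A small numerical sketch of the construction in A.1 (assuming nothing beyond the formulas above): it places the 2^m anchor points on the circle of radius 2τ and checks that the hyperplane tangent at p_i separates p_i from the remaining anchor points; it does not construct the bags themselves.

```python
import numpy as np

def anchor_points(m, V, tau):
    """Anchor points p_i = (2*tau*cos(V0*i/(2*tau*2^m)), 2*tau*sin(...)), i = 0..2^m - 1."""
    V0 = min(V / 2.0, np.pi)
    i = np.arange(2 ** m)
    ang = V0 * i / (2.0 * tau * 2 ** m)
    return np.stack([2 * tau * np.cos(ang), 2 * tau * np.sin(ang)], axis=1)

def tangent_labels(points, i, tau):
    """Labels induced by the hyperplane tangent to the radius-2*tau circle at p_i."""
    normal = points[i] / (2 * tau)              # outward unit normal at p_i
    return (points @ normal >= 2 * tau - 1e-9).astype(int)

if __name__ == "__main__":
    m, V, tau = 3, 1.0, 0.5
    P = anchor_points(m, V, tau)
    print(tangent_labels(P, i=5, tau=tau))  # only p_5 is labeled 1; the rest lie strictly inside
```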

Proof. Let S = {b_1, . . . , b_m} be a set of m manifold bags. Set ε = min{τ/2, γ/2, γ/(2λ)}. For each bag b_i ∈ S, let C_i be the smallest ε-cover of (the image of) b_i (by Fact 5, we know that |C_i| ≤ 2^{c_0 n}(V/ε^n) for some absolute constant c_0).

Define S^∪ := ∪_i C_i and let R ∈ cov_γ(H_r, S^∪) be some γ-cover of S^∪. Now, for any h_r ∈ H_r, let h̄_r ∈ H̄_r denote the corresponding bag classifier, and define h̃_r(C_i) := max_{c∈C_i} h_r(c) as the maximum attained by h_r on the sample C_i. Then, since h_r is λ-Lipschitz (w.r.t. ℓ2-norm), we have for any bag b_i and its corresponding ε-cover C_i,

|h̄_r(b_i) − h̃_r(C_i)| ≤ λε.

It follows that for any h_r ∈ H_r and h'_r ∈ R such that |h_r(x) − h'_r(x)| ≤ γ for all x ∈ S^∪ (with corresponding bag classifiers h̄_r and h̄'_r in H̄_r),

max_{i∈[m]} |h̄_r(b_i) − h̄'_r(b_i)|
  = max_{i∈[m]} |h̄_r(b_i) − h̃_r(C_i) + h̃_r(C_i) − h̃'_r(C_i) + h̃'_r(C_i) − h̄'_r(b_i)|
  ≤ 2λε + γ ≤ 2γ.

Also, note that for any h_r ∈ H_r and h'_r ∈ R such that |h_r(x) − h'_r(x)| ≤ γ (x ∈ S^∪), we have h̄'_r ∈ {h̄_r | h_r ∈ R} =: R̄. It follows that for any R ∈ cov_γ(H_r, S^∪), {h̄_r | h_r ∈ R} ∈ cov_{2γ}(H̄_r, S). Thus,

{R̄ | R ∈ cov_γ(H_r, S^∪)} ⊂ cov_{2γ}(H̄_r, S).

Hence, we have

N_∞(2γ, H̄_r, m) = max_{S⊂B:|S|=m} min_{R̄∈cov_{2γ}(H̄_r, S)} |R̄|
  ≤ max_{S⊂B:|S|=m} min_{R̄∈{R̄ | R∈cov_γ(H_r, S^∪)}} |R̄|
  = max_{S⊂B:|S|=m} min_{R∈cov_γ(H_r, S^∪)} |R|
  = max_{S⊂I:|S|=|S^∪|} min_{R∈cov_γ(H_r, S)} |R|
  ≤ max_{S⊂I:|S|=m 2^{c_0 n}(V/ε^n)} min_{R∈cov_γ(H_r, S)} |R|
  = N_∞(γ, H_r, m 2^{c_0 n}(V/ε^n)),

where c_0 is an absolute constant.

Now we can relate the empirical error with generalization error by noting the following lemma.

Lemma 8 [generalization error bound for real-valued functions – Theorem 10.1 of (Anthony & Bartlett, 1999)] Suppose that F is a set of real-valued functions defined on the domain X. Let D be any probability distribution on Z = X × {0, 1}, 0 ≤ ε ≤ 1, real γ > 0 and integer m ≥ 1. Then,

Pr_{S_m∼D}[ ∃f ∈ F : err(f) ≥ êrr_γ(f, S_m) + ε ] ≤ 2 N_∞(γ/2, F, 2m) e^{−ε²m/8},

where S_m is an i.i.d. sample of size m from D, err(f) is the error of f with respect to D, and êrr_γ(f, S_m) is the empirical error of f with respect to S_m at margin γ.

Combining Lemmas 8, 7 and 6, we have (for m ≥ fat_{γ/16}(H_r)):

Pr_{S_m∼D_B}[ ∃h̄_r ∈ H̄_r : err(h̄_r) ≥ êrr_γ(h̄_r, S_m) + ε ]
  ≤ 2 N_∞(γ/2, H̄_r, 2m) e^{−ε²m/8}
  ≤ 2 N_∞(γ/4, H_r, m 2^{c_0 n}(V/τ_0^n)) e^{−ε²m/8}
  ≤ 4 (64 · 2^{c_0 n} V m/(γ² τ_0^n))^{d log(16e 2^{c_0 n} V m/(τ_0^n d γ))} e^{−ε²m/8},

where c_0 is an absolute constant, d := fat_{γ/16}(H_r), and τ_0 = min{τ/2, γ/8, γ/(8λ)}. For d ≥ 1, the theorem follows.

A.3. Proof of Theorem 4

We start with the following useful observations that will help in our proof. Notation: for any two points p and q on a manifold M,
• let D_G(p, q) denote the geodesic distance between points p and q;
• let B_G(p, ε) := {p' ∈ M | D_G(p', p) ≤ ε} denote the geodesic ball centered at p of radius ε.

Fact 9 [manifold volumes – see Lemma 5.3 of (Niyogi et al., 2006)] Let M ⊂ R^N be a compact n-dimensional manifold with cond(M) ≤ 1/τ. Pick any p ∈ M and let A := M ∩ B(p, ε), where B(p, ε) is a Euclidean ball in R^N centered at p of radius ε. If A does not contain any boundary points of M, then vol(A) ≥ (cos(arcsin(ε/2τ)))^n vol(B_ε^n), where B_ε^n is a Euclidean ball in R^n of radius ε. In particular, noting that vol(B_ε^n) ≥ ε^{c_0 n} for some absolute constant c_0, if ε ≤ τ we have vol(A) ≥ ε^{c_0 n}.

Fact 10 [relating geodesic distances to ambient Euclidean distances – see Prop. 6.3 of (Niyogi et al., 2006)] Let M ⊂ R^N be a compact manifold with cond(M) ≤ 1/τ. If p, q ∈ M are such that ‖p − q‖ ≤ τ/2, then D_G(p, q) ≤ 2‖p − q‖.

Lemma 11 Let M ⊂ R^N be a compact n-dimensional manifold with vol(M) ≤ V and cond(M) ≤ 1/τ. Let µ(M) denote the uniform probability measure over M. Define F(M, ε) := {B_G(p, ε) : p ∈ M and B_G(p, ε) contains no points from the boundary of M}, that is, the set of all geodesic balls of radius ε that are contained entirely in the interior of M. Let τ_0 ≤ τ and ρ ≥ 1. Let p_1, . . . , p_ρ be ρ independent draws from µ(M). Then,

Pr_{p_1,...,p_ρ∼µ(M)}[ ∃F ∈ F(M, τ_0) : ∀i, p_i ∉ F ] ≤ 2^{c_0 n}(V/τ_0^n) e^{−ρ(τ_0^{c_0 n}/V)},

where c_0 is an absolute constant.

Proof. Let M° denote the interior of M (i.e., it contains all points of M that are not at the boundary). Let q_0 ∈ M be any fixed point such that B_G(q_0, τ_0/2) ⊂ M°. Then, by Facts 9 and 10 we know that vol(B_G(q_0, τ_0/2)) ≥ τ_0^{c_0 n}. Observing that M has volume at most V, we immediately get that B_G(q_0, τ_0/2) occupies at least a τ_0^{c_0 n}/V fraction of M. Thus

Pr_{p_1,...,p_ρ∼µ(M)}[ ∀i, p_i ∉ B_G(q_0, τ_0/2) ] ≤ (1 − τ_0^{c_0 n}/V)^ρ,

where µ(M) denotes the uniform probability measure on M.

Now, let C ⊂ M be a (τ_0/2)-geodesic covering of M. Using Facts 5 and 10, we can have |C| ≤ 2^{c_1 n}(V/τ_0^n) (where c_1 is an absolute constant). Define C' ⊂ C as the set {c ∈ C : B_G(c, τ_0/2) ⊂ M°}. Then by union bounding over points in C', we have

Pr_{p_1,...,p_ρ∼µ(M)}[ ∃c' ∈ C' : ∀i, p_i ∉ B_G(c', τ_0/2) ] ≤ |C'| (1 − τ_0^{c_0 n}/V)^ρ.

Equivalently we can say that, with probability at least 1 − |C'| e^{−τ_0^{c_0 n} ρ/V}, for all c' ∈ C' there exists p_i ∈ {p_1, . . . , p_ρ} such that p_i ∈ B_G(c', τ_0/2).

Now, pick any F ∈ F(M, τ_0), and let q ∈ M denote its center (i.e., q such that B_G(q, τ_0) = F). Then since C is a (τ_0/2)-geodesic cover of M, there exists c ∈ C such that D_G(q, c) ≤ τ_0/2. Also, note that c belongs to the set C', since B_G(c, τ_0/2) ⊂ B_G(q, τ_0) = F ⊂ M°. Thus with probability ≥ 1 − |C'| e^{−τ_0^{c_0 n} ρ/V}, there exists p_i such that

p_i ∈ B_G(c, τ_0/2) ⊂ B_G(q, τ_0) = F.

Observe that since the choice of F was arbitrary, we have that for any F ∈ F (uniformly), there exists p_i ∈ {p_1, . . . , p_ρ} such that p_i ∈ F. The lemma follows.

Lemma 12 Let B belong to class (V, n, τ). Fix a sample of size m, {b_1, . . . , b_m} = S_m ⊂ B, and let ∂b_i denote the boundary of the manifold bag b_i ∈ S_m. Define 1/κ := max_{b_i∈S_m}{cond(∂b_i)}. Now let p_1^i, . . . , p_ρ^i be the ρ independent instances drawn uniformly from (the image of) b_i. Let H_r be a λ-Lipschitz (w.r.t. ℓ2-norm) hypothesis class. Then, for any ε ≤ min{τ/32, κ/8},

Pr[ ∃h_r ∈ H_r, ∃b_i ∈ S_m : |h̄_r(b_i) − max_{j∈[ρ]} h_r(p_j^i)| > 9λε ] ≤ m 2^{c_0 n}(V/ε^n) e^{−ρ ε^{c_0 n}/V},

where c_0 is an absolute constant.

Proof. Fix a bag b_i ∈ S_m, and let M denote the manifold b_i. Quickly note that cond(M) ≤ 1/τ. Define M_{2ε} := {p ∈ M : min_{q∈∂M} D_G(p, q) ≥ 2ε}. By recalling that cond(∂M) ≤ 1/κ and ε ≤ min{τ/32, κ/8}, it follows that i) M_{2ε} is non-empty, ii) ∀x ∈ M \ M_{2ε}, min_{y∈M_{2ε}} D_G(x, y) ≤ 8ε.

Observe that for all p ∈ M_{2ε}, B_G(p, ε) is in the interior of M. Thus by applying Lemma 11, we have:

Pr_{p_1,...,p_ρ∼µ(M)}[ ∃p ∈ M_{2ε} : ∀i, p_i ∉ B_G(p, ε) ] ≤ 2^{c_0 n}(V/ε^n) e^{−ρ(ε^{c_0 n}/V)},

where µ(M) denotes the uniform probability measure on M.

Now for any h_r ∈ H_r, let x* := arg max_{p∈M} h_r(p). Then with the same failure probability, we have that there exists some p_i ∈ {p_1, . . . , p_ρ} such that D_G(p_i, x*) ≤ 9ε. To see this, consider: if x* ∈ M_{2ε}, then D_G(x*, p_i) ≤ ε (for some p_i ∈ {p_1, . . . , p_ρ}); otherwise, if x* ∈ M \ M_{2ε}, then there exists q ∈ M_{2ε} such that D_G(x*, q) ≤ 8ε. Noting that h_r is λ-Lipschitz, and union bounding over the m bags, the lemma follows.

By Theorem 3 we have for any 0 < γ < 1, with probability at least 1 − δ_1 over the sample S_m, for every h̄_r ∈ H̄_r:

err(h̄_r) ≤ êrr_γ(h̄_r, S_m) + O( sqrt( (n fat_{γ/16}(H_r)/m) log²(Vm/(γ² τ_0^n)) + (1/m) ln(1/δ_1) ) ),

where τ_0 = min{τ/2, γ/8, γ/(8λ)}.

By applying Lemma 12 (with ε set to τ_1 = min{τ/32, κ/8, γ/(9λ), γ/9}), it follows that if ρ ≥ Ω( (V/τ_1^{c_0 n}) ( n + ln(mV/(δ_2 τ_1^n)) ) ), then with probability 1 − δ_2: êrr_γ(h̄_r, S_m) ≤ êrr_{2γ}(h̄_r, S_{m,ρ}), yielding the theorem.

Figure 7. Audio Results using MI-SVM: three different experiments (columns) – varying padding (volume), number of queried instances, and number of IQH iterations – on the TIMIT Phonemes dataset; see text for details. Note that x-axes are in logarithmic scale. All reported results are averages over 5 trials.

B. Appendix: Synthetic Dataset Generation

Figure 8. Synthetic bags. An example of a synthetic 1-dimensional manifold of a specified volume (length) V and curvature 1/τ generated by our procedure.

We generate a 1-dimensional manifold (in R²) of curvature 1/τ and volume (length) V as follows (see also Figure 8).

1. Pick a circle with radius τ, a point p on the circle, and a random angle θ (less than π).
2. Choose a direction (either clockwise or counter-clockwise) and trace out an arc of length θτ starting at p and ending at a point, say, q.
3. Now pick another circle of the same radius τ that is tangent to the original circle at the point q.
4. Repeat the process of tracing out another arc on the new circle, starting at point q and going in the reverse direction.
5. Terminate this process once we have a manifold of volume V.

Notice that this procedure can potentially result in a curve that intersects itself or has condition number greater than 1/τ. If this happens, we simply reject such a curve and generate another random curve until we have a well-conditioned manifold (a code sketch of the procedure is given at the end of this appendix).

To generate a higher dimensional manifold, we extend our 1-dimensional manifold (M ⊂ R²) in the extra dimensions by taking a cross product with a cube: M × [0, 1]^{n−1}. Notice that the "cube"-extension does not alter the condition number (i.e. it remains 1/τ). Since the resulting manifold fills up only n + 1 dimensions, we randomly rotate it in the ambient space.

Now, to label the generated manifolds positive and negative, we first fix h* to be a vertical hyperplane (w.r.t. the first coordinate) in R^N. To label a manifold b negative, we translate it such that the entire manifold lies in the negative region induced by h*. And to label it positive, we translate it such that a part of b lies in the positive region induced by h*.
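A sketch of steps 1–5 above for a single 1-dimensional manifold of curvature 1/τ and length V; the self-intersection/condition-number rejection test is omitted for brevity, and the discretization step size is an arbitrary choice of ours.

```python
import numpy as np

def generate_curve(tau, V, step=0.01, rng=None):
    """Steps 1-5: chain circular arcs of radius tau until the curve has length V."""
    rng = np.random.default_rng() if rng is None else rng
    pts = [np.zeros(2)]                    # step 1: start at a point p on the first circle
    center = np.array([0.0, tau])          # center of the first circle of radius tau
    direction = 1.0                        # step 2: +1 / -1 = counter-clockwise / clockwise
    length = 0.0
    while length < V:                      # step 5: terminate once the volume (length) is V
        theta = min(rng.uniform(0.0, np.pi), (V - length) / tau)   # random arc angle < pi
        phi0 = np.arctan2(pts[-1][1] - center[1], pts[-1][0] - center[0])
        n_steps = max(int(np.ceil(theta / step)), 1)
        for t in np.linspace(theta / n_steps, theta, n_steps):
            phi = phi0 + direction * t
            pts.append(center + tau * np.array([np.cos(phi), np.sin(phi)]))
        length += theta * tau
        center = 2 * pts[-1] - center      # steps 3-4: next circle is tangent at the endpoint
        direction *= -1.0                  # trace the next arc in the reverse direction
    return np.array(pts)

if __name__ == "__main__":
    curve = generate_curve(tau=0.2, V=1.0, rng=np.random.default_rng(0))
    print(curve.shape)  # a discretized 1-D bag in R^2 of length ~1.0
```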

C. Appendix: Additional Experiments with MI-SVM

We repeated the suite of experiments on the TIMIT dataset with the MI-SVM algorithm (Andrews et al., 2002) (implemented using the LIBSVM package⁵). Figure 7 shows the results. Upon comparison with the MILBoost results in Figure 5 we observe that the general trends are quite similar. This reinforces the fact that our results should generalize to most MIL algorithms.

⁵http://www.csie.ntu.edu.tw/~cjlin/libsvm/