Supplementary Material:
Webly Supervised Learning of Convolutional Networks

Anonymous ICCV submission
Paper ID 465

In the supplementary material we include:

1. Additional results on PASCAL VOC for ablation analysis.
2. Scene classification results.
3. Diagnosis results for webly supervised object detection using [4].
4. Lists of objects, scenes and attributes.

1. Additional Results on PASCAL VOC

For the ablation analysis, we provide more results on the PASCAL VOC 2007 detection challenge; please refer to Table 1 for the comparison. The newly added results are shown in the bottom rows. Following the notation in the paper, "NFT" means before fine-tuning and "FT" means after fine-tuning (100K iterations with a step size of 20K). We add three pairs of new results:

GoogleO-CI: "CI" stands for "Cleaned Images". This network is obtained by fine-tuning GoogleO on the images selected by our object localization algorithm (described in Section 3.3). Here, an entire image is regarded as clean if at least one object is found inside it. However, the extra location information (the bounding box) is not used: the input is still the full image.

GoogleO-CB: "CB" stands for "Cleaned images with Bounding boxes". Similar to GoogleO-CI, the network is fine-tuned on the cleaned images. However, instead of the entire image, the image patches cropped by the discovered bounding boxes are fed into the network.

Both GoogleO-CI and GoogleO-CB were fine-tuned for 200K iterations, with the learning rate reduced every 40K iterations.

IN-GoogleA: To see whether more data helps performance, we fine-tuned the ImageNet pretrained network [3] on all the images downloaded from Google. The fine-tuning was performed for 400K iterations, reducing the learning rate every 80K iterations.
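All of the fine-tuning runs above follow the same step schedule: train for a fixed number of iterations and periodically drop the learning rate. A minimal sketch of such a schedule in Python; the base learning rate and decay factor below are illustrative assumptions, not values reported in the paper.

```python
def step_lr(base_lr, gamma, step_size, iteration):
    # Caffe-style "step" policy: multiply the learning rate by gamma
    # every step_size iterations.
    return base_lr * (gamma ** (iteration // step_size))

# The GoogleO-CI/CB schedule from the text: 200K iterations, learning
# rate reduced every 40K. base_lr=0.001 and gamma=0.1 are assumed values.
for it in range(0, 200_000, 40_000):
    print(it, step_lr(base_lr=1e-3, gamma=0.1, step_size=40_000, iteration=it))
```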
Somewhat to our surprise, we found that the extra clean-up step does not improve detection performance. In fact, the average precision dropped for most of the categories, regardless of whether the bounding box information is used. We suspect one reason lies in the size of the data: after the object localization step, the number of images is cut by more than half (~0.67M compared to ~1.5M). Better algorithms could also be devised for cleaning up web images.

On the other hand, fine-tuning the ImageNet pretrained model on Google images gives a slightly better result than training from scratch. However, further investigation is still needed here, since IN-GoogleA has seen more images (the 1M from ImageNet) than GoogleA.

Table 1. Additional results on VOC 2007 (PASCAL data used). Per-class average precision (%) and mAP (%) on the VOC 2007 test set.

VOC 2007 test     aero bike bird boat bottle bus  car  cat  chair cow  table dog  horse mbike person plant sheep sofa train tv   mAP
ImageNet-NFT [3]  57.6 57.9 38.5 31.8 23.7  51.2 58.9 51.4 20.0  50.5 40.9  46.0 51.6  55.9  43.3   23.3  48.1  35.3 51.0  57.4 44.7
GoogleO-NFT       57.1 59.9 35.4 30.5 21.9  53.9 59.5 40.7 18.6  43.3 37.5  41.9 49.6  57.7  38.4   22.8  45.2  37.1 48.0  54.5 42.7
GoogleA-NFT       54.9 58.2 35.7 30.7 22.0  54.5 59.9 44.7 19.9  41.0 34.5  40.1 46.8  56.2  40.0   22.2  45.8  36.3 47.5  54.2 42.3
Flickr-NFT        55.3 61.9 39.1 29.5 24.8  55.1 62.7 43.5 22.7  49.3 36.6  42.7 48.9  59.7  41.2   25.4  47.7  41.9 48.8  56.8 44.7
VOC-Scratch [1]   49.9 60.6 24.7 23.7 20.3  52.5 64.8 32.9 20.4  43.5 34.2  29.9 49.0  60.4  47.5   28.0  42.3  28.6 51.2  50.0 40.7
ImageNet-FT [3]   64.2 69.7 50.0 41.9 32.0  62.6 71.0 60.7 32.7  58.5 46.5  56.1 60.6  66.8  54.2   31.5  52.8  48.9 57.9  64.7 54.2
GoogleO-FT        65.0 68.1 45.2 37.0 29.6  65.4 73.8 54.0 30.4  57.8 48.7  51.9 64.1  64.7  54.0   32.0  54.9  44.5 57.0  64.0 53.1
GoogleA-FT        64.2 68.3 42.7 38.7 26.5  65.1 72.4 50.7 28.5  60.9 48.8  51.2 60.2  65.5  54.5   31.1  50.5  48.5 56.3  60.3 52.3
Flickr-FT         63.7 68.5 46.2 36.4 30.2  68.4 73.9 56.9 31.4  59.1 46.7  52.4 61.5  69.2  53.6   31.6  53.8  44.5 58.1  59.6 53.3
GoogleO-CB-NFT    56.8 56.5 33.8 27.5 22.7  51.3 55.1 41.0 15.6  42.3 31.0  37.7 46.8  55.2  38.8   23.0  41.5  30.4 48.1  51.2 40.3
GoogleO-CB-FT     67.2 69.7 45.9 36.5 31.7  64.9 73.9 54.2 30.4  59.9 43.6  49.6 59.1  66.3  52.7   30.6  52.3  42.5 59.6  62.4 52.7
GoogleO-CI-NFT    55.7 57.1 33.6 27.4 22.6  55.6 57.3 39.3 17.6  39.7 35.0  35.5 46.9  57.0  39.5   21.0  43.1  31.1 49.4  52.1 40.8
GoogleO-CI-FT     63.2 67.0 41.2 34.7 30.7  66.7 73.1 56.5 30.2  58.0 44.6  50.4 57.1  67.4  52.6   31.2  53.1  41.9 60.1  61.9 52.1
IN-GoogleA-NFT    55.0 60.2 39.5 33.4 23.8  58.7 60.6 48.6 23.2  44.9 35.8  44.1 50.3  60.1  42.9   23.9  49.0  39.8 49.7  54.4 44.9
IN-GoogleA-FT     64.6 69.7 44.1 34.8 28.4  66.1 72.1 56.2 30.7  57.0 45.8  50.4 61.9  66.5  54.8   31.7  58.0  43.0 56.8  62.9 52.8

2. Scene Classification

To further demonstrate the usefulness of CNN features learned directly from the web, we also conducted a set of scene classification experiments on the MIT Indoor-67 dataset [5]. For each image, we simply computed the fc7 feature vector, which has 4096 dimensions. We did not use any data augmentation or spatial pooling technique; the only pre-processing step was normalizing the feature vector to unit ℓ2 length [6]. The default SVM parameters (C=1) were fixed throughout the experiments.

Table 2 summarizes the results on the default train/test split. We can see that our web-based CNNs achieved very competitive performance: all three networks reached an accuracy at least on par with ImageNet pretrained models. Fine-tuning on hard images enhanced the features, but adding scene-related categories gave a huge boost to 66.5 (comparable to the CNN trained on the Places database [9], 68.2). This indicates that CNN features learned directly from the web are indeed generic.

Moreover, since we can get semantic labels (e.g. actions) other than objects or scenes from the web for free, webly supervised CNNs have great potential to perform well on many related tasks, and the cost is just as low as providing a category list to query for that domain.
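For concreteness, the Indoor-67 pipeline described above (pre-extracted fc7 features, ℓ2 normalization, and a linear SVM with the default C=1) is only a few lines. A minimal sketch using scikit-learn; the feature file names and the choice of LinearSVC are illustrative assumptions, since the paper does not name a specific SVM implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical pre-extracted fc7 features: (n_images, 4096) float arrays
# with integer scene labels. The file names are placeholders.
X_train = np.load("indoor67_fc7_train.npy")
y_train = np.load("indoor67_labels_train.npy")
X_test = np.load("indoor67_fc7_test.npy")
y_test = np.load("indoor67_labels_test.npy")

# The only pre-processing step in the paper: scale each feature vector
# to unit L2 length.
X_train /= np.linalg.norm(X_train, axis=1, keepdims=True)
X_test /= np.linalg.norm(X_test, axis=1, keepdims=True)

# Default SVM parameters (C=1), trained one-vs-rest over the 67 classes.
clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)
print("Indoor-67 accuracy:", clf.score(X_test, y_test))
```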
[Figure 1: false positive analysis following the diagnosis tool of [4]. Four panels (animals, furniture, person, vehicles) plot the percentage of each false positive type (Loc, Sim, Oth, BG) as the total number of false positives grows from 25 to 3200.]
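For reference, the diagnosis tool of [4] assigns each false positive to one of the four types plotted above based on its overlap with ground-truth boxes. A simplified sketch of that assignment; the 0.1 overlap threshold follows the convention of [4], while the function names and the similar-category set are illustrative assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def false_positive_type(det_box, det_class, gt, similar_classes):
    """Classify a false positive as Loc / Sim / Oth / BG in the spirit of [4].
    gt is a list of (box, class) pairs; similar_classes is the set of
    categories considered similar to det_class."""
    best_same = best_sim = best_oth = 0.0
    for box, cls in gt:
        o = iou(det_box, box)
        if cls == det_class:
            best_same = max(best_same, o)
        elif cls in similar_classes:
            best_sim = max(best_sim, o)
        else:
            best_oth = max(best_oth, o)
    if best_same >= 0.1:   # right class but poorly localized (or duplicate)
        return "Loc"
    if best_sim >= 0.1:    # confused with a similar category
        return "Sim"
    if best_oth >= 0.1:    # confused with a dissimilar category
        return "Oth"
    return "BG"            # fired on background

# Example: a "dog" detection overlapping only a cat box by IoU 0.4 -> "Sim".
print(false_positive_type((0, 0, 10, 10), "dog", [((0, 0, 10, 4), "cat")], {"cat"}))
```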