Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset
Menglin Jia*1, Mengyun Shi*1,4, Mikhail Sirotenko*3, Yin Cui*3, Claire Cardie1, Bharath Hariharan1, Hartwig Adam3, Serge Belongie1,2
1Cornell University 2Cornell Tech 3Google Research 4Hearst Magazines
1 Supplementary Material
In our work we presented the new task of instance segmentation with attribute localization. We introduced a new ontology and dataset, Fashionpedia, to further describe the various aspects of this task. We also proposed a novel evaluation metric and the Attribute-Mask R-CNN model for this task. In this supplementary material, we provide the following items that shed further insight on these contributions:

– More comprehensive experimental results (§ 1.1)
– An extended discussion of the Fashionpedia ontology and potential knowledge graph applications (§ 1.2)
– More details of the dataset analysis (§ 1.3)
– Additional information on the annotation process (§ 1.4)
– Discussion (§ 1.5)
1.1 Attribute-Mask R-CNN
Per-class evaluation. Fig. 1 presents detailed evaluation results per supercategory and per category. In Fig. 1, we follow the same metrics as the COCO leaderboard (AP, AP50, AP75, APl, APm, APs, AR@1, AR@10, AR@100, ARs@100, ARm@100, ARl@100), with the τIoU and τF1 constraints where applicable. Fig. 1 shows that metrics using both the τIoU and τF1 constraints are consistently lower than those using τIoU alone, across all supercategories and categories. This further demonstrates the challenging nature of our proposed task. In general, categories belonging to "garment parts" have lower AP and AR compared with "outerwear" and "accessories".

A detailed breakdown of detection errors is presented in Fig. 2 for the supercategories and three main categories. In terms of Fashionpedia supercategories, "outerwear" errors are dominated by within-supercategory class confusion (Fig. 2(a)). Within this supercategory, ignoring localization errors would only raise AP slightly, from 77.5 to 79.1 (+1.6). A similar trend can be observed for the class "skirt", which belongs to "outerwear" (Fig. 2(d)). Detection errors for "part" (Fig. 2(b), 2(e)) and "accessory" (Fig. 2(c), 2(f)), on the other hand, are
* Equal contribution.
dominated by both background confusion and localization errors. "part" also has a lower AP in general compared with the other two supercategories. A possible reason is that objects belonging to "part" usually have smaller sizes and lower counts.

Fig. 1: Detailed results (for masks) using Mask R-CNN with a SpineNet-143 backbone. We present the same metrics as the COCO leaderboard for all categories, the three supercategories of apparel objects, and the 46 fine-grained apparel categories. We use both constraints (for example, AP_IoU and AP_IoU+F1) where applicable. For categories without attributes, the value represents AP_IoU or AR_IoU. "top" is short for "top, t-shirt, sweatshirt"; "head acc" is short for "headband, head covering, hair accessory".
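To make the joint constraint concrete, here is a minimal pure-Python sketch of the matching predicate implied by the metric (the function and variable names are our own, not from the released evaluation code):

```python
def attribute_f1(pred_attrs, gt_attrs):
    """Per-mask F1 between predicted and ground-truth attribute sets."""
    pred, gt = set(pred_attrs), set(gt_attrs)
    tp = len(pred & gt)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gt)
    return 2 * precision * recall / (precision + recall)

def counts_as_true_positive(iou, pred_attrs, gt_attrs,
                            tau_iou=0.5, tau_f1=0.5):
    # Under the IoU+F1 setting, a detection matches a ground-truth mask
    # only if BOTH thresholds are cleared; dropping the F1 clause
    # recovers the IoU-only setting.
    return iou >= tau_iou and attribute_f1(pred_attrs, gt_attrs) >= tau_f1
```

Because the predicate only adds a further condition, every match under IoU+F1 is also a match under IoU alone, which is why the joint metric can never exceed the IoU-only metric.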
[Precision-recall curves; panels: (a) supercategory: outerwear, (b) supercategory: parts, (c) supercategory: accessory, (d) outerwear: skirt, (e) parts: pocket, (f) accessory: tights, stockings.]
Fig. 2: Main apparel detector analysis. Each plot shows 7 precision-recall curves, where each evaluation setting is more permissive than the previous. Specifically, C75: strict IoU (τIoU = 0.75); C50: PASCAL IoU (τIoU = 0.5); Loc: localization errors ignored (τIoU = 0.1); Sim: supercategory false positives (FPs) removed; Oth: category FPs removed; BG: background (and class confusion) FPs removed; FN: false negatives removed. The first row (overall-[supercategory]-[size]) contains results for the three supercategories in Fashionpedia; the second row ([supercategory]-[category]-[size]) shows results for three fine-grained categories (one per supercategory). Legends present the area under each curve (corresponding to the AP metric) in brackets.
F1 score calculation. Since we measure the F1 score between predicted and ground-truth attributes per mask, we consider both options: multi-label multi-class classification with 294 classes for one instance, and binary classification for 294 instances. Multi-label multi-class classification is a straightforward choice, as it is a common setting for most fine-grained classification tasks. In the binary classification scenario, we treat the 1s and 0s of the multi-hot encodings of both predictions and ground-truth labels as the positive and negative classes, respectively. There are also two averaging choices: "micro" and "macro". "Micro" averaging calculates the score globally by counting the total true positives, false negatives, and false positives. "Macro" averaging calculates the metric for each attribute class and reports the unweighted mean. In sum, there are four options for F1-score averaging: 1) "micro", 2) "macro", 3) "binary-micro", 4) "binary-macro".

Fig. 3: AP_{IoU+F1}^{F1=τF1} score with different τF1 for the four averaging options. The values presented are averaged over τIoU ∈ [0.5 : 0.05 : 0.95]. We use "binary-macro" as our main metric.

As shown in Fig. 3, we present AP_{IoU+F1}^{F1=τF1} with τIoU averaged over the range [0.5 : 0.05 : 0.95] and τF1 increased from 0.0 to 1.0 with an increment of 0.01. Fig. 3 illustrates that as τF1 increases, AP_{IoU+F1}^{F1=τF1} decreases at different rates for the different choices of F1 calculation. There are 294 attributes in total, with an average of 3.7 attributes per mask in the Fashionpedia training data. It is not surprising that "binary-micro" produces high F1 scores in general (higher than 0.97), as the AP_{IoU+F1} score only decreases once τF1 ≥ 0.97. On the other hand, "macro" averaging in the multi-label multi-class scenario gives extremely low F1 scores (0.01–0.03). This further demonstrates the room for improvement on the localized attribute classification task. We use "binary-macro" as our main metric.

Result visualization. Fig. 4 shows that our simple baseline model can detect most apparel categories correctly. However, it sometimes produces false positives. For example, it segments legs as tights and stockings (Fig. 4(f)); a possible reason is that both objects have the same shape and stockings are worn on the legs.

Predicting fine-grained attributes, on the other hand, is a more challenging problem for the baseline model. We summarize several issues: (1) predicting more attributes than needed (Fig. 4(a), (b), (c)); (2) failing to distinguish among fine-grained attributes, for example, dropped-shoulder sleeve (ground truth) vs. set-in sleeve (predicted) (Fig. 4(e)); (3) other false positives: Fig. 4(e) has a double-breasted opening, yet the model predicted a zip opening. These results further show that there is room for improvement and future development of more advanced computer vision models on this instance segmentation with attribute localization task.
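The four averaging options discussed above can be sketched in pure Python on a toy example (8 attribute classes instead of 294; the multi-hot vectors below are made up for illustration and do not come from the dataset):

```python
def f1(tp, fp, fn):
    # F1 from counts; defined as 0 when there are no positives at all
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy multi-hot vectors for ONE mask (8 attribute classes for brevity).
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

# "micro": count TP/FP/FN globally over the positive entries.
tp = sum(t & p for t, p in zip(y_true, y_pred))
fp = sum(p & (1 - t) for t, p in zip(y_true, y_pred))
fn = sum(t & (1 - p) for t, p in zip(y_true, y_pred))
micro = f1(tp, fp, fn)

# "macro": per-attribute-class F1, unweighted mean
# (with a single mask, each class contributes one binary decision).
macro = sum(f1(t & p, p & (1 - t), t & (1 - p))
            for t, p in zip(y_true, y_pred)) / len(y_true)

# "binary-*": treat the 1s/0s of the multi-hot encoding as the
# positive/negative classes of 8 binary classifications.
tn = sum((1 - t) & (1 - p) for t, p in zip(y_true, y_pred))
binary_micro = f1(tp + tn, fp + fn, fn + fp)          # equals accuracy
binary_macro = (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2  # mean over the 2 classes
```

With 294 mostly-negative entries per mask, the binary-micro count is dominated by true negatives (hence F1 scores near 1), while macro averaging mixes in hundreds of zero-valued per-class F1 scores (hence 0.01–0.03), consistent with the curves in Fig. 3.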
Fig. 4: Baseline results on the Fashionpedia validation set. Masks, bounding boxes, and apparel categories (category score > 0.6) are shown. Attributes from the top 10 masks (that contain attributes) in each image are also shown. Correct predictions of objects and attributes are bolded.
Result visualization on other datasets. Other fashion datasets such as ModaNet and DeepFashion2 also contain instance segmentation masks. Aside from the quantitative results on Fashionpedia presented in the main paper (see Table 3 there), we present a qualitative analysis of the segmentation masks generated on Fashionpedia (Fig. 4), ModaNet (Fig. 5(a-f)), and DeepFashion2 (Fig. 5(g-l)). Photos in the first row of Fig. 5 are from ModaNet. They show that the quality of the generated masks on ModaNet is fairly good and comparable to Fashionpedia in general (Fig. 5(a)). We also have a couple of observations about the failure cases: (1) failure to detect apparel objects: for example, the shoe in Fig. 5(c) is not detected, and parts of the pants (Fig. 5(c)) and coat (Fig. 5(d)) are not detected; (2) failure to detect some categories: Fig. 5(e) shows that the shoes on the shoe rack and on the right foot are not detected, possibly due to a lack of such instances in the ModaNet training dataset. Similar to
Fig. 5: Baseline results on the ModaNet and DeepFashion2 validation sets
Fig. 6: Generated masks on online-shopping images [2]. (a) and (b) show the same type of shoes in different settings. Our model correctly detects and categorizes the pair of shoes worn by a fashion model, yet mistakenly detects the shoes as a jacket and a bag in (b).
Fashionpedia, ModaNet mostly consists of street-style images (see Fig. 6(b) for example predictions from a model trained on Fashionpedia); (3) close-up images: ModaNet contains mostly full-body images, which is a possible reason for the decreased quality of predicted masks on close-up shots like Fig. 5(f).

For DeepFashion2 (Fig. 5(g,h,k)), the generated segmentation masks tend not to tightly follow the contours of garments in the images. The main reason is possibly that the average number of vertices per polygon is 14.7 for DeepFashion2, which is lower than for Fashionpedia and ModaNet (see Table 2 in the main text). Our qualitative analysis also shows that: (1) the model generates segmentation masks for pants (Fig. 5(i)) and tops (Fig. 5(j)) that are not visible in the images (both are covered by a jacket); we find that in DeepFashion2, some garment parts covered by other objects are indeed annotated with segmentation masks; (2) the model performs better on objects that are not on a human body (Fig. 5(l)): DeepFashion2 contains many commercial-customer image pairs (images both with and without a human body) in its training set, whereas both Fashionpedia and ModaNet contain more images with a human body than without.

Generalization to other image domains. We also run inference on images found on online shopping websites, which usually display a single apparel item, with or without a fashion model. We found that the learned model works reasonably well when the apparel item is worn by a model (Fig. 6).
1.2 Fashionpedia Ontology and Knowledge Graph
Fig. 7 presents our Fashionpedia ontology in detail. Figs. 8 and 9 display the training-data mask counts per category and per attribute. Utilizing the proposed ontology and the image dataset, a large-scale fashion knowledge graph can be constructed to represent the fashion world at the product level. Fig. 10 illustrates a subset of the Fashionpedia knowledge graph.

Apparel graphs. Integrating the main garments, garment parts, attributes, and relationships presented in one outfit ensemble, we can create an apparel graph representation for each outfit in an image. Each apparel graph is a structured representation of an outfit ensemble containing certain types of garments. Nodes in the graph represent the main garments, garment parts, and attributes. Main garments and garment parts are linked to their respective attributes through different types of relationships. Fig. 11 shows more image examples with apparel graphs.

Fashionpedia knowledge graph. While apparel graphs are localized representations of particular outfit ensembles in fashion images, we can also create a single Fashionpedia knowledge graph (Fig. 10). The Fashionpedia knowledge graph is the union of all apparel graphs and includes all main garments, garment parts, attributes, and relationships in the dataset. In this way, we are able to represent and understand fashion images in a more structured way.
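The apparel-graph structure can be sketched with plain Python sets of typed edges, following the CONTAIN and HAS_ATTRIBUTE relationship types shown in Fig. 10 (the garment and attribute names below are illustrative, not taken from a specific annotation):

```python
def apparel_graph(garment, parts, attributes):
    """Build one outfit item's graph as a set of (head, tail, relation) edges."""
    edges = {(garment, part, "CONTAIN") for part in parts}
    for node, attrs in attributes.items():
        # garments and garment parts link to their localized attributes
        edges |= {(node, a, "HAS_ATTRIBUTE") for a in attrs}
    return edges

jacket = apparel_graph(
    "jacket",
    parts=["pocket", "sleeve"],
    attributes={"jacket": ["zip-up"], "sleeve": ["set-in sleeve"]},
)
skirt = apparel_graph("skirt", parts=[], attributes={"skirt": ["mini (length)"]})

# The Fashionpedia knowledge graph is the union of all apparel graphs.
knowledge_graph = jacket | skirt
```

Representing edges as a set makes the union of per-outfit graphs a one-line operation, mirroring how the dataset-level knowledge graph is defined above.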
(a) Categories (b) Attributes (partially)
Fig. 7: Hierarchy of apparel categories (a) and fine-grained attributes (b) in Fashionpedia
Fig. 8: Mask counts per apparel category in training data. "head acc" is short for "headband, head covering, hair accessory".
We expect our Fashionpedia knowledge graph and the database to be applicable to extending existing knowledge graphs (such as WikiData [1]) with novel domain-specific knowledge, improving underlying fashion product recommendation systems, enhancing search-engine results for fashion visual search, resolving ambiguous fashion-related words in text search, and more.
1.3 Dataset Analysis
Fig. 11 shows more annotation examples, represented as exploded views of annotation diagrams. Table 1 displays the details of the "not sure" and "not on the list" results from the attribute annotation process, per super-category of attributes. The label "not sure" means the expert annotator was uncertain about the choice given the segmentation mask. "Not on the list" means the annotator was certain that the given mask presents another attribute that is not present in the Fashionpedia ontology. Other than "nicknames" (the specific names for certain apparel categories), less than 6% of the total masks fall into the "not on the list" category. Figs. 12 and 13 also compare Fashionpedia and other image datasets in terms of image size and vertices per polygon.

We compare image resolutions between Fashionpedia and four other segmentation datasets (COCO and LVIS share the same images). Fig. 12 shows that images in Fashionpedia have the most diverse widths and heights, while ModaNet has the most consistent image resolutions. Note that high-resolution images burden the data-loading process during training. With that in mind, we will release our dataset in both resized and original versions.
[Bar chart (a): mask counts for apparel attributes in the "nickname" super-category, log scale.]
Fig. 9: Mask counts per attribute in training data, grouped by super-category. Best viewed digitally.
[Bar charts (a)-(h): mask counts for apparel attributes in the super-categories length, neckline type, opening type, waistline, textile pattern, textile finishing / manufacturing techniques, non-textile material type, and silhouette; log scale.]
Fig. 9: Mask counts per attribute in training data, grouped by super-category (cont.). Best viewed digitally.
[Graph visualization: 20 product nodes linked to apparel categories and garment parts via CONTAIN edges, with categories and parts linked to fine-grained attributes via HAS_ATTRIBUTE edges.]
Fig. 10: Fashionpedia Knowledge Graph: we present a subset of the Fashionpedia knowledge graph by aggregating 20 annotated products. The knowledge graph can be used as a tool for generating structural information
We also report the distribution of the number of vertices per polygon in Fig. 13. This measures the annotation effort in mask annotation. Fashionpedia has the second-widest range, next to LVIS.
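This statistic is straightforward to recover from COCO-style annotations, where each polygon is stored as a flat [x1, y1, x2, y2, …] coordinate list (a sketch with made-up coordinates, assuming the release follows the COCO segmentation format):

```python
# Hypothetical COCO-style annotation entries; each mask may consist of
# several polygons, each stored as a flat [x1, y1, x2, y2, ...] list.
annotations = [
    {"segmentation": [[10, 10, 50, 10, 50, 60, 10, 60]]},   # 4 vertices
    {"segmentation": [[0, 0, 5, 1, 9, 4, 8, 9, 3, 8],       # 5 vertices
                      [1, 1, 2, 1, 2, 2]]},                  # 3 vertices
]

vertex_counts = [
    len(polygon) // 2                     # (x, y) pairs -> vertex count
    for annotation in annotations
    for polygon in annotation["segmentation"]
]
```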
1.4 Fashionpedia Dataset Creation Details

Image collection. To avoid photo bias, all images were randomly collected from Flickr and free-license stock photo websites (Unsplash, Burst by Shopify, Free stocks, Kaboompics, and Pexels). The collected images were further verified manually by two fashion experts. Specifically, they check the scenes' diversity
Table 1: Percentage of attributes in Fashionpedia broken down by super-class. "Tex finish, manu-tech" is short for "Textile finishing, Manufacturing techniques". Summaries of "not sure" and "not on the list" responses during attribute annotation are also presented, calculated as the counts divided by the total number of masks with attributes. "Not sure" is mainly due to occlusion in the images, which causes some super-classes (such as waistline, opening type, and length) to be unidentifiable. The percentage of "not on the list" is less than 15%, which demonstrates the comprehensiveness of our Fashionpedia ontology.

Super-category          class count   not sure   not on the list
Length                  15            12.79%     0.01%
Nickname                153           9.15%      12.76%
Opening Type            10            32.69%     3.90%
Silhouettes             25            2.90%      0.27%
Tex finish, manu-tech   21            4.47%      1.34%
Textile Pattern         24            2.18%      5.30%
Non-Textile Type        14            4.90%      4.07%
Neckline                25            9.57%      3.38%
Waistline               7             30.46%     0.17%
Fig. 11: Example images and annotations from our dataset: the images are annotated with both instance segmentation masks and fine-grained attributes (black boxes).
Fig. 12: Image size comparison among Fashionpedia, ModaNet, DeepFashion2, COCO2017, and LVIS. Only training images are shown. The Fashionpedia images have the most diverse resolutions. Note that COCO2017 and LVIS used higher-resolution images for annotation; the distributions presented here are for the publicly available photos.
Fig. 13: The number of vertices per polygon. This represents the quality of the masks and the effort of the annotators. Values on the x-axis were discretized for better visual effect; the y-axis is on a log scale. Fashionpedia has the second-widest range, next to LVIS.

and make sure the clothing items were visible and annotatable in the images. The estimated image-type breakdown is as follows: street-style images (30% of the full dataset), celebrity event images (30%), runway show images (30%), and online shopping product images (10%). As for gender distribution, 80% of the images depict female subjects and 20% male subjects.
Fig. 14: Examples of different types of images (people position/gesture, full/half shot, occlusion, scenes, garment types, etc.) in the Fashionpedia dataset
We did aim to address the issue of photographic bias in the image collection process. Our dataset includes images that are not centered, are not full-shot, and contain occlusion (see examples in Fig. 14). Furthermore, our focus during the image collection process was to identify clothing items, not people.

Crowd workers and 10-day training for mask annotation. So that all annotators share the same apparel vocabulary, we prepared a detailed tutorial (with text descriptions and image examples) for each category and attribute in the Fashionpedia ontology (see Fig. 15 for an example). Before the official annotation process, we spent 10 days training the 28 crowd workers, for the following three main reasons.

First, some apparel categories are commonly referred to by other names. For example, "top" is a general term for "shirt", "sweater", "t-shirt", and "sweatshirt", so annotators could mistakenly annotate a "shirt" as a "top". We need to train the workers so that they share the same understanding of the proposed Fashionpedia ontology. Utilizing the prepared tutorials (see Fig. 15 for an example), we trained annotators to distinguish among different apparel categories.

Second, there are fine-grained differences among apparel categories. For example, we observed that some workers initially had difficulty understanding
Fig. 15: Annotation tutorial example for shirt and top

the difference between certain garment parts, such as "tassel" and "fringe". To help them understand the difference between these objects, we asked them to practice identifying more sample images before the annotation process. Fig. 16 shows our tutorials for these two categories; we specifically show some correct and incorrect annotation examples.

Third, we asked for high-quality annotations. In particular, we asked the annotators to follow the contours of garments in the images as closely as possible. The polygon annotation process was monitored and verified for a few days before the workers started the actual annotation.

Quality control of debatable apparel categories. During the annotation process, we allowed annotators to ask questions about uncertain categories. Two fashion experts monitored the annotation process by answering questions, checking annotation quality, and providing weekly feedback to the annotators.
Fig. 16: Annotation tutorial for fringe and tassel
Instead of asking annotators to rate their confidence in each segmentation mask, we asked them to send all uncertain masks back to us during annotation. The same two fashion experts made the final judgement and gave feedback to the workers on these debatable or unclear fashion categories. Some examples of debatable or fuzzy fashion items that we documented can be found in Fig. 17.
1.5 Discussion
Does this dataset include the images or labels of previous datasets? We include previous datasets only for comparison. Our dataset does not intentionally use any images or labels from previous datasets; all images and labels in Fashionpedia are newly collected and annotated.

Who were the fashion experts annotating localized attributes in the Fashionpedia dataset? The fashion experts are the 15 fashion graduate students whom we recruited from one of the top fashion design institutes. Due to the double-blind policy, we cannot name the university here, but we will release its name and the collaborators from this university in the final version of this paper.

Instance segmentation vs. semantic segmentation. We did not conduct semantic segmentation experiments on our dataset for two reasons. 1) Although semantic segmentation is a useful task, we believe instance segmentation is more meaningful for fashion images. For example, to distinguish the different shoe styles in a fashion image containing three pairs of different shoes, instance segmentation (Fig. 18(a)) can separate each shoe, whereas semantic segmentation (Fig. 18(b)) will mix all the
Fig. 17: Examples of debatable fashion items in the Fashionpedia dataset. The questions were asked by the crowd workers; the answers were provided by two fashion experts.
Fig. 18: Instance segmentation (left) and semantic segmentation (right)
shoe instances together. 2) Semantic segmentation is a sub-problem of instance segmentation: merging the detections of the same object class in our instance segmentation results yields semantic segmentation results.

Which image has the most annotated masks? In the Fashionpedia dataset, the maximum number of segmentation masks in an image is 74 (Fig. 19). Most of these masks belong to "rivet" (garment parts).

What is the difference between Fashionpedia and other fine-grained datasets like CUB-200? We propose to localize fine-grained attributes within segmentation masks of images; to the best of our knowledge, this is a novel task with real-world applications. The differences between Fashionpedia and CUB are as follows: 1) CUB uses keypoints as annotations to indicate different locations on birds, while Fashionpedia has segmentation masks of garments, garment parts, and accessories; 2) Fashionpedia attributes are associated with garment or garment part instances in images, whereas CUB provides global attributes, not associated with any specific keypoints.
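The merge from instance to semantic segmentation described above is mechanical. A minimal NumPy sketch follows; the function name, signature, and the tie-breaking rule for overlapping instances are our assumptions, not the paper's released code.

```python
import numpy as np

def instances_to_semantic(masks, class_ids, height, width):
    """Collapse per-instance binary masks into one semantic label map.

    masks:     list of (H, W) boolean arrays, one per detected instance.
    class_ids: integer class label per instance (0 is reserved for background).
    Where instances overlap, later instances overwrite earlier ones.
    """
    semantic = np.zeros((height, width), dtype=np.int32)
    for mask, cls in zip(masks, class_ids):
        semantic[mask] = cls
    return semantic

# Two separate "shoe" instances (hypothetical class id 5) collapse
# into a single "shoe" region in the semantic map.
shoe_a = np.array([[True, False], [False, False]])
shoe_b = np.array([[False, False], [False, True]])
label_map = instances_to_semantic([shoe_a, shoe_b], [5, 5], 2, 2)
# label_map == [[5, 0], [0, 5]]: both pixels carry the same class label.
```

Note how the merge discards instance identity: in the three-shoe example above, all shoe pixels would receive one label, which is exactly why instance segmentation is the more informative task for fashion images.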
iMat-Fashion Kaggle challenges. To advance the state of the art in visual analysis of clothing, we hosted two Kaggle challenges (iMaterialist-Fashion) in 2019 1 and 2020 2, respectively.
1 https://www.kaggle.com/c/imaterialist-fashion-2019-FGVC6
2 https://www.kaggle.com/c/imaterialist-fashion-2020-fgvc7
Fig. 19: The image with 74 masks in the Fashionpedia dataset