
Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset

Menglin Jia*1, Mengyun Shi*1,4, Mikhail Sirotenko*3, Yin Cui*3, Claire Cardie1, Bharath Hariharan1, Hartwig Adam3, Serge Belongie1,2

1Cornell University 2Cornell Tech 3Google Research 4Hearst Magazines

1 Supplementary Material

In our work we presented the new task of instance segmentation with attribute localization. We introduced a new ontology and dataset, Fashionpedia, to further describe the various aspects of this task. We also proposed a novel evaluation metric and Attribute-Mask R-CNN model for this task. In the supplemental material, we provide the following items that shed further insight on these contributions:

– More comprehensive experimental results (§ 1.1)
– An extended discussion of the Fashionpedia ontology and potential knowledge graph applications (§ 1.2)
– More details of the dataset analysis (§ 1.3)
– Additional information on the annotation process (§ 1.4)
– Discussion (§ 1.5)

1.1 Attribute-Mask R-CNN

Per-class evaluation. Fig. 1 presents detailed evaluation results per supercategory and per category. In Fig. 1, we follow the same metrics as the COCO leaderboard (AP, AP50, AP75, APl, APm, APs, AR@1, AR@10, AR@100, ARs@100, ARm@100, ARl@100), with the τ_IoU and τ_F1 constraints where applicable. Fig. 1 shows that metrics considering both constraints τ_IoU and τ_F1 are always lower than those using τ_IoU alone, across all supercategories and categories. This further demonstrates the challenging nature of our proposed task. In general, categories belonging to “garment parts” have lower AP and AR compared with “outerwear” and “accessories”.

A detailed breakdown of detection errors is presented in Fig. 2 for supercategories and three main categories. In terms of supercategories in Fashionpedia, “outerwear” errors are dominated by within-supercategory class confusion (Fig. 2(a)). Within this supercategory, ignoring localization errors would only raise AP slightly, from 77.5 to 79.1 (+1.6). A similar trend can be observed for the class “skirt”, which belongs to “outerwear” (Fig. 2(d)). Detection errors of “part” (Fig. 2(b), 2(e)) and “accessory” (Fig. 2(c), 2(f)), on the other hand, are dominated by both background confusion and localization errors. “part” also has a lower AP in general compared with the other two supercategories. A possible reason is that objects belonging to “part” usually have smaller sizes and lower counts.

* Equal contribution.

Fig. 1: Detailed results (for masks) using Mask R-CNN with a SpineNet-143 backbone. We present the same metrics as the COCO leaderboard for overall categories, three supercategories of apparel objects, and 46 fine-grained apparel categories. We use both constraints (for example, AP_IoU and AP_IoU+F1) where applicable. For categories without attributes, the value represents AP_IoU or AR_IoU. “top” is short for “top, t-shirt, sweatshirt”; “head acc” is short for “headband, head covering, hair accessory”
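The joint constraint described above can be sketched as follows: a detection counts as a true positive only if both the mask IoU and the attribute F1 score clear their thresholds. This is an illustrative sketch of the matching rule, not the authors' evaluation code; the helper names (`attribute_f1`, `is_true_positive`) are ours.

```python
def attribute_f1(pred_attrs, gt_attrs):
    """F1 score between predicted and ground-truth attribute sets for one mask."""
    tp = len(pred_attrs & gt_attrs)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_attrs)
    recall = tp / len(gt_attrs)
    return 2 * precision * recall / (precision + recall)

def is_true_positive(iou, pred_attrs, gt_attrs, tau_iou=0.5, tau_f1=0.5):
    """Under the IoU+F1 constraint, BOTH thresholds must be met;
    the IoU-only metric would check only the first condition."""
    return iou >= tau_iou and attribute_f1(pred_attrs, gt_attrs) >= tau_f1
```

Because the F1 condition can only remove matches that IoU alone would have accepted, AP under the joint constraint is never higher than AP under IoU alone, consistent with Fig. 1.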

[Fig. 2 precision-recall curves: (a) supercategory: outerwear; (b) supercategory: parts; (c) supercategory: accessory; (d) outerwear: skirt; (e) parts: pocket; (f) accessory: tights, stockings]

Fig. 2: Main apparel detectors analysis. Each plot shows 7 precision-recall curves, where each evaluation setting is more permissive than the previous. Specifically, C75: strict IoU (τ_IoU = 0.75); C50: PASCAL IoU (τ_IoU = 0.5); Loc: localization errors ignored (τ_IoU = 0.1); Sim: supercategory false positives (FPs) removed; Oth: category FPs removed; BG: background (and class confusion) FPs removed; FN: false negatives removed. The first row (overall-[supercategory]-[size]) contains results for the three supercategories in Fashionpedia; the second row ([supercategory]-[category]-[size]) shows results for three fine-grained categories (one per supercategory). Legends present the area under each curve (corresponding to the AP metric) in brackets
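The bracketed legend values are areas under the corresponding precision-recall curves, which correspond to AP. A minimal sketch of that computation, using the standard interpolated-precision convention (this is an illustration of the general recipe, not the authors' exact evaluation code):

```python
def average_precision(recalls, precisions):
    """Area under a precision-recall curve with interpolated precision,
    as in COCO/PASCAL-style evaluation. `recalls` must be sorted ascending;
    `precisions` is the parallel list of precision values."""
    # Make precision monotonically non-increasing, sweeping right to left.
    interp = list(precisions)
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    # Accumulate rectangle areas between consecutive recall points.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, interp):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

Progressively relaxing the error criteria (C75 → C50 → Loc → … → FN) lifts the curve, so the bracketed areas grow monotonically within each panel.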

F1 score calculation. Since we measure the F1 score between predicted and ground-truth attributes per mask, we consider two formulations: multi-label multi-class classification with 294 classes for one instance, and binary classification over 294 instances. Multi-label multi-class classification is a straightforward formulation, as it is the common setting for most fine-grained classification tasks. In the binary classification scenario, we treat the 1s and 0s of the multi-hot encodings of both predictions and ground-truth labels as the positive and negative classes, respectively. There are also two averaging choices: “micro” and “macro”. “Micro” averaging calculates the score globally by counting the total true positives, false negatives, and false positives. “Macro” averaging calculates the metric for each attribute class and reports the unweighted mean. In sum, there are four F1-score averaging options: 1) “micro”, 2) “macro”, 3) “binary-micro”, 4) “binary-macro”.

Fig. 3: AP_{IoU+F1}^{F1=τ_F1} score with different τ_F1. The values presented are averaged over τ_IoU ∈ [0.5 : 0.05 : 0.95]. We use “binary-macro” as our main metric

As shown in Fig. 3, we present AP_{IoU+F1}^{F1=τ_F1} with τ_IoU averaged over the range [0.5 : 0.05 : 0.95]. τ_F1 is increased from 0.0 to 1.0 with an increment of 0.01. Fig. 3 illustrates that as the value of τ_F1 increases, AP_{IoU+F1}^{F1=τ_F1} decreases at different rates depending on the choice of F1-score calculation. There are 294 attributes in total, and an average of 3.7 attributes per mask in the Fashionpedia training data. It is not surprising that “binary-micro” produces high F1 scores in general (higher than 0.97), as the AP_{IoU+F1} score only decreases once τ_F1 ≥ 0.97. On the other hand, “macro” averaging in the multi-label multi-class scenario gives extremely low F1 scores (0.01–0.03). This further demonstrates the room for improvement on the localized attribute classification task. We use “binary-macro” as our main metric.

Result visualization. Fig. 4 shows that our simple baseline model can detect most apparel categories correctly. However, it also sometimes produces false positives. For example, it segments legs as tights and stockings (Fig. 4(f)). A possible reason is that both objects have the same shape and stockings are worn on the legs.

Predicting fine-grained attributes, on the other hand, is a more challenging problem for the baseline model. We summarize several issues: (1) predicting more attributes than needed (Fig. 4(a), (b), (c)); (2) failing to distinguish among fine-grained attributes: for example, dropped-shoulder sleeve (ground truth) vs. set-in sleeve (predicted) (Fig. 4(e)); (3) other false positives: Fig. 4(e) has a double-breasted opening, yet the model predicted it as a zip opening. These results further show that there is room for improvement and future development of more advanced computer vision models on this instance segmentation with attribute localization task.
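The four averaging options can be sketched for one mask's multi-hot attribute vectors as follows. This is an illustrative re-implementation under our reading of the definitions (not the authors' code); in the binary variants, the 1s and 0s of the encoding are treated as the positive and negative classes.

```python
def _f1(tp, fp, fn):
    """F1 from raw counts; 0/0 is scored as 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def attribute_f1_scores(y_true, y_pred):
    """y_true, y_pred: equal-length 0/1 lists (multi-hot attribute encodings
    for one mask). Returns (micro, macro, binary_micro, binary_macro)."""
    n = len(y_true)
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = n - tp - fp - fn
    # Multi-label micro: pool TP/FP/FN over all attribute classes.
    micro = _f1(tp, fp, fn)
    # Multi-label macro: per-class F1 is 1 on a hit and 0 otherwise
    # (with 0/0 scored as 0), so the unweighted mean reduces to tp / n.
    macro = tp / n
    # Binary view: 1s are the positive class, 0s the negative class.
    f1_pos = _f1(tp, fp, fn)
    f1_neg = _f1(tn, fn, fp)  # FP/FN swap roles for the negative class
    binary_micro = _f1(tp + tn, fp + fn, fn + fp)  # equals accuracy
    binary_macro = (f1_pos + f1_neg) / 2
    return micro, macro, binary_micro, binary_macro
```

With sparse multi-hot vectors (about 3.7 positives out of 294 classes), the dominant true negatives push binary-micro close to 1 and drag multi-label macro close to 0, matching the behavior in Fig. 3.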

Fig. 4: Baseline results on the Fashionpedia validation set. Masks, bounding boxes, and apparel categories (category score > 0.6) are shown. Attributes from top 10 masks (that contain attributes) from each image are also shown. Correct predictions of objects and attributes are bolded

Result visualization on other datasets. Other datasets such as ModaNet and DeepFashion2 also contain instance segmentation masks. Aside from the results presented in the main paper (see Table 3 of the main paper) on Fashionpedia, we present a qualitative analysis of the segmentation masks generated on Fashionpedia (Fig. 4), ModaNet (Fig. 5(a-f)), and DeepFashion2 (Fig. 5(g-l)). Photos in the first row of Fig. 5 are from ModaNet. They show that the quality of the generated masks on ModaNet is fairly good and comparable to Fashionpedia in general (Fig. 5(a)). We also have a couple of observations on the failure cases: (1) failing to detect apparel objects: for example, the from Fig. 5(c) is not detected, and parts of the pants (Fig. 5(c)) and (Fig. 5(d)) are not detected; (2) failing to detect some categories: Fig. 5(e) shows that the on the shoe rack and right foot are not detected, possibly due to a lack of such instances in the ModaNet training dataset. Similar to


Fig. 5: Baseline results on ModaNet and DeepFashion2 validation set


Fig. 6: Generated masks on online-shopping images [2]. (a) and (b) show the same types of shoes in different settings. Our model correctly detects and categorizes the pair of shoes worn by a fashion model, yet mistakenly detects shoes as and a bag in (b)

Fashionpedia, ModaNet mostly consists of images. See Fig. 6(b) for example predictions from the model trained on Fashionpedia; (3) close-up images: ModaNet contains mostly full-body images. This might be the reason for the decreased quality of predicted masks on close-up shots like Fig. 5(f).

For DeepFashion2 (Fig. 5(g,h,k)), the generated segmentation masks tend not to tightly follow the contours of garments in the images. The main reason is possibly that the average number of vertices per polygon is 14.7 for DeepFashion2, which is lower than Fashionpedia and ModaNet (see Table 2 in the main text). Our qualitative analysis also shows that: (1) the model generates segmentation masks for pants (Fig. 5(i)) and tops (Fig. 5(j)) that are not visible in the images (both are covered by a jacket); we find that in DeepFashion2, some parts of garments that are covered by other objects are indeed annotated with segmentation masks; (2) the model performs better on objects that are not on a human body (Fig. 5(l)): DeepFashion2 contains many commercial-customer image pairs (images both with and without a human body) in the training dataset. In contrast, both Fashionpedia and ModaNet contain more images with a human body than without in their training datasets.

Generalization to other image domains. For Fashionpedia, we also run inference on images found on online shopping websites, which usually display a single apparel category, with or without a fashion model. We found that the learned model works reasonably well if the apparel item is worn by a model (Fig. 6).

1.2 Fashionpedia Ontology and Knowledge Graph

Fig. 7 presents our Fashionpedia ontology in detail. Figs. 8 and 9 display the training-data mask counts per category and per attribute. Utilizing the proposed ontology and the image dataset, a large-scale fashion knowledge graph can be constructed to represent the fashion world at the product level. Fig. 10 illustrates a subset of the Fashionpedia knowledge graph.

Apparel graphs. Integrating the main garments, garment parts, attributes, and relationships presented in one outfit ensemble, we can create an apparel graph representation for each outfit in an image. Each apparel graph is a structured representation of an outfit ensemble containing certain types of garments. Nodes in the graph represent the main garments, garment parts, and attributes. Main garments and garment parts are linked to their respective attributes through different types of relationships. Figure 11 shows more image examples with apparel graphs.

Fashionpedia knowledge graph. While apparel graphs are localized representations of certain outfit ensembles in fashion images, we can also create a single Fashionpedia knowledge graph (Figure 10). The Fashionpedia knowledge graph is the union of all apparel graphs and includes all main garments, garment parts, attributes, and relationships in the dataset. In this way, we are able to represent and understand fashion images in a more structured way.
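An apparel graph of this kind can be represented as a handful of subject-relation-object triples. The snippet below is a toy illustration of the structure using the two relationship types shown in Fig. 10; the product ID and labels are made up for this sketch, not drawn from the dataset.

```python
# A tiny apparel graph: CONTAIN links a product to its garments/parts,
# HAS_ATTRIBUTE links a garment or part to a fine-grained attribute.
triples = [
    ("product_1", "CONTAIN", "skirt"),
    ("skirt", "HAS_ATTRIBUTE", "mini (length)"),
    ("skirt", "HAS_ATTRIBUTE", "circle (silhouette)"),
    ("product_1", "CONTAIN", "pocket"),
    ("pocket", "HAS_ATTRIBUTE", "curved (pocket)"),
]

def neighbors(triples, node, relation):
    """All targets reachable from `node` via edges of type `relation`."""
    return [obj for subj, rel, obj in triples if subj == node and rel == relation]

# The Fashionpedia knowledge graph is then simply the union of the
# triples of all apparel graphs across the dataset.
```

For example, `neighbors(triples, "skirt", "HAS_ATTRIBUTE")` returns the skirt's localized attributes, and querying `CONTAIN` edges from a product node enumerates its garments and parts.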

(a) Categories (b) Attributes (partially)

Fig. 7: Apparel categories (a) and fine-grained attributes (b) hierarchy in Fashionpedia

Fig. 8: Mask counts per apparel category in training data. “head acc” is short for “headband, head covering, hair accessory”

We expect our Fashionpedia knowledge graph and the database to be applicable to extending existing knowledge graphs (such as WikiData [1]) with novel domain-specific knowledge, improving underlying fashion product recommendation systems, enhancing search engines' results for fashion visual search, resolving ambiguous fashion-related words for text search, and more.

1.3 Dataset Analysis

Fig. 11 shows more annotation examples, represented as exploded views of annotation diagrams. Table 1 displays details about the “not sure” and “not on the list” responses during the attribute annotation process. We present the results per super-category of attributes. The label “not sure” means the expert annotator is uncertain about the choice given the segmentation mask. “Not on the list” means the annotator is certain that the given mask presents another attribute that is not present in the Fashionpedia ontology. Other than “nickname” (the specific name for a certain apparel category), the “not on the list” responses account for less than 6% of the total masks. Figs. 12 and 13 also compare Fashionpedia and other image datasets in terms of image size and vertices per polygon.

We compare image resolutions between Fashionpedia and four other segmentation datasets (COCO and LVIS share the same images). Fig. 12 shows that images in Fashionpedia have the most diverse widths and heights, while ModaNet has the most consistent image resolutions. Note that high-resolution images will burden the data-loading process during training. With that in mind, we will release our dataset in both resized and original versions.

[Fig. 9(a): apparel attributes (nickname)]

Fig. 9: Mask counts per attribute in training data, grouped by supercategory. Best viewed digitally

[Fig. 9 panels (cont.): (a) length; (b) neckline type; (c) opening type; (d) waistline; (e) textile pattern; (f) textile finishing, manufacturing techniques; (g) non-textile material type; (h) silhouette]

Fig. 9: Mask counts per attribute in training data, grouped by supercategory (cont.). Best viewed digitally

Fig. 10: Fashionpedia Knowledge Graph: we present a subset of the Fashionpedia knowledge graph by aggregating 20 annotated products. The knowledge graph can be used as a tool for generating structural information

We also report the distribution of the number of vertices per polygon in Fig. 13. This reflects the annotation effort of mask annotation. Fashionpedia has the second-widest range, next to LVIS.
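For COCO-style polygon annotations, where each polygon is a flat [x1, y1, x2, y2, …] list, the statistic can be computed as below. This is a sketch assuming that annotation format; it is not tied to the exact layout of the released files.

```python
def vertices_per_polygon(annotations):
    """Count vertices for each polygon in COCO-style annotations, where an
    annotation's 'segmentation' is a list of flat [x1, y1, x2, y2, ...] lists."""
    counts = []
    for ann in annotations:
        for polygon in ann["segmentation"]:
            counts.append(len(polygon) // 2)  # two coordinates per vertex
    return counts

# Toy example: one mask made of a triangle and a quadrilateral.
anns = [{"segmentation": [[0, 0, 1, 0, 1, 1], [0, 0, 2, 0, 2, 2, 0, 2]]}]
print(vertices_per_polygon(anns))  # [3, 4]
```

Histogramming these counts over a dataset's training annotations yields a distribution like the one plotted in Fig. 13.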

1.4 Fashionpedia Dataset Creation Details

Image collection. To avoid photo bias, all images were randomly collected from Flickr and free-license stock photo websites (Unsplash, Burst by Shopify, Free stocks, Kaboompics, and Pexels). The collected images were further verified manually by two fashion experts. Specifically, they check the scenes' diversity

Table 1: Percentage of attributes in Fashionpedia broken down by super-class. “Tex finish, manu-tech.” is short for “Textile finishing, Manufacturing techniques”. Summaries of “not sure” and “not on the list” during attribute annotation are also presented, calculated as the counts divided by the total number of masks with attributes. “Not sure” is mainly due to occlusion in the images, which causes some super-classes (such as waistline, opening type, and length) to be unidentifiable. The percentage of “not on the list” is less than 15%. This demonstrates the comprehensiveness of our Fashionpedia ontology

Super-category          class count   not sure   not on the list
Length                  15            12.79%     0.01%
Nickname                153           9.15%      12.76%
Opening Type            10            32.69%     3.90%
Silhouettes             25            2.90%      0.27%
Tex finish, manu-tech   21            4.47%      1.34%
Textile Pattern         24            2.18%      5.30%
Non-Textile Type        14            4.90%      4.07%
Neckline                25            9.57%      3.38%
Waistline               7             30.46%     0.17%

Fig. 11: Example images and annotations from our dataset: the images are annotated with both instance segmentation masks and fine-grained attributes (black boxes)

Fig. 12: Image size comparison among Fashionpedia, ModaNet, DeepFashion2, COCO2017, and LVIS. Only training images are shown. The Fashionpedia images have the most diverse resolutions. Note that COCO2017 and LVIS have higher-resolution images for annotation; the distributions presented here are for the publicly available photos


Fig. 13: The number of vertices per polygon. This represents the quality of masks and the effort of annotators. Values on the x-axis were discretized for better visual effect; the y-axis is on a log scale. Fashionpedia has the second-widest range, next to LVIS

and make sure the clothing items were visible and annotatable in the images. The estimated image-type breakdown is as follows: street-style images (30% of the full dataset), celebrity event images (30%), runway show images (30%), and online shopping product images (10%). For gender distribution, 80% of the images are of female subjects and 20% are of male subjects.

Fig. 14: Examples of different types (people position/gesture, full/half shot, occlusion, scenes, garment types, etc.) of images in the Fashionpedia dataset

We did aim to address the issue of photographic bias during the image collection process. Our dataset includes images that are not centered, are not full shots, and contain occlusion (see examples in Fig. 14). Furthermore, our focus during image collection was to identify clothing items, not to identify people.

Crowd workers and 10-day training for mask annotation. In the spirit of sharing the same apparel vocabulary among all annotators, we prepared a detailed tutorial (with text descriptions and image examples) for each category and attribute in the Fashionpedia ontology (see Fig. 15 for an example). Before the official annotation process, we spent 10 days training the 28 crowd workers, for the following three main reasons.

First, some apparel categories are commonly referred to by other names. For example, “top” is a general term for “shirt”, “sweater”, “t-shirt”, and “sweatshirt”, and some annotators could mistakenly annotate a “shirt” as a “top”. We need to train the workers so they share the same understanding of the proposed Fashionpedia ontology. Utilizing the prepared tutorials (see Fig. 15 for an example), we trained annotators on how to distinguish among different apparel categories.

Second, there are fine-grained differences among apparel categories. For example, we observed that some workers initially had difficulty understanding

Fig. 15: Annotation tutorial example for shirt and top the difference among different garment parts, such as ‘tassel’ and ‘fringe’. To help them understand the difference of these objects, we ask them to practice and identify more sample images before the annotation process. Fig. 16 shows our tutorials for these two categories. We specifically shows some correct and wrong examples of annotations. Third, we ask for the quality of annotations. In particular, we ask the an- notators to follow the contours of garments in the images as closely as possible. The polygon annotation process is monitored and verified for a few days before the workers started the actual annotation process. Quality control of debatable apparel categories. During the annotation process, we allow annotators to ask questions about the uncertain categories. Two fashion experts monitored the annotation process by answering questions, checking the annotation quality, and providing weekly feedback to annotators. 16 M. Jia et al.



Fig. 16: Annotation tutorial for fringe and tassel

Instead of asking annotators to rate their confidence level for each segmentation mask, we asked them to send all uncertain masks back to us during annotation. The same two fashion experts made the final judgment and gave feedback to the workers on these debatable or unsure fashion categories. Some examples of debatable or fuzzy fashion items that we have documented can be found in Fig. 17.

1.5 Discussion

Does this dataset include the images or labels of previous datasets? No; we include previous datasets only for comparison. Our dataset does not intentionally use any images or labels from previous datasets. All the images and labels in Fashionpedia are newly collected and annotated.

Who were the fashion experts annotating localized attributes in the Fashionpedia dataset? The fashion experts are the 15 fashion graduate students that we recruited from one of the top institutes. Due to the double-blind policy, we cannot name the university, but we will release its name and that of our collaborators there in the final version of this paper.

Instance segmentation vs. semantic segmentation. We did not conduct semantic segmentation experiments on our dataset for the following two reasons. 1) Although semantic segmentation is a useful task, we believe instance segmentation is more meaningful for fashion images. For example, to distinguish the shoe styles in a fashion image containing three pairs of different shoes, instance segmentation (Fig. 18(a)) can help us distinguish each shoe separately. However, semantic segmentation (Fig. 18(b)) will mix all the

Fig. 17: Examples of debatable fashion items in the Fashionpedia dataset. The questions were asked by the crowd workers; the answers were provided by two fashion experts

Fig. 18: Instance segmentation (left) and semantic segmentation (right)

shoe instances together. 2) Semantic segmentation is a sub-problem of instance segmentation: merging the detected objects of the same class in our instance segmentation results yields semantic segmentation results.

Which image has the most annotated masks? In the Fashionpedia dataset, the maximum number of segmentation masks in an image is 74 (Fig. 19). Most of these masks belong to "rivets" (garment parts).

What is the difference between Fashionpedia and other fine-grained datasets like CUB-200? We propose to localize fine-grained attributes within segmentation masks of images; to the best of our knowledge, this is a novel task with real-world applications. The differences between Fashionpedia and CUB are as follows: 1) CUB uses keypoint annotations to indicate different locations on birds, while Fashionpedia has segmentation masks of garments, garment parts, and accessories; 2) Fashionpedia attributes are associated with garment or garment-part instances in images, whereas CUB provides global attributes not associated with any specific keypoints.
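The class-merging argument above can be sketched in a few lines of NumPy: per-instance binary masks are painted onto a single label map, so instances of the same class collapse into one semantic region. This is an illustrative sketch, not code from our released pipeline; the function name and the convention that label 0 is background are our assumptions.

```python
import numpy as np

def instances_to_semantic(instance_masks, class_ids, height, width):
    """Collapse per-instance binary masks into one semantic label map.

    instance_masks: list of (H, W) boolean arrays, one per instance.
    class_ids: integer class label per instance (0 reserved for background).
    Where masks overlap, later instances overwrite earlier ones.
    """
    semantic = np.zeros((height, width), dtype=np.int32)
    for mask, cls in zip(instance_masks, class_ids):
        semantic[mask] = cls
    return semantic
```

For instance, two separate shoe instances with the same class id become a single "shoe" region in the output, which is exactly the information loss illustrated in Fig. 18(b).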




iMat-Fashion Kaggle challenges. To advance the state of the art in visual analysis of clothing, we hosted two Kaggle challenges (iMaterialist-Fashion) in 2019¹ and 2020², respectively.

1 https://www.kaggle.com/c/imaterialist-fashion-2019-FGVC6
2 https://www.kaggle.com/c/imaterialist-fashion-2020-fgvc7




Fig. 19: The image with 74 masks in the Fashionpedia dataset
