Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Towards Better Understanding the Clothing Fashion Styles: A Multimodal Deep Learning Approach

Yihui Ma,1 Jia Jia,1* Suping Zhou,3 Jingtian Fu,1,2 Yejun Liu,1,2 Zijian Tong4
1Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Key Laboratory of Pervasive Computing, Ministry of Education
Tsinghua National Laboratory for Information Science and Technology (TNList)
2Academy of Arts & Design, Tsinghua University, Beijing, China
3Beijing University of Posts and Telecommunications, Beijing, China
4Sogou Corporation, Beijing, China
[email protected]

Abstract

In this paper, we aim to better understand clothing fashion styles. Two challenges remain: 1) how to quantitatively describe the fashion styles of various clothing, and 2) how to model the subtle relationship between visual features and fashion styles, especially considering clothing collocations. Using the words that people commonly use to describe clothing fashion styles on shopping websites, we build a Fashion Semantic Space (FSS) based on Kobayashi's aesthetics theory to describe clothing fashion styles quantitatively and universally. We then propose a novel fashion-oriented multimodal deep learning model, the Bimodal Correlative Deep Autoencoder (BCDA), to capture the internal correlation in clothing collocations. Employing the benchmark dataset we build with 32133 full-body fashion show images, we use BCDA to map the visual features to the FSS. The experimental results indicate that our model outperforms several alternative baselines (+13% in terms of MSE), confirming that it can better understand clothing fashion styles. To further demonstrate the advantages of our model, we conduct several case studies, including fashion trend analyses of brands and clothing collocation recommendation.

1 Introduction

What are the most popular clothing fashion styles of this season? As reported by Vogue,[1] romantic, elegant, and classic were the top trends during the Fall 2016 Couture collections. Exploring the fashion runway images, these styles rely heavily on specific visual details, such as a nipped waist or lapel collar matched with a high-waistline dress or pencil trousers. Since clothing fashion styles depend so heavily on visual details, can we bridge the gap between them automatically?

Many efforts have been made towards this goal. For example, (Wang, Zhao, and Yin 2014) presents a method to parse refined texture attributes of clothing, while (Yang, Luo, and Lin 2015) builds an integrated application system to parse a set of clothing images jointly. Besides, people have tried to analyse visual features by adding occasion and scenario elements: (Liu et al. 2012) considers occasions in dressing and focuses on scenario-oriented clothing recommendation. Although a recent work (Jia et al. 2016) proposes to appreciate the aesthetic effects of upper-body menswear, it still lacks universality and ignores that the collocation of top and bottom has a significant impact on fashion styles. Thus, two challenges still remain: 1) how to quantitatively describe the fashion styles of various clothing, and 2) how to model the subtle relationship between visual features and fashion styles, especially considering clothing collocations.

In this paper, we aim to better understand clothing fashion styles and propose solutions from two aspects. First, we build a Fashion Semantic Space (FSS) based on the Image-Scale Space proposed in the aesthetics area by (Kobayashi 1995). By computing semantic distances using WordNet::Similarity (Pedersen, Patwardhan, and Michelizzi 2004), we coordinate the 527 aesthetic words most often used in the clothing section of Amazon onto the FSS. Second, we propose a fashion-oriented multimodal deep learning model, the Bimodal Correlative Deep Autoencoder (BCDA), to capture the correlation between visual features and fashion styles by utilizing the intrinsic matching rules of tops and bottoms. Specifically, we regard the tops and bottoms as two modalities of clothing collocation, and leverage the shared representation of multimodal deep learning to learn the relationship between the modalities. In addition, we improve the feature learning process by taking the clothing categories (e.g. suit, coat, leggings) as correlative labels. Connecting BCDA to a regression model, we finally map the visual features to the FSS. Employing 32133 full-body fashion images downloaded from fashion show websites as our experimental data, we conduct several experiments to evaluate the mapping between visual features and coordinate values in the FSS. The results indicate that the proposed BCDA model outperforms several alternative baselines (+13% in terms of MSE). We also present some interesting cases to demonstrate the advantages of our model. The illustration of our work is shown in Figure 1; a schematic sketch of the bimodal architecture is given below.

*Corresponding author: J. Jia ([email protected])
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
[1] http://www.vogue.com/fashion-shows

Figure 1: The workflow of our framework.
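To make the bimodal idea concrete, here is a minimal PyTorch sketch of an autoencoder with two modality-specific encoders and a shared representation, in the spirit of BCDA. It is not the authors' exact model: the `BimodalAutoencoder` name, layer sizes, and activations are illustrative assumptions, and the correlative category labels are only noted in comments.

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Illustrative sketch (not the paper's exact BCDA): tops and
    bottoms are encoded separately, fused into a shared code, decoded
    back, and the shared code is regressed to 2-D FSS coordinates."""

    def __init__(self, n_top, n_bottom, hidden=256, shared=128):
        super().__init__()
        self.enc_top = nn.Sequential(nn.Linear(n_top, hidden), nn.ReLU())
        self.enc_bot = nn.Sequential(nn.Linear(n_bottom, hidden), nn.ReLU())
        # The shared layer couples the two modalities, mirroring the
        # multimodal shared-representation strategy described above.
        self.fuse = nn.Sequential(nn.Linear(2 * hidden, shared), nn.ReLU())
        self.dec_top = nn.Linear(shared, n_top)
        self.dec_bot = nn.Linear(shared, n_bottom)
        self.to_fss = nn.Linear(shared, 2)  # regression to FSS (x, y)

    def forward(self, x_top, x_bot):
        h = self.fuse(torch.cat([self.enc_top(x_top),
                                 self.enc_bot(x_bot)], dim=1))
        return self.dec_top(h), self.dec_bot(h), self.to_fss(h)

# Usage sketch: combine reconstruction and regression losses. BCDA
# additionally uses clothing-category labels as correlative
# supervision, which is omitted in this simplified version.
model = BimodalAutoencoder(n_top=50, n_bottom=40)
x_t, x_b = torch.rand(8, 50), torch.rand(8, 40)
rec_t, rec_b, fss = model(x_t, x_b)
loss = (nn.functional.mse_loss(rec_t, x_t)
        + nn.functional.mse_loss(rec_b, x_b))
```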
We summarize our contributions as follows:

• We construct a benchmark clothing fashion dataset containing 32133 full-body fashion show images from Vogue over the last 10 years. The collected dataset is labeled with complete visual features (e.g. collar shape, pants length, color theme) and fashion styles (e.g. casual, chic, elegant). We are willing to make our dataset open to facilitate other people's research on clothing fashion.[2]

• We build a universal Fashion Semantic Space (FSS) to describe clothing fashion styles quantitatively. It is a two-dimensional image-scale space containing hundreds of words that people often use to describe clothing on shopping websites. Based on the FSS, we can not only carry out quantitative evaluation of fashion collocations, but also analyze the dynamic change of fashion trends intuitively.

• We propose a fashion-oriented multimodal deep learning model, the Bimodal Correlative Deep Autoencoder (BCDA), connected with regression to implement the task of mapping visual features to the FSS. Specifically, leveraging the shared representation learned by the multimodal strategy, BCDA makes full use of the internal correlation between tops and bottoms and resolves the issue of clothing collocation.

The rest of the paper is organized as follows. Section 2 lists related works. Section 3 formulates the problem. Section 4 presents the methodologies. Section 5 introduces the experimental dataset, results, and case studies. Section 6 concludes.

[2] https://pan.baidu.com/s/1boPm2OB

2 Related Works

Clothing parsing. Clothing parsing is a popular research topic in the field of computer vision. (Wang, Zhao, and Yin 2014) parses refined clothing texture by exploiting the discriminative meanings of sparse codes. (Yamaguchi et al. 2015) tackles the clothing parsing problem using a retrieval-based approach. Benefiting from deep learning, the performance of clothing parsing has improved in recent years; for example, (Wang, Li, and Liu 2016) uses Fast R-CNN to detect human bodies and clothing items more effectively.

Clothing recommendation. As people pay more attention to clothing fashion, clothing recommendation has become a hot topic. (Liu et al. 2012) considers occasions in dressing and focuses on scenario-oriented clothing recommendation. (Jagadeesh et al. 2014) proposes a data-driven model that exploits large collections of online clothing images to build a recommendation system. (Hu, Yi, and Davis 2015) proposes a functional tensor factorization to model the relationship between users and clothing. However, since people select clothing by words such as "romantic" or "elegant" rather than by visual details, how to bridge the gap between visual features and fashion styles is an issue yet to be resolved.

Fashion style modeling. A recent work (Jia et al. 2016) proposes to appreciate the aesthetic effects of upper-body menswear. However, clothing variety and fashion collocation are significant elements of clothing fashion that we cannot ignore. How to universally describe fashion styles over clothing collocations is still an open problem.

3 Problem Formulation

Given a set of fashion images $V$, for each image $v_i \in V$ we use an $N_{x^t}$-dimensional vector $x_i^t = (x_{i1}^t, x_{i2}^t, \ldots, x_{iN_{x^t}}^t)$ $(\forall x_{ij}^t \in [0,1])$ to indicate $v_i$'s top (upper-body) visual features, an $N_{x^b}$-dimensional vector $x_i^b = (x_{i1}^b, x_{i2}^b, \ldots, x_{iN_{x^b}}^b)$ $(\forall x_{ij}^b \in [0,1])$ to indicate $v_i$'s bottom (lower-body) visual features, an $N_{c^t}$-dimensional vector $c_i^t = (c_{i1}^t, c_{i2}^t, \ldots, c_{iN_{c^t}}^t)$ $(\forall c_{ij}^t \in [0,1])$ to indicate $v_i$'s top clothing categories, and an $N_{c^b}$-dimensional vector $c_i^b = (c_{i1}^b, c_{i2}^b, \ldots, c_{iN_{c^b}}^b)$ $(\forall c_{ij}^b \in [0,1])$ to indicate $v_i$'s bottom clothing categories. In addition, $X^t$ is defined as a $|V| \times N_{x^t}$ feature matrix with each element $x_{ij}^t$ denoting the $j$th top visual feature of $v_i$. The definitions of $X^b$, $C^t$, and $C^b$ are analogous to $X^t$. A small sketch of these data structures is given below.
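For concreteness, this is a minimal NumPy sketch of the matrices defined above. The feature and category dimensionalities are hypothetical placeholders (the paper does not list them here), and the one-hot category encoding is one plausible reading of the $[0,1]$-valued category vectors.

```python
import numpy as np

n_images = 32133        # |V|: full-body fashion show images
N_xt, N_xb = 50, 40     # hypothetical top/bottom visual-feature dims
N_ct, N_cb = 12, 10     # hypothetical top/bottom category dims

# Row i holds image v_i; all entries lie in [0, 1] as required above.
X_t = np.random.rand(n_images, N_xt)   # top (upper-body) visual features
X_b = np.random.rand(n_images, N_xb)   # bottom (lower-body) visual features

# Category vectors as one-hot indicators (e.g. suit, coat, leggings).
C_t = np.zeros((n_images, N_ct))
C_t[np.arange(n_images), np.random.randint(N_ct, size=n_images)] = 1.0
C_b = np.zeros((n_images, N_cb))
C_b[np.arange(n_images), np.random.randint(N_cb, size=n_images)] = 1.0
```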
Figure 2: The fashion semantic space.
Figure 3: The structure of BCDA.

Fashion semantic space. The FSS is a two-dimensional space built on the image-scale space (Kobayashi 1995). In order to describe the fashion styles of various clothing, we first collect all the comments from the last three years in the clothing section of Amazon and split them into words. Then, using WordNet (Miller 1995), only adjectives are retained. Next, we manually remove those not often used to describe clothing, like "happy" or "sad", obtaining 527 aesthetic words representing fashion styles. To determine the coordinates of these words, we calculate the semantic distances between keywords and aesthetic words using WordNet::Similarity (Pedersen, Patwardhan, and Michelizzi 2004). For each word to be coordinated, we choose the three keywords with the shortest distances and take their weighted arithmetic mean as the word's coordinate value. In this way, we build the fashion semantic space; a sketch of this coordination step follows below.
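The coordination step can be sketched as follows. This is a minimal, assumption-laden illustration: the keyword coordinates below are invented placeholders (the real FSS places Kobayashi's keywords on the warm-cool and soft-hard axes), NLTK's path similarity stands in for the Perl WordNet::Similarity package used in the paper, and the inverse-distance weights are one plausible reading of the unspecified "weighted arithmetic mean".

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def semantic_distance(w1, w2):
    # Proxy for WordNet::Similarity: best path similarity over all
    # synset pairs, converted to a distance in [0, 1].
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return 1.0 - best

# Hypothetical coordinates for a few image-scale keywords.
KEYWORD_COORDS = {
    "romantic": (-0.5, -0.6),
    "elegant": (0.2, -0.4),
    "casual": (-0.7, 0.3),
    "modern": (0.6, 0.5),
}

def coordinate(word, k=3):
    # Pick the k keywords semantically closest to `word` and return
    # the distance-weighted arithmetic mean of their coordinates.
    dists = {kw: semantic_distance(word, kw) for kw in KEYWORD_COORDS}
    nearest = sorted(dists, key=dists.get)[:k]
    weights = [1.0 / (dists[kw] + 1e-6) for kw in nearest]
    total = sum(weights)
    x = sum(w * KEYWORD_COORDS[kw][0] for w, kw in zip(weights, nearest)) / total
    y = sum(w * KEYWORD_COORDS[kw][1] for w, kw in zip(weights, nearest)) / total
    return x, y

print(coordinate("chic"))  # maps an aesthetic word into the FSS
```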