A Machine Learning Approach for Product Matching and Categorization Use Case: Enriching Product Ads with Semantic Structured Data

Undefined 0 (2016) 1–0
IOS Press

Petar Ristoski a, Petar Petrovski a, Peter Mika b, Heiko Paulheim a

a Data and Web Science Group, University of Mannheim, B6, 26, 68159 Mannheim
E-mail: [email protected],[email protected], [email protected]
b Yahoo Labs, London, UK
E-mail: [email protected]

Abstract. Consumers today have the option to purchase products from thousands of e-shops. However, the completeness of the product specifications and the taxonomies used for organizing the products differ across e-shops. To improve the consumer experience, e.g., by allowing for easy comparison of offers by different vendors, approaches for product integration on the Web are needed. In this paper, we present an approach that leverages neural language models and deep learning techniques in combination with standard classification approaches for product matching and categorization. In our approach we use structured product data as supervision for training feature extraction models able to extract attribute-value pairs from textual product descriptions. To minimize the need for large amounts of supervision data, we use neural language models to produce word embeddings from large quantities of publicly available product data marked up with Microdata, which boost the performance of the feature extraction model, thus leading to better product matching and categorization performance. Furthermore, we use a deep Convolutional Neural Network to produce image embeddings from product images, which further improve the results on both tasks.

Keywords: Product Data, Data Integration, Vector Space Embeddings, Deep Learning, Microdata, schema.org

1. Introduction

In recent years, with the advancements of the Internet and e-business services, the amount of products sold through e-shops has grown rapidly. A recent study estimates that the total e-commerce retail sales for the fourth quarter of 2015 in the USA alone were 89.1 billion dollars [5]. Yet, there still is one big issue in the process of product search and purchase that consumers have to deal with. The same product may be found on many different e-shops, but the information about the product offers differs greatly across e-shops. Furthermore, there are no global identifiers for products, and offers are most often not interlinked with each other. Therefore, there is no simple way for consumers to find all the necessary information and the best prices for the products they search for. To offer a better user experience, there are many product aggregator sites, like Google Product Search¹, PriceGrabber², and Shopzilla³, trying to integrate and categorize products from many e-shops and many different merchants. However, the task of identifying offers of the same product across thousands of e-shops and integrating the information into a single representation is not trivial. Furthermore, to allow users better navigation and product search, product aggregator sites have to deal with the task of organizing all the products according to a product taxonomy.

¹ http://www.google.com/shopping
² http://www.pricegrabber.com/
³ http://www.shopzilla.com/

0000-0000/16/$00.00 © 2016 – IOS Press and the authors. All rights reserved

To support the integration process, e-shops are increasingly adopting semantic markup languages such as Microformats, RDFa, and Microdata to annotate their content, making large amounts of product description data publicly available. In this paper, we present an approach that leverages neural language models and deep learning techniques in combination with standard classification approaches for product matching and categorization from HTML annotations. We focus on data annotated with the Microdata markup format using the schema.org vocabulary. Recent works [26,27] have shown that the Microdata format is the most commonly used markup format, with the highest domain and entity coverage. Also, schema.org is the most frequently used vocabulary to describe products. Although these annotations are considered structured, empirical studies [26,33] have shown that the vast majority of products are annotated only with a name and a textual description, i.e., they are unstructured rather than structured data. This helps to identify the relevant parts on the Website, but leads to rather shallow textual data, which has to be tackled with methods for unstructured data.

In a previous work [36], we have proposed an approach for enriching structured product ads with data extracted from HTML pages that contain semantic annotations. The approach is able to identify matching products in unstructured product descriptions using the database of structured product ads as supervision. We identified the Microdata dataset as a valuable source for enriching existing structured product ads with new attributes.

In this paper, we enhance the existing approach using neural language models and deep learning techniques. We inspect the use of different models for three tasks: (1) matching products described with structured annotations from the Web, (2) enriching an existing product database with product data from the Web, and (3) categorizing products. For those tasks, we employ Conditional Random Fields for extracting product attributes from textual descriptions, as well as Convolutional Neural Networks to produce embeddings of product images. We show that neural word embeddings outperform baseline approaches for product matching and produce results comparable to those of supervised approaches for product categorization. Furthermore, we show that image embeddings can only provide a weak signal for product matching, but a strong signal for product classification.

The rest of this paper is structured as follows. In section 2, we give an overview of related work. In section 3 and section 4, we introduce our methodology for product matching and categorization, respectively. In section 5, we present the results of matching unstructured product descriptions, followed by the evaluation of the product ads enrichment with metadata extracted from HTML annotations in section 7. In section 8 we present the results of the product categorization approach. We conclude with a summary and an outlook on future work.

2. Related Work

Both product matching and product categorization for the Web have been explored with various approaches and methods in recent years.

2.1. Product matching

Since there are no global identifiers for products, and links between different e-commerce Web pages are also scarce, finding out whether two offers on different Web pages refer to the same product is a non-trivial task. Therefore, product matching deals with identifying pairs or sets of identical products.

Ghani et al. [7] were the first to present the enrichment of product databases with attribute-value pairs extracted from product descriptions on the Web, using Naive Bayes in combination with a semi-supervised co-EM algorithm. An evaluation on apparel products shows promising results; however, the system is able to extract attribute-value pairs only if both the attribute name (e.g., “color”) and the attribute value (e.g., “black”) appear in the text.

The XploreProducts.com platform detailed in [39] integrates products from different e-shops annotated using RDFa annotations. The approach is based on several string similarity functions for product matching. The approach is extended by using a hybrid similarity method and hierarchical clustering for matching products from multiple e-shops [2].

Kannan et al. [12] use the Bing products catalog to build a dictionary-based feature extraction model. Later, the features of the products are used to train a Logistic Regression model for matching product offers to the Bing shopping data. Another machine learning approach for matching product data is proposed in [15]. First, several features are extracted from the title and the description of the products using manually written regular expressions. In contrast, feature extraction models based on named entity recognition are developed in [24] and [36]. Both approaches use a CRF model for feature extraction; however, [24] has a limited ability to extract explicit attribute-value pairs, which is improved upon in [36]. Similarly, in this paper we enhance the CRF model with continuous features, thus boosting the CRF’s ability to recognize a larger number of attribute-value pairs.

The first approach to perform product matching on Microdata annotations is presented in [33], based on the Silk rule learning framework [11]. To do so, different combinations of features (e.g., bag of words, dictionary-based, regular expressions, etc.) from the product descriptions are used. The work has been extended in [34], where the authors developed a genetic algorithm for learning regular expressions for extracting attribute-value pairs from products.

The authors of [31] perform product matching on a [...] The approach is evaluated on the Abt-Buy dataset⁴ and a small custom dataset.

While the studies discussed above have implemented diverse approaches (classifiers, genetic programming, string similarities), the feature extraction techniques used are mostly dependent on a supervised dictionary-based approach.

To reduce the labeling effort required by supervised approaches, in this paper we rely on neural word embeddings and CNNs. A full review of deep learning and neural networks in general is beyond the scope of this paper; please see [37]. Instead, we provide a review of the most relevant work for product feature extraction that uses neural word embeddings and deep learning.

Recently, a handful of approaches have employed word embeddings for obtaining features from product data for the problem of matching, as well as for other problems concerning product data.
