Web-Scale Responsive Visual Search at Bing
Total Page:16
File Type:pdf, Size:1020Kb
Web-Scale Responsive Visual Search at Bing Houdong Hu, Yan Wang, Linjun Yang, Pavel Komlev, Li Huang, Xi (Stephen) Chen, Jiapei Huang, Ye Wu, Meenaz Merchant, Arun Sacheti Microsoft Redmond, Washington {houhu,wanyan,linjuny,pkomlev,huangli,chnxi,jiaphuan,wuye,meemerc,aruns}@microsoft.com ABSTRACT build a relevant, responsive, and scalable web-scale visual search In this paper, we introduce a web-scale general visual search system engine. To the best of the authors’ knowledge, this is the first work deployed in Microsoft Bing. The system accommodates tens of introducing a general web-scale visual search engine. billions of images in the index, with thousands of features for each Web-scale visual search engine is more than extending existing image, and can respond in less than 200 ms. In order to overcome visual search approaches to a larger database. Here web-scale means the challenges in relevance, latency, and scalability in such large the database is not restricted to a certain vertical or a certain web- scale of data, we employ a cascaded learning-to-rank framework site, but from a spider of a general web search engine. Usually the based on various latest deep learning visual features, and deploy in database contains tens of billions of images, if not more. With this a distributed heterogeneous computing platform. Quantitative and amount of data, the major challenges come from three aspects. First, qualitative experiments show that our system is able to support a lot of approaches that may work on a smaller dataset become im- various applications on Bing website and apps. practical. For example, Bag of Visual Words [17] generally requires the inverted index to be stored in memory for efficient retrieval. KEYWORDS Assuming each image has merely 100 feature points, the inverted index will have a size of about 4 TB, letting alone the challenges in Content-based Image Retrieval, Image Understanding, Deep Learn- doing effective clustering and quantization to produce reasonable ing, Object Detection visual words. Second, even with proper sharding, storage scalability ACM Reference Format: is still a problem. Assuming we only use one single visual feature, Houdong Hu, Yan Wang, Linjun Yang, Pavel Komlev, Li Huang, Xi (Stephen) the 4096-dimension AlexNet [11] fc7 feature, and shard the feature Chen, Jiapei Huang, Ye Wu, Meenaz Merchant, Arun Sacheti. 2018. Web- storage to 100 machines, each machine still needs to store 1:6 TB Scale Responsive Visual Search at Bing. In Proceedings of (KDD). ACM, New York, NY, USA, Article 4, 9 pages. https://doi.org/10.475/123_4 of features. Note these features cannot be stored in regular hard disks, otherwise the low random access performance will make the 1 INTRODUCTION latency unacceptable. Third, modern visual search engines usually use a learning-to-rank architecture [3, 10] to utilize complimentary Visual search, or Content-based Image Retrieval, is a popular and information from multiple features to obtain the best relevance. In long-standing research area [1, 12, 18, 23, 25]. Given an image, a web-scale database, this posts another challenge on latency. Even a visual search system returns a ranked list of visually similar the retrieval of the images can be parallelized, the query feature images. It associates a query image with all known information of extraction, data flow control among sharded instances, and final the returned images, and thus can derive various applications, for data aggregation all require both sophisticated algorithms and en- example, locating where a photo was taken [5], and recognizing gineering optimizations. In addition to the aforementioned three fashion items from a selfie [14]. Therefore, it is also of great interest difficulties from the index size, there is another unique challenge in the industry. for general search engine. The vertical-specific search engine usu- Relevance is the main objective and metric of visual search. With ally has a controlled image database with well organized metadata. the recent development of deep learning [20], visual search systems However, this is not the case for general visual search engine, where have also got a boost on relevance, and have become more read- the metadata is often unorganized, if not unavailable. And this puts arXiv:1802.04914v2 [cs.CV] 20 Feb 2018 ily available for general consumers. There has been exploration more emphasis on the capability to understand the content of an on visual search systems from industry players [23, 25], but the image. works focused more on the feasibility of vertical-specific systems, In other words, it is very challenging to achieve high relevance, e.g. images on Pinterest or eBay, and lacked discussions on more low latency, and high storage scalability at the same time in a advanced targets such as relevance, latency, and storage. In this web-scale visual search system. We propose to solve the dilemma paper, we would like to provide an overview of the visual search with smart engineering trade-offs. Specifically, we propose to use system in Microsoft Bing, hoping to provide insights on how to a cascaded learning-to-rank framework to trade off relevance and Permission to make digital or hard copies of part or all of this work for personal or latency, employ Product Quantization (PQ) [4] to trade off between classroom use is granted without fee provided that copies are not made or distributed relevance and storage, and use distributed Solid State Drive (SSD) for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. equipped clusters to trade off between latency and storage scalabil- For all other uses, contact the owner/author(s). ity. In the cascaded learning-to-rank framework, a sharded inverted KDD, 2018, London, United Kingdom index is first used to efficiently filter out “hopeless” database images, © 2018 Copyright held by the owner/author(s). ACM ISBN 123-4567-24-567/08/06...$15.00 and generate a list of candidates. And then more descriptive and https://doi.org/10.475/123_4 also more expensive visual features are used to further rerank the KDD, 2018, London, United Kingdom H. Hu et al. Figure 1: Example user interfaces of the visual search system at Bing. The left figure shows the desktop experience, where detected objects are shown as hotspots overlaid on the images. Users are able to click a spot or specify their own crop box to get the visually similar products or images. The right figure shows the mobile experience, where related products and images are shown for a query image of sunglasses. Figure 2: Workflow overview of the web-scale visual search system in Bing. The query image is first processed tobetrans- formed into a feature vector, and then goes through a three-level cascaded ranker framework. The result image list is returned to the user after postprocessing. More details are available in Section 2. candidates, the top of which is then passed to the final level ranker. we will introduce the details together with engineering implementa- In the final level ranker, full-fledged visual features are retrieved tion of the system in Section 3, followed by applications introduced and fed into a Lambda-mart ranking model [2] to obtain similarity in Section 4. Section 5 will provide quantitative and qualitative score between the query and each candidate image, based on which results of the proposed system in terms of relevance, latency and the ranked list is produced. storage scalability, with conclusions in Section 6. The remaining part of the paper is organized as follows. In Sec- tion 2, we will give a workflow overview of the entire system. Then Web-Scale Responsive Visual Search at Bing KDD, 2018, London, United Kingdom Figure 3: The application scenarios of the DNN models used in the proposed system. 2 SYSTEM OVERVIEW remove duplicates and adult contents as needed. This final result Before diving into how the system is built, let us first introduce the set is then returned to the user. workflow. When a user submits a query image that he/she finds Model training: Multiple models used in the retrieval process on the Web or takes by a camera in Bing Visual Search, visually require a training stage. First, several DNN models are leveraged in similar images and products will be identified for the user to explore our system to improve the relevance. Each DNN model individually or purchase (examples are shown in Figure 1). Bing Visual Search provides complementary information due to different training data, system comprises three major stages/components as summarized network structures and loss functions. Second, a joint k-Means below, and Figure 2 illustrates the general processing workflow of algorithm [22] is utilized to build the inverted index in Level-0 how a query image turns to the final result images. matching. Third, PQ is employed to improve the DNN model serving Query understanding: We extract a variety of features from a latency without too much relevance compromise. We also take query image to describe its content, including deep neural network advantage of object detection models to improve user experiences. (DNN) encoders, category recognition features, face recognition Details of these models are introduced in the following sections. features, color features and duplicate detection features. We also generate an image caption that can identify the key concept in the 3 APPROACH query image. Scenario triggering model is then called to determine In this section, we will cover the details of how we handle relevance, whether to invoke different scenarios in visual search. For instance, latency and storage scalability, including some extra features such when a shopping intent is detected from the query, searches are as object detection.