AnalyticDB-V: A Hybrid Analytical Engine Towards Query Fusion for Structured and Unstructured Data

Chuangxian Wei, Bin Wu, Sheng Wang, Renjie Lou, Chaoqun Zhan, Feifei Li, Yuanzhe Cai
Alibaba Group
{chuangxian.wcx, binwu.wb, sh.wang, json.lrj, lizhe.zcq, lifeifei, yuanzhe.cyz}@alibaba-inc.com

ABSTRACT

With the explosive growth of unstructured data (such as images, videos, and audio), unstructured data analytics is widespread in a rich vein of real-world applications. Many database systems have started to incorporate unstructured data analysis to meet such demands. However, queries over unstructured and structured data are often treated as disjoint tasks in most systems, where hybrid queries (i.e., involving both data types) are not yet fully supported.

In this paper, we present a hybrid analytic engine developed at Alibaba, named AnalyticDB-V (ADBV), to fulfill such emerging demands. ADBV offers an interface that enables users to express hybrid queries using SQL semantics by converting unstructured data to high-dimensional vectors. ADBV adopts the lambda framework and leverages the merits of approximate nearest neighbor search (ANNS) techniques to support hybrid data analytics. Moreover, a novel ANNS algorithm is proposed to improve the accuracy on large-scale vectors representing massive unstructured data. All ANNS algorithms are implemented as physical operators in ADBV; meanwhile, accuracy-aware cost-based optimization techniques are proposed to identify effective execution plans. Experimental results on both public and in-house datasets show the superior performance achieved by ADBV and its effectiveness. ADBV has been successfully deployed on Alibaba Cloud to provide hybrid query processing services for various real-world applications.

PVLDB Reference Format:
Chuangxian Wei, Bin Wu, Sheng Wang, Renjie Lou, Chaoqun Zhan, Feifei Li, Yuanzhe Cai. AnalyticDB-V: A Hybrid Analytical Engine Towards Query Fusion for Structured and Unstructured Data. PVLDB, 13(12): 3152-3165, 2020.
DOI: https://doi.org/10.14778/3415478.3415541

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 13, No. 12
ISSN 2150-8097.

1. INTRODUCTION

Massive amounts of unstructured data, such as images, videos, and audio, are generated each day due to the prevalence of smartphones, surveillance devices, and social media apps. For example, during the 2019 Singles' Day Global Shopping Festival, up to 500 PB of unstructured data were ingested into the core storage system at Alibaba. To facilitate analytics on unstructured data, content-based retrieval systems [45] are usually leveraged. In these systems, each piece of unstructured data (e.g., an image) is first converted into a high-dimensional feature vector, and subsequent retrievals are conducted on these vectors. Such vector retrievals are widespread in various domains, such as face recognition [47, 18], person/vehicle re-identification [56, 32], recommendation [49], and voiceprint recognition [42]. At Alibaba, we also adopt this approach in our production systems.

Although content-based retrieval systems support unstructured data analytics, there are many scenarios where both unstructured and structured data must be jointly queried (we call such queries hybrid queries), for two main reasons. First, a query over unstructured data alone may be inadequate to describe the desired objects, and a hybrid query improves its expressiveness. For instance, on an e-commerce platform like Taobao, a potential customer may search for a dress with conditions on price (less than $100), shipment (free shipping), rating (over 4.5), and style (visually similar to a dress worn by a movie star). Second, the accuracy of state-of-the-art feature vector extraction algorithms is far from satisfactory, especially on large datasets, and a hybrid query helps to improve the accuracy. For example, the false-negative rate of face recognition increases by 40 times when the number of images scales from 0.64 million to 12 million [14]. Therefore, imposing constraints on structured attributes (such as gender, age, image capturing locale, and timestamp in this context) can narrow down the vector search space and effectively improve the accuracy. In summary, the hybrid query is of great value to a vast number of emerging applications.

However, most existing systems do not provide native support for hybrid queries. Developers have to rely on two separate engines to conduct hybrid query processing: a vector similarity search engine [25, 8, 54] for unstructured data and a database system for structured data. This practice has inherent limitations. First, we have to implement extra logic and a post-processing step atop the two systems to ensure data consistency and query correctness. Second, hybrid queries cannot be jointly optimized, as sub-queries are executed on the two engines independently.

To address this challenge, we design and implement a new analytical engine, called AnalyticDB-V (ADBV), inside the OLAP system AnalyticDB (ADB) [53] at Alibaba Cloud,
that manages massive feature vectors and structured data and natively supports hybrid queries.

[Figure 1: Hybrid query example. Alice prefers red clothes that look like those in a query image. An image retrieval system queries with the image only (feature extraction, then ORDER BY DISTANCE(clothes.feature, query_feature)); the hybrid analytical engine AnalyticDB-V (ADBV) combines the structured predicate clothes.color = 'red' with the feature vector in one SQL-dialect hybrid query: SELECT * FROM clothes WHERE clothes.color = 'red' ORDER BY DISTANCE(clothes.feature, query_feature) LIMIT k;]

During the design and development of this system, we have encountered and addressed several vital challenges:

Real-time management of high-dimensional vectors. The feature vectors extracted from unstructured data are usually of extremely high dimensions. For example, at Alibaba, vectors for unstructured data can reach 500+ dimensions in many scenarios, such as online shopping applications. In addition, these vectors are generated in real time. Real-time management (i.e., CRUD operations) of such high-dimensional vectors is burdensome for existing databases and vector search engines. On one hand, online database systems with similarity search support (e.g., PostgreSQL and MySQL) only work for vectors of up to tens of dimensions. On the other hand, vector similarity search engines (such as Faiss) adopt ANNS (Approximate Nearest Neighbor Search) approaches [33, 23] to process and index high-dimensional vectors in an offline manner, and thus fail to handle real-time updates.

Hybrid query optimization. Hybrid queries open new opportunities for joint execution and optimization that consider both feature vectors and structured attributes. However, hybrid query optimization is inherently more complex than existing optimizations. Classical optimizers that support top-k operations [29, 20, 19] do not have to consider the accuracy issue, i.e., all query plans lead to identical execution results. For hybrid queries, however, approximate results are returned by ANNS (on the vectors) to avoid exhaustive search, and hence the accuracy of top-k operations varies with the choice of ANNS methods and parameter settings. Balancing the quality of approximate results against the query processing speed remains a non-trivial task.

High scalability and concurrency. In many of our production environments, vectors are managed at extremely large scales. For example, in a smart city transportation

• [...] data with real-time updates. To fulfill the real-time requirement, we adopt the lambda framework with different ANNS indexes for the streaming layer and the batching layer. The neighborhood-based ANNS methods in the streaming layer support real-time insertions but consume a lot of memory. The encoding-based ANNS methods in the batching layer consume much less memory, but require offline training before construction. The lambda framework can periodically merge newly ingested data from the streaming layer into the batching layer.

• A new ANNS algorithm. To improve the accuracy on large-scale vectors representing massive unstructured data, a novel ANNS index, called Voronoi Graph Product Quantization (VGPQ), is proposed. This algorithm can efficiently narrow down the search scope in the vector space compared to IVFPQ [23], with marginal overhead. According to the empirical study, VGPQ is more effective than IVFPQ for fast indexing and queries on massive vectors.

• Accuracy-aware cost-based hybrid query optimization. In ADBV, ANNS algorithms are wrapped as physical operators. Hence, we can rely on the query optimizer to efficiently and effectively support hybrid query processing. Physical operators in a relational database always return exact results. However, these newly introduced physical operators may not strictly follow relational algebra, and output approximate results instead. Due to the nature of approximation, we deliver new optimization rules to achieve the best query efficiency. These rules are naturally embedded in the optimizer of ADBV.

In the following sections, we will provide details of ADBV. §2 introduces the background