AliCG: Fine-Grained and Evolvable Conceptual Graph Construction for Semantic Search at Alibaba

Ningyu Zhang1,2∗, Qianghuai Jia4∗, Shumin Deng1,2∗, Xiang Chen1,2, Hongbin Ye1,2, Hui Chen3, Huaixiao Tou3, Gang Huang5, Zhao Wang1, Nengwei Hua3, Huajun Chen1,2†

1 Zhejiang University, China & AZFT Joint Lab for Knowledge Engine, China
2 Hangzhou Innovation Center, China, Zhejiang University, China
3 Alibaba Group, China
4 AntGroup, China
5 Zhejiang Lab, China

{zhangningyu,231sm,xiang_chen,yehongbin,huanggang,zhao_wang,huajunsir}@zju.edu.cn, {qianghuai.jqh,weidu.ch,huaixiao.thx,nengwei.huanw}@alibaba-inc.com

arXiv:2106.01686v1 [cs.AI] 3 Jun 2021

ABSTRACT

Conceptual graphs, a particular type of knowledge graph, play an essential role in semantic search. Prior conceptual graph construction approaches typically extract high-frequency, coarse-grained, and time-invariant concepts from formal texts. In real applications, however, it is necessary to extract less-frequent, fine-grained, and time-varying conceptual knowledge and to build the taxonomy in an evolving manner. In this paper, we introduce an approach to implementing and deploying the conceptual graph at Alibaba. Specifically, we propose a framework called AliCG which is capable of a) extracting fine-grained concepts by a novel bootstrapping with alignment consensus approach, b) mining long-tail concepts with a novel low-resource phrase mining approach, and c) updating the graph dynamically via a concept distribution estimation method based on implicit and explicit user behaviors. We have deployed the framework at Alibaba UC Browser. Extensive offline evaluation as well as online A/B testing demonstrate the efficacy of our approach.

CCS CONCEPTS

• Information systems → Query representation; Information extraction.

KEYWORDS

Concept Mining; Taxonomy Construction; Knowledge Graph

ACM Reference Format:
Ningyu Zhang1,2∗, Qianghuai Jia4∗, Shumin Deng1,2∗, Xiang Chen1,2, Hongbin Ye1,2, Hui Chen3, Huaixiao Tou3, Gang Huang5, Zhao Wang1, Nengwei Hua3, Huajun Chen1,2†. 2021. AliCG: Fine-grained and Evolvable Conceptual Graph Construction for Semantic Search at Alibaba. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3447548.3467057

∗ Equal contribution and shared co-first authorship.
† Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
KDD ’21, August 14–18, 2021, Virtual Event, Singapore
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8332-5/21/08...$15.00
https://doi.org/10.1145/3447548.3467057

1 INTRODUCTION

Knowledge is important for text-related applications such as semantic search. Knowledge Graphs (KGs) organize facts in a structured graph as triples of the form <subject, predicate, object>, abridged as (s, p, o), where s and o denote entities and p builds relations between entities. A Conceptual Graph, which is a special type of KG, builds the semantic connections between concepts and has proven to be valuable in short text understanding [21], Word Sense Disambiguation [20], enhanced entity linking [5], semantic query rewriting [22], etc. Essentially, conceptualization helps humans generalize previously gained knowledge and experience to new settings, which may reveal paths to high-level cognitive System 2 [2] in a conscious manner. In real-life applications, the conceptual graph provides valuable knowledge to support many applications [25], such as semantic search. Web search engines (e.g., Google and Bing) leverage a taxonomy to better understand user queries and improve the search quality. Moreover, many online retailers (e.g., Alibaba and Amazon) organize products into categories of different granularities so that customers can easily search and navigate this category taxonomy to find the items they want to purchase.

In this paper, we introduce the Alibaba Conceptual Graph (AliCG), which is a large-scale conceptual graph of more than 5,000,000 fine-grained concepts, still growing fast, automatically extracted from noisy search logs. As shown in Figure 1, AliCG comprises four levels: level-1 consists of concepts expressing the domain that those instances belong to; level-2 consists of concepts referring to the type or subclass of instances; level-3 consists of concepts that are the fine-grained conceptualization of instances expressing the implicit user intentions; the instance layer includes all instances such as entities and non-entity phrases. AliCG is currently deployed at Alibaba to support a variety of business scenarios, including the product Alibaba UC Browser. AliCG has been applied to dozens of applications in Alibaba UC Browser, including intent classification, named entity recognition, query rewriting, and so on, and it receives more than 300 million requests per day.

Figure 1: Data hierarchy of Alibaba Conceptual Graph (AliCG) for semantic search. [The figure shows the root "Thing" above level-1, level-2, level-3, and instance nodes connected by weighted edges, together with the example query "Is it safe to eat durian during confinement?", whose concepts ("during confinement isA postpartum confinement period", "eat durian isA eat tropical fruit") are matched against document concepts at the concept level.]

Building AliCG is not a trivial task. Previous studies such as YAGO [18] and DBPedia [1] have investigated the extraction of knowledge from formal texts (e.g., Wikipedia). Probase [23] proposes an approach for extracting concepts from semi-structured Web documents. However, these approaches could not be adapted to our applications because several challenges remain unresolved.

Fine-grained Concept Acquisition. Conventional approaches are devoted to extracting coarse-grained concepts such as categories or types. However, in Alibaba’s scenario of questions, fine-grained concepts are necessary to increase the recall of answer results. For example, “grill (烤架)” is a “tool (工具)” and “scarf (围巾)” is “clothes (服饰)”. However, it would be more helpful if we could infer that a user searching for these items may be more interested in “barbecue tool (烧烤工具)” or “keep warm clothes (保暖服饰)” rather than another “tool (工具)” like “wrench (扳手)”; these concepts are rare in existing conceptual graphs.

Long-tail Concept Mining. Conventional approaches [23] generally extract concepts based on Hearst patterns, e.g., "especially" and "such as." However, these approaches cannot extract long-tail concepts from extremely short or noisy queries, which are common in search engines. For instance, it is non-trivial to extract the concept “rare mental disorder (罕见精神障碍)” of the instance “body integrity identity disorder (身体完整性认同障碍)” from the search log, as only 35 instances mentioned “rare mental disorder”. It is rather difficult to extract such concepts from the short text with pattern matching (the pattern is too general [6–8, 26, 27], and there is little context information as well as few co-occurrence samples), as Figure 2 shows. Besides, there exist lots of scattered concepts in user search engines such as “traditional activities Tibetan New Year (藏历新年习俗)”. Recent approaches usually regard such a concept extraction procedure as a sequence labeling task [13], which relies on a tremendous amount of training data for each concept.

[...] edges, it is necessary to align those nodes with the same meaning in the conceptual graph. Besides, as there are many multiple-to-multiple nodes in the conceptual graph, it is prohibitive to update such complex graphs over time. In other words, it is difficult to estimate the confidence distribution of the concepts given instances.

To address the challenges mentioned above, we propose the following contributions in the design of AliCG:

First, we propose a novel bootstrapping with alignment consensus approach to tackle the first challenge of extracting fine-grained concepts from noisy search logs. Specifically, we utilize a small number of predefined string patterns to extract concepts, which are then used to expand the pool of such patterns. Further, the newly mined concepts are verified with query-title alignment; that is, an essential concept in a query should repeat several times in the document titles frequently clicked by the user. Second, we introduce a novel conceptualized phrase mining and self-training with an ensemble consensus approach to extract long-tail concepts. On the one hand, we extend the off-the-shelf phrase mining algorithm with conceptualized features to mine concepts in an unsupervised manner. On the other hand, we propose a novel low-resource sequence tagging framework, namely, self-training with an ensemble consensus, to extract those scattered concepts. Finally, we propose a novel con-
