Part II: Multi-faceted Taxonomy Construction

KDD 2020 Tutorial: Embedding-Driven Multi-Dimensional Topic Mining and Text Analysis
Yu Meng, Jiaxin Huang, Jiawei Han, University of Illinois at Urbana-Champaign
August 23, 2020

Outline

q Taxonomy Basics and Construction

q What is a taxonomy and why use one?

q Parallel Concept Discovery: Entity Set Expansion

q Taxonomy Construction from Scratch

q Taxonomy Expansion

What is a Taxonomy?

❑ A taxonomy is a hierarchical organization of concepts
❑ Examples: Wikipedia categories, the ACM CCS Classification System, Medical Subject Headings (MeSH), the Amazon product category, the Yelp category list, WordNet, etc.

[Figure: example taxonomies — a WordNet fragment (artefact → motor vehicle → {go-kart, motorcar, truck}; motorcar → {hatch-back, compact, gas guzzler}) alongside the Wikipedia category tree, MeSH, and the Amazon product category]

Why do we need a Taxonomy?

❑ A taxonomy can benefit many knowledge-rich applications
❑ Knowledge organization
❑ Document categorization
❑ Recommender systems

[Figure: a multi-dimensional corpus index — papers organized along method (IR, ML, NLP), hardware (GPU, TPU), and year (2016, 2017, 2018) dimensions, supporting similarity search ("share features") and recommendation]

Two types of Taxonomy

❑ Clustering-based Taxonomy
❑ Instance-based Taxonomy

Multi-faceted Taxonomy Construction

❑ Limitations of existing taxonomies:
❑ A generic taxonomy has a fixed "is-a" relation between nodes
❑ It fails to adapt to a user's specific interest in special areas, because the hierarchical structure is dominated by irrelevant terms
❑ Multi-faceted Taxonomy
❑ Each facet reflects only a certain kind of relation between parent and child nodes in a field the user is interested in

[Figure: two facets of a multi-faceted taxonomy — one with relation IsLocatedIn, one with relation IsSubfieldOf]

Two stages in constructing a complete taxonomy

❑ Taxonomy Construction from Scratch
❑ Use a set of entities (possibly a small-scale seed taxonomy) and unstructured text data to build a taxonomy organized by certain relations
❑ Taxonomy Expansion
❑ Update an already constructed taxonomy by attaching new items to suitable nodes of the existing taxonomy. This step is useful because reconstructing a taxonomy from scratch can be resource-consuming.

Concept Expansion as a Flat Version

❑ If a seed taxonomy is provided by the user, we can gradually expand the hierarchical structure via two sub-tasks:
❑ (1) concept expansion, a flat version that expands a wide range of entities on the same level;
❑ (2) taxonomy construction, a hierarchical version that captures user-interested relations.

Outline

q Taxonomy Basics and Construction

q Parallel Concept Discovery: Entity Set Expansion

q EgoSet [WSDM' 16]

q SetExpan [ECML PKDD’17]

q SetCoExpan [WWW’20]

q CGExpan [ACL’20]

q Taxonomy Construction from Scratch

q Taxonomy Expansion

Automated Corpus-Based Set Expansion

q Corpus-based set expansion: find the "complete" set of entities belonging to the same semantic class, based on a given corpus and a tiny set of seeds
q Ex. 1: Given {Illinois, Maryland}, derive all U.S. states
q Ex. 2: Given {data mining, …}, derive CS disciplines
q Challenges: deal with noisy context features derived from free text, which may lead to entity intrusion and semantic drifting
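As a toy illustration of ranking by distributional similarity (the mini-corpus and helper names below are made up, not taken from any of the systems in this tutorial), candidates can be scored by how many context features they share with the seeds:

```python
# Minimal sketch: rank candidate entities by overlap of their bag-of-words
# context features with the seeds' context features.
from collections import Counter

def context_features(corpus, entity, window=2):
    """Collect context tokens around every mention of `entity`."""
    feats = Counter()
    for sent in corpus:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok == entity:
                feats.update(toks[max(0, i - window):i])   # left context
                feats.update(toks[i + 1:i + 1 + window])   # right context
    return feats

def expand(corpus, seeds, candidates, k=2):
    """Score candidates by feature overlap with the combined seed features."""
    seed_feats = Counter()
    for s in seeds:
        seed_feats.update(context_features(corpus, s))
    scored = []
    for c in candidates:
        cf = context_features(corpus, c)
        overlap = sum(min(cf[f], seed_feats[f]) for f in cf)
        scored.append((overlap, c))
    return [c for _, c in sorted(scored, reverse=True)[:k]]

corpus = [
    "the state of illinois passed a law",
    "the state of maryland passed a law",
    "the state of texas passed a law",
    "the city of chicago passed a law",
]
print(expand(corpus, ["illinois", "maryland"], ["texas", "chicago"], k=1))  # prints ['texas']
```

"texas" wins because its contexts ("state of … passed a") match the seeds' contexts more than "chicago"'s do; this is exactly the distributional-similarity signal, and also shows why noisy contexts cause entity intrusion.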

Previous Work on Set Expansion

q Search engine-based, online processing: e.g., Google Set, SEAL, Lyretail
q A query consisting of seed entities is submitted to a search engine
q Good quality, but time-consuming and costly
q Corpus-based set expansion: offline processing based on a specific corpus, with two approaches
q One-time entity ranking: similar entities appear in similar contexts
q Candidates are ranked once by their distributional similarity with the seeds
q A one-time ranking rarely recovers the full set; entity intrusion error: wrong entities intrude
q Iterative pattern-based bootstrapping:
q Extract quality patterns from the seeds, based on a predefined scoring mechanism
q Then apply the extracted patterns to obtain even higher-quality entities using another scoring method
q Semantic drifting: imperfect extraction leads to drift

EgoSet: Methodology

q Ontologies and skip-grams: combine existing user-generated ontologies (Wikipedia lists) with a novel word-similarity metric based on skip-grams
q Ego-network generation: treat words that are distributionally similar to the seed (the ego) as nodes, and use the pairwise similarity between those words to create weighted edges, thereby forming a word "ego-network"
q EgoSet discovery: use the ego-network to find the initial clusters for a seed, and align those clusters with user-created ontologies
• Ontologies are a natural source for set expansion
• "List-of" pages of Wikipedia were found to have the right combination of being prevalent and relatively "clean"

Outline

q Taxonomy Basics and Construction

q Parallel Concept Discovery: Entity Set Expansion

q EgoSet [WSDM' 16]

q SetExpan [ECML PKDD’17]

q SetCoExpan [WWW’20]

q CGExpan [ACL’20]

q Taxonomy Construction from Scratch

q Taxonomy Expansion

SetExpan: Context Feature Selection and Rank Ensemble

The SetExpan framework:

q Instead of using all context features, context features are carefully selected by calculating distributional similarity
q The high-quality feature pool is reset at the beginning of each iteration
q An unsupervised ranking-based ensemble method at each iteration is robust to noisy or wrongly extracted pattern features

Outline

q Taxonomy Basics and Construction

q Parallel Concept Discovery: Entity Set Expansion

q EgoSet [WSDM' 16]

q SetExpan [ECML PKDD’17]

q SetCoExpan [WWW’20]

q CGExpan [ACL’20]

q Taxonomy Construction from Scratch

q Taxonomy Expansion

SetCoExpan: Auxiliary Sets Generation and Set Co-Expansion

q J. Huang, Y. Xie, Y. Meng, J. Shen, Y. Zhang and J. Han, "Guiding Corpus-based Set Expansion by Auxiliary Sets Generation and Co-Expansion", WWW'20
q Semantic drifting problem:
q Existing set expansion algorithms typically bootstrap the given seeds by incorporating lexical patterns and distributional similarity
q Typical errors:
q Similar concept but wrong granularity
q Countries: (United States, Canada, China) → Ontario, Illinois, California
q Related but not similar concepts
q Sports_leagues: (NHL, NFL, NBA) → Chicago Rush, San Francisco Giants, Yankee
q Companies: (Google, Apple Inc., Microsoft) → google maps, ios, android
q Diseases: (lung cancer, AIDS, depression) → chest pain, headache, fever

Resolving the Semantic Drift Issue by Set Co-expansion

q Automatically generate auxiliary sets as negative sets that are closely related to the target set of the user's interest
q Auxiliary set: holds certain subtle relations in the embedding space with the user-interested semantic class
q Co-expand multiple coherent sets that are distinctive from one another
q Expand all the sets in parallel so they mutually enhance each other, by finding the most contrastive features that tell their differences
q Example:
q User input: Australia, France, Germany
q Generated auxiliary sets:
q (Provinces): Queensland, New South Wales, Saxony, Bavaria
q (Cities): Brisbane, Canberra, Rennes, Hamburg

General Framework of Set Co-expansion

q At each iteration of expanding the seed set, perform two steps:
q Auxiliary Sets Generation: run CatE, then clustering, to automatically generate auxiliary sets (related to, but different from, the semantic class of the user input)
q Multiple Sets Co-Expansion: expand multiple sets simultaneously by extracting the most contrastive features, expanding each set in the direction away from the other sets

Auxiliary Sets Generation

q Semantic learning and related-term retrieval for seed entities
q Generate representative terms for each entity in the seed set
q Cross-seed parallel relation clustering
q Intra-seed clustering: cluster terms related to each seed into initial semantic groups
q Inter-seed clustering: merge initial groups across different seeds using a group-matching equation

q Remove groups that cannot be matched across different seeds; retain only the cross-seed groups

Multiple Sets Co-Expansion

q Iteratively refine the feature pool and the candidate pool during set expansion
q The feature pool stores common context features of the seed entities that best distinguish the target semantic class from the auxiliary ones
q Skip-grams that make each set coherent while distinguishing different sets are encouraged
q The candidate pool stores the possible candidate entities to be expanded; they are narrowed down by co-occurrence with features in the feature pool
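A minimal sketch of the contrastive-feature idea, under the (hypothetical) assumption that each entity comes with a bag of pre-extracted skip-gram features; a feature is rewarded when frequent for the target seeds and penalized by its strongest presence in any auxiliary set:

```python
# Toy contrastive feature selection: keep skip-grams frequent for the target
# seed set but rare for every auxiliary (negative) set.
from collections import Counter

def feature_counts(entity_features, entity_set):
    total = Counter()
    for e in entity_set:
        total.update(entity_features.get(e, []))
    return total

def contrastive_features(entity_features, target_set, auxiliary_sets, k=2):
    target = feature_counts(entity_features, target_set)
    scored = {}
    for feat, cnt in target.items():
        # penalize a feature by its strongest presence in any auxiliary set
        penalty = max((feature_counts(entity_features, aux)[feat]
                       for aux in auxiliary_sets), default=0)
        scored[feat] = cnt - penalty
    return [f for f, _ in sorted(scored.items(), key=lambda x: -x[1])[:k]]

entity_features = {  # made-up skip-gram features per entity
    "france":   ["capital_of __", "located in __", "visited __"],
    "germany":  ["capital_of __", "located in __"],
    "bavaria":  ["located in __", "state of __"],
    "brisbane": ["located in __", "city of __"],
}
print(contrastive_features(entity_features,
                           ["france", "germany"],
                           [["bavaria"], ["brisbane"]]))
```

"capital_of __" scores highest: it is shared by the country seeds but absent from the province and city auxiliary sets, while "located in __" is shared with both auxiliary sets and gets penalized.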

Experiments and Performance Study

q Experiment datasets:
q Each class includes 5 queries from the same semantic class (e.g., Countries, Companies, Sports Leagues)
q Evaluation metric: Mean Average Precision (MAP)
q Methods compared:
q SetExpan (ECML-PKDD'17)
q SetExpander (EMNLP'18)
q CaSE (SIGIR'19)
q BERT (NAACL'19)

When the ranking list is longer (i.e., when the seed set gradually grows and more noise appears), SetCoExpan is able to steer the direction of expansion and set barriers that prevent out-of-category words from coming in

Case Studies

Negative seeds (auxiliary sets) generated for various queries (in Wiki Dataset)

Co-Expansion vs. Separate Expansion of Target Set and Auxiliary Sets

Outline

q Taxonomy Basics and Construction

q Parallel Concept Discovery: Entity Set Expansion

q EgoSet [WSDM' 16]

q SetExpan [ECML PKDD’17]

q SetCoExpan [WWW’20]

q CGExpan [ACL’20]

q Taxonomy Construction from Scratch

q Taxonomy Expansion

Class-guided Set Expansion: Overall Idea

q We propose to empower entity set expansion with class names automatically generated from pre-trained language models, which can help us identify unambiguous patterns and eliminate erroneous entities
q CGExpan: Class-Guided Set Expansion

Probing Queries

q Hearst patterns: a set of lexico-syntactic patterns indicating hypernym relations
q E.g., "countries such as US and China" → China, US ∈ Countries
q Probing query: a word sequence containing one [MASK] token
q Class-probing query: predict the class name of some given entities
q E.g., "[MASK] such as USA, China, and Canada"
q Entity-probing query: retrieve entities given a class name and some seed entities
q E.g., "Canada, [MASK], or other countries"
q By inputting a probing query, we can get the contextualized embedding of the [MASK] token and let the MLM predict the missing word
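The two query types can be built from simple Hearst-style templates; the template strings below are illustrative and not necessarily the exact ones CGExpan uses:

```python
# Build the two probing-query types as plain strings; a masked language model
# would then be asked to fill in the [MASK] token.
def class_probing_query(entities):
    """Ask the MLM to guess a class name for the given entities."""
    return "[MASK] such as " + ", ".join(entities[:-1]) + ", and " + entities[-1]

def entity_probing_query(class_name, seed):
    """Ask the MLM to guess a new entity of `class_name`."""
    return f"{seed}, [MASK], or other {class_name}"

print(class_probing_query(["USA", "China", "Canada"]))
# prints: [MASK] such as USA, China, and Canada
print(entity_probing_query("countries", "Canada"))
# prints: Canada, [MASK], or other countries
```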

CGExpan: Class Name Generation

q Iteratively submit class-probing queries to a language model to get multi-gram class names
q Repeat the process by randomly sampling entities
q Keep all generated class names that are noun phrases

CGExpan: Class Name Ranking

q For each class name, identify the top-k occurrences of an entity most similar to the embedding vector of an entity-probing query, and take their average similarity as the similarity between the entity and the class name
q Aggregate all ranked lists (one per entity) and select the top class name as the positive class name
q Select class names that rank lower than the positive class name in all lists corresponding to the initial seed set as negative class names

CGExpan: Class-Guided Entity Selection

q Prefer entities that appear at top positions in multiple entity rank lists
q Filter out entities that are more similar to a negative class name than to the positive one
q Assign higher scores to entities currently in the set
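One way to sketch this selection step (the reciprocal-rank scoring and the similarity inputs below are hypothetical simplifications of CGExpan's actual ranking):

```python
# Toy class-guided entity selection: reward entities ranking high in many
# per-entity rank lists, then drop entities closer to a negative class name
# than to the positive one. All similarity values are made up.
def select_entities(rank_lists, pos_sim, neg_sim, k=2):
    scores = {}
    for ranked in rank_lists:                  # one rank list per current entity
        for pos, ent in enumerate(ranked):
            scores[ent] = scores.get(ent, 0.0) + 1.0 / (pos + 1)
    # keep only entities more similar to the positive class name
    kept = {e: s for e, s in scores.items() if pos_sim[e] > neg_sim[e]}
    return sorted(kept, key=kept.get, reverse=True)[:k]

rank_lists = [["france", "berlin", "spain"], ["spain", "france", "berlin"]]
pos_sim = {"france": 0.9, "spain": 0.8, "berlin": 0.3}   # sim. to "countries"
neg_sim = {"france": 0.2, "spain": 0.3, "berlin": 0.7}   # sim. to "cities"
print(select_entities(rank_lists, pos_sim, neg_sim))     # prints ['france', 'spain']
```

"berlin" ranks reasonably high in the lists but is filtered out because it is closer to the negative class name "cities", which is exactly how the negative class names block erroneous entities.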

Case Study: Class Name Selection

Outline

q Taxonomy Basics and Construction
q Parallel Concept Discovery: Entity Set Expansion
q Taxonomy Construction from Scratch
q Instance-based Taxonomy Construction
q Hypernym-hyponym detection
q HiExpan: Task-guided Taxonomy Construction by Hierarchical Tree Expansion [KDD'18]
q Clustering-based Taxonomy Construction
q Taxonomy Expansion

Instance-based Taxonomy Construction: Overview

q Decompose taxonomy construction into multiple subtasks: hypernymy detection, then hypernymy organization
q End-to-end approach

[Figure: a text corpus and/or term list is the input; hypernymy detection extracts pairs (e.g., vertebrate → mammal, reptile; mammal → dog, panda, cat; reptile → lizard), which hypernymy organization assembles into a tree]

q Detection: pattern-based approach, supervised approach
q Organization: simple pruning heuristics, graph-based approach

Hypernymy Detection

q Pattern-based approach: use patterns to extract hypernym-hyponym relations from raw text
q Lexico-syntactic patterns: [Hearst'92], [Kozareva and Hovy'10], [Luu et al.'14]
q Supervised approach: train a classifier to predict whether two terms in the vocabulary hold a hypernymy relation
q Leverage multiple features:
q Term embeddings: [Fu et al.'14], [Yu et al.'15], [Luu et al.'16], [Weeds et al.'16]
q Dependency paths: [Snow et al.'04], [Snow et al.'06], [Shwartz et al.'16], [Mao et al.'18]
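A minimal Hearst-pattern extractor over raw text, using two illustrative patterns (real systems use a larger, more careful pattern inventory and linguistic preprocessing):

```python
# Extract (hyponym, hypernym) candidate pairs with two Hearst-style regexes.
import re

PATTERNS = [
    # "X such as A, B and C"  ->  (A, X), (B, X), (C, X)
    (r"(\w+) such as ((?:\w+, )*\w+(?: and \w+)?)", "forward"),
    # "A and other X"  ->  (A, X)
    (r"(\w+) and other (\w+)", "reverse"),
]

def extract_hypernymy(text):
    pairs = set()
    for pattern, direction in PATTERNS:
        for m in re.finditer(pattern, text):
            if direction == "forward":
                hyper, hypo_list = m.group(1), m.group(2)
                for hypo in re.split(r", | and ", hypo_list):
                    pairs.add((hypo, hyper))
            else:
                pairs.add((m.group(1), m.group(2)))
    return pairs

text = "We visited countries such as France, Spain and Italy. Cats and other animals sleep."
print(sorted(extract_hypernymy(text)))
```

The noisy output of such patterns (here clean only because the toy text is clean) is exactly what the organization stage on the next slides has to prune.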

Hypernymy Organization

q Simple pruning heuristics:
q Remove cycles [Kozareva and Hovy'10] [Faralli et al.'15]
q Retain the longest path [Kozareva and Hovy'10]
q Graph-based approach:
q Maximum spanning tree [Paola et al.'13] [Bansal et al.'14] [Zhang et al.'16] (figure credits to [Paola et al.'13])
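A sketch of one simple organization heuristic: keep, for every term, only its highest-weight hypernym edge, then verify the result is acyclic. The papers above use stronger machinery (e.g., maximum spanning trees over the full noisy graph), so this is illustrative only:

```python
# Turn noisy weighted hypernymy edges into a tree by best-parent selection.
def best_parent_tree(edges):
    """edges: (hyponym, hypernym, weight) triples from a noisy extractor."""
    parent = {}
    for child, par, w in edges:
        if child not in parent or w > parent[child][1]:
            parent[child] = (par, w)
    # reject assignments that create a cycle
    for child in list(parent):
        seen, node = {child}, parent[child][0]
        while node in parent:
            if node in seen:
                raise ValueError("cycle detected at " + node)
            seen.add(node)
            node = parent[node][0]
    return {c: p for c, (p, _) in parent.items()}

edges = [
    ("dog", "mammal", 0.9), ("dog", "animal", 0.4),
    ("mammal", "animal", 0.8), ("cat", "mammal", 0.7),
]
print(best_parent_tree(edges))
# prints: {'dog': 'mammal', 'mammal': 'animal', 'cat': 'mammal'}
```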

[Figure: noisy graph → trimmed graph with edge weights → induced DAG]

Limitations of Existing Methods

q Limitation: existing methods build a corpus-agnostic, task-agnostic taxonomy with mainly the is-a relation
q Inflexible semantics: cannot model flexible edge semantics (e.g., "country-state-city")
q Limited applicability: cannot fit user-specific application tasks

34 Outline

q Taxonomy Basics and Construction
q Parallel Concept Discovery: Entity Set Expansion
q Taxonomy Construction from Scratch
q Instance-based Taxonomy Construction
q Hypernym-hyponym detection
q HiExpan: Task-guided Taxonomy Construction by Hierarchical Tree Expansion [KDD'18]
q Clustering-based Taxonomy Construction
q Taxonomy Expansion

HiExpan: User/Task-Guided Taxonomy Construction

q Input: a user provides
q a domain-specific corpus, and
q a seed taxonomy as task guidance
q Model outputs:
q a corpus-dependent taxonomy tailored for the user's task
q Distinction: task-guided taxonomy construction
q Corpus-dependent
q Leverages the user's seed guidance

[Figure: text corpora plus a small "seed" taxonomy on location (Root → U.S. → {California, Illinois}) expand into a task-guided output taxonomy (Root → {U.S., China, Canada, …}, with states/provinces such as Texas, Arizona, Ontario, Quebec, Beijing, Shanghai as children)]

Shen, Jiaming, Zeqiu Wu, Dongming Lei, Chao Zhang, Xiang Ren, Michelle Vanni, Brian M. Sadler and Jiawei Han. "HiExpan: Task-Guided Taxonomy Construction by Hierarchical Tree Expansion." KDD (2018)

The HiExpan Framework & Width Expansion

q The HiExpan core idea: view all children under each taxonomy node as forming a coherent set, and build the taxonomy by expanding all these sets
q Use a set expansion algorithm to expand all sets
q Recursively expand the sets in a top-down fashion
q Width expansion: the width of the taxonomy tree increases (i.e., is expanded)

[Figure: width expansion over iterations — coherent set 1 under Root grows from {U.S., China} to {U.S., China, Canada, Mexico, Russia, India}; coherent set 2 under U.S. grows from {California, Illinois} to {California, Illinois, Texas, Arizona, Florida, Maryland}]

How to Dig Deeper? Cold-Start with Empty Initial Seed Set

q Newly added nodes in the taxonomy tree do not have any child nodes
q How do we acquire a target node's initial children?
q Depth expansion
q Based on U.S. (California, Illinois, …), find children of Canada (Ontario, Quebec, …), Mexico (…)
q Based on term embeddings and embedding-vector similarity
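Depth expansion can be sketched with embedding offsets: a candidate child of the new node should relate to it roughly the way a known child relates to its parent (California − U.S. ≈ Ontario − Canada). The 2-d vectors below are made up for illustration:

```python
# Toy depth expansion: rank candidate children of a new parent node by how
# well their offset matches a reference parent->child offset.
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def depth_expand(emb, parent, ref_parent, ref_child, candidates, k=1):
    """Find children of `parent` whose offset matches the reference offset."""
    ref_offset = sub(emb[ref_child], emb[ref_parent])
    ranked = sorted(candidates,
                    key=lambda c: cos(sub(emb[c], emb[parent]), ref_offset),
                    reverse=True)
    return ranked[:k]

emb = {  # 2-d toy embeddings: one "country" axis, one "state" axis
    "us": (1.0, 0.0), "california": (1.0, 1.0),
    "canada": (2.0, 0.0), "ontario": (2.0, 1.0), "mexico": (3.0, 0.0),
}
print(depth_expand(emb, "canada", "us", "california", ["ontario", "mexico"]))
# prints: ['ontario']
```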

[Figure: depth expansion — newly added nodes Canada and Mexico receive initial children (Ontario, Quebec; Sonora, Coahuila) by analogy with U.S. → {California, Illinois, Texas, Arizona}]

Outline

q Taxonomy Basics and Construction
q Parallel Concept Discovery: Entity Set Expansion
q Taxonomy Construction from Scratch
q Instance-based Taxonomy Construction
q Clustering-based Taxonomy Construction
q Hierarchical Topic Models
q TaxoGen: Constructing Topical Concept Taxonomy by Adaptive Term Embedding and Clustering [KDD'18]
q CoRel: Seed-Guided Topical Taxonomy Construction by Concept Learning and Relation Transferring [KDD'20]
q NetTaxo: Automated Topic Taxonomy Construction from Text-Rich Network [WWW'20]
q Taxonomy Expansion

Hierarchical Topic Models

q Use a cluster of terms (i.e., a topic) to represent a concept, and organize topics in a hierarchical way
q Different models pose different statistical assumptions on the data generation process
q Nested Chinese Restaurant Process:
q hLDA [Blei et al.'03], hLDA-nCRP [Blei et al.'10]
q Pachinko Allocation Model:
q PAM [Li and McCallum'06], hPAM [Mimno et al.'07]
q Dirichlet Forest Model:
q DF [Andrzejewski et al.'09], Guided HTM [Shin and Moon'17]

Example: hLDA

q Assume documents are generated by a nested Chinese Restaurant Process

[Figure: a topic tree generates the "observed" documents, e.g., "We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error." Figure credits to [Blei et al.'03]]

Example: hPAM

q Assume documents are generated by a mixture of super-topics and sub-topics

[Figure: super-topics and sub-topics jointly generate the same "observed" documents. Figure credits to [Mimno et al.'07]]

Outline

q Taxonomy Basics and Construction
q Parallel Concept Discovery: Entity Set Expansion
q Taxonomy Construction from Scratch
q Instance-based Taxonomy Construction
q Clustering-based Taxonomy Construction
q Hierarchical Topic Models
q TaxoGen: Constructing Topical Concept Taxonomy by Adaptive Term Embedding and Clustering [KDD'18]
q CoRel: Seed-Guided Topical Taxonomy Construction by Concept Learning and Relation Transferring [KDD'20]
q NetTaxo: Automated Topic Taxonomy Construction from Text-Rich Network [WWW'20]
q Taxonomy Expansion

TaxoGen: Unsupervised Construction with Term Embedding

❑ Automated construction of a topic taxonomy from documents
❑ Selected method: spherical clustering: use embeddings to find semantically consistent clusters
❑ Domain-specific terms can be clustered together
❑ "machine learning", "learning algorithm", …
❑ Where do the general terms go?
❑ "computer science", "method", "paper", …

[Figure: the topic dimension — "computer science" splits into "graphics" and "machine learning"]

Spherical Clustering + Local Embedding

After pushing up general terms, the remaining terms become more separable

❑ Design a ranking module to select representative phrases for each cluster
❑ Conduct comparative analysis (combining popularity and concentration)
❑ Does this phrase fit my cluster better than my siblings'?
❑ Push the background phrases back to the general node
❑ "computer science", "paper" → the higher-level node (root node)
❑ "machine learning", "ml", "classification" → the "ML" node
❑ Local embedding:
❑ For each "sub-topic" node, learn a local embedding only on relevant documents
❑ Only preserve information relevant to the "sub-topic"
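A minimal spherical k-means sketch (cosine similarity on unit-normalized vectors, with each centroid recomputed as the normalized mean of its members); the toy term vectors are made up:

```python
# Spherical k-means: cluster unit-normalized term vectors by cosine similarity.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return tuple(x / n for x in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(vectors, init_centroids, iters=10):
    centroids = [normalize(c) for c in init_centroids]
    assign = []
    for _ in range(iters):
        # assignment step: nearest centroid by cosine similarity
        assign = [max(range(len(centroids)),
                      key=lambda j: dot(normalize(v), centroids[j]))
                  for v in vectors]
        # update step: centroid = normalized mean of member vectors
        for j in range(len(centroids)):
            members = [normalize(v) for v, a in zip(vectors, assign) if a == j]
            if members:
                mean = tuple(sum(xs) / len(members) for xs in zip(*members))
                centroids[j] = normalize(mean)
    return assign

terms = {"svm": (1.0, 0.1), "neural_net": (0.9, 0.2),
         "rendering": (0.1, 1.0), "shading": (0.2, 0.9)}
vecs = list(terms.values())
print(spherical_kmeans(vecs, [(1.0, 0.0), (0.0, 1.0)]))  # prints [0, 0, 1, 1]
```

The ML terms and the graphics terms land in separate clusters; TaxoGen's adaptive refinement (pushing general terms up, re-embedding locally) sits on top of exactly this clustering step.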

Outline

q Taxonomy Basics and Construction
q Parallel Concept Discovery: Entity Set Expansion
q Taxonomy Construction from Scratch
q Instance-based Taxonomy Construction
q Clustering-based Taxonomy Construction
q Hierarchical Topic Models
q TaxoGen: Constructing Topical Concept Taxonomy by Adaptive Term Embedding and Clustering [KDD'18]
q CoRel: Seed-Guided Topical Taxonomy Construction by Concept Learning and Relation Transferring [KDD'20]
q NetTaxo: Automated Topic Taxonomy Construction from Text-Rich Network [WWW'20]
q Taxonomy Expansion

Seed-Guided Topical Taxonomy Construction

❑ Previous clustering-based methods generate generic topical taxonomies that cannot satisfy a user's specific interest in certain areas and relations; countless irrelevant terms and fixed "is-a" relations dominate the instance taxonomy.
❑ We study seed-guided topical taxonomy construction: the user gives a seed taxonomy as guidance, and a more complete topical taxonomy is generated from a text corpus, with each node represented by a cluster of terms (a topic).
❑ A user might want to learn about concepts in a certain aspect (e.g., food or research areas) from a corpus, for instance to know more about other kinds of food.

[Figure: Input 1 is a seed taxonomy (Root → Food → {Dessert, Seafood}); Input 2 is a corpus; the output is a topical taxonomy whose nodes are term clusters, e.g., Dessert → Cake {creme brulee, tiramisu, chocolate cake, cheesecake, bread pudding} and Seafood → Oysters {fresh oysters, raw oysters, fried oysters, shellfish}, Crabs {king crabs, snow crabs, stone crabs, crab legs}]

CoRel: Seed-Guided Topical Taxonomy Construction by Concept Learning and Relation Transferring

[Figure: the CoRel pipeline on the food example — Step 1: relation transferring upwards (discover roots such as Lunch, Food, Dish); Step 2: relation transferring downwards (discover new topics such as Pork); Step 3: concept learning for generating topical clusters]

Step 1: Learn a relation classifier and transfer the relation upwards to discover common root concepts of existing topics.
Step 2: Transfer the relation downwards to find new topics/subtopics as child nodes of roots/topics.
Step 3: Learn a discriminative embedding space to find distinctive terms for each concept node in the taxonomy.

Relation Learning

q We adopt a pre-trained deep language model to learn a relation classifier from only the user-given parent-child pairs.
q Training samples: we generate relation statements from the corpus as training samples for this classifier. We assume that if a parent-child pair co-occurs in a sentence in the corpus, that sentence implies their relation.
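Generating masked relation statements from co-occurring pairs can be sketched as follows (the token-level matching here is a simplification of CoRel's actual preprocessing):

```python
# For each (parent, child) pair, find sentences mentioning both terms and mask
# the two terms out; the masked sentence becomes a relation-statement sample.
def relation_statements(corpus, pairs):
    statements = []
    for parent, child in pairs:
        for sent in corpus:
            toks = sent.split()
            if parent in toks and child in toks:
                masked = ["[MASK]" if t in (parent, child) else t for t in toks]
                statements.append(("[CLS] " + " ".join(masked) + " [SEP]",
                                   parent, child))
    return statements

corpus = ["we don't serve desserts today except for ice_cream"]
pairs = [("desserts", "ice_cream")]
print(relation_statements(corpus, pairs))
```

Each statement, together with its relation label (parent→child, child→parent, or no relation), would then be fed to the language-model-based classifier described above.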

[Figure: the relation classifier — the masked sentence "[CLS] We don't serve [MASK] today ( except for [MASK] ) . [SEP]" (from "We don't serve desserts today ( except for ice_cream ).") is fed to the pre-trained language model; the two [MASK] embeddings are concatenated and passed through a linear layer + softmax over relation labels (R0: no user-interested relation, R1: E1→E2, R2: E2→E1)]

Relation Transferring

q We first transfer the relation upwards to discover possible root nodes (e.g., "Lunch" and "Food"), because a root node has more general contexts in which to find connections with potential new topics.
q We extract a list of parent nodes for each seed topic using the relation classifier. The common parent nodes shared by all user-given topics are treated as root nodes.
q To discover new topics (e.g., Pork), we transfer the relation downwards from these root nodes.

Relation Transferring

q We then transfer the relation downwards from each internal topic node to discover its subtopics.
q Each candidate term has multiple mentions in the corpus, leading to multiple relation statements. We count only the confident predictions, and if the majority of them judge the candidate term to be a child node of the topic, we retain the candidate term to be clustered later.
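The confidence-and-majority filter can be sketched as follows (the threshold value is an assumption, not from the paper):

```python
# Keep a candidate as a child of the topic only if the majority of its
# confident mention-level predictions say so.
def keep_as_child(predictions, min_conf=0.7):
    """predictions: (is_child, confidence) over the term's relation statements."""
    confident = [is_child for is_child, conf in predictions if conf >= min_conf]
    return bool(confident) and sum(confident) > len(confident) / 2

preds = [(True, 0.9), (True, 0.8), (False, 0.95), (False, 0.3)]
print(keep_as_child(preds))  # prints True
```

The low-confidence (False, 0.3) prediction is ignored, and two of the three remaining confident predictions agree, so the term is retained.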

Concept Learning

q Our concept learning module learns a discriminative embedding space in which each concept is surrounded by its representative terms. Within this embedding space, subtopic candidates are also clustered to form coherent subtopic nodes.
q Fine-grained concept names can be close in the embedding space, and directly using unsupervised embeddings might yield relevant but not distinctive terms (e.g., "food" is relevant to both "seafood" and "dessert").
q Therefore, we leverage a weakly-supervised text embedding framework to discriminate these concepts in the embedding space; this algorithm is introduced in the next section.
q Subtopics should satisfy two constraints:
q 1. They must belong to the representative words of the parent topic.
q 2. They must share parallel relations with the given seed taxonomy.

Qualitative Results

[Figure: qualitative results — a topical taxonomy over research areas. The seed taxonomy (Machine Learning, Data Mining, Natural Language Processing, with children such as Support Vector Machines, Decision Trees, Neural Networks, Text Mining, Web Mining, Association Rule Mining, Named Entity Recognition, Machine Translation, Information Extraction) is expanded with new topics (Image Processing, Information Retrieval, Computer Security, Pattern Recognition, Database) and new subtopics (Outlier Detection, Clustering, Data Stream Mining, Social Network Analysis, Hand-writing Recognition, Person Identification, Image Matching), each represented by a cluster of terms]

Qualitative Results

[Figure: qualitative results — a topical taxonomy over food. The seed topics (Dessert with Cake, Ice-cream, Pastries; Salad; Seafood) are expanded with new topics (Soup, Pork, Beef) and new subtopics (Crab, Shrimps, Oysters, Fish, Char siu, Pork Steak, Sausage), each represented by a term cluster, e.g., Dessert: {caramel, pudding, strawberry, cheesecake, chocolate}]

Outline

❑ Taxonomy Basics and Construction
❑ Parallel Concept Discovery: Entity Set Expansion
❑ Taxonomy Construction from Scratch
❑ Instance-based Taxonomy Construction
❑ Clustering-based Taxonomy Construction
❑ Hierarchical Topic Models
❑ TaxoGen: Constructing Topical Concept Taxonomy by Adaptive Term Embedding and Clustering [KDD'18]
❑ CoRel: Seed-Guided Topical Taxonomy Construction by Concept Learning and Relation Transferring [KDD'20]
❑ NetTaxo: Automated Topic Taxonomy Construction from Text-Rich Network [WWW'20]
❑ Taxonomy Expansion

NetTaxo: Automated Topic Taxonomy Construction from Text-Rich Network

❑ Besides leveraging unstructured text data, we can take the meta-data of documents into consideration and view the corpus as a text-rich network.
❑ Terms in scientific papers linked by the same venue or author can belong to the same research field, such as "social network" and "information cascade".

❑ A motif pattern Ω refers to a subgraph pattern at the meta level (i.e., every node is abstracted by its type).
❑ NetTaxo conducts a motif instance-level selection to pick the most informative network structures for better topic taxonomy construction.

Outline

q Taxonomy Basics and Construction

q Parallel Concept Discovery: Entity Set Expansion

q Taxonomy Construction from Scratch

q Taxonomy Expansion

Taxonomy Enrichment: Motivation

q Why taxonomy enrichment instead of construction from scratch?
q We already have a decent taxonomy built by experts and used in production
q Most common terms are covered
q New items (and thus new terms) arrive every day; we cannot afford to rebuild the whole taxonomy frequently
q Downstream applications require stable taxonomies to organize knowledge

Taxonomy Enrichment: Motivation

q What is missing then?
q Emerging terms take time for humans to discover
q Long-tail / fine-grained terms (leaf nodes) are likely to be neglected

Three Assumptions in Taxonomy Expansion

q First, we assume each concept has a textual name
q Therefore, we can get an initial feature vector for each concept in the existing taxonomy and for each new concept
q Second, we do not modify the existing taxonomy
q Modification of existing relations happens less frequently and usually requires great caution from human curators
q Third, we focus on finding the parent node(s) of each new concept
q A new concept's parent node(s) typically appear in the existing taxonomy, but its children node(s) may not

Taxonomy Expansion: Octet and TaxoExpan

q TaxoExpan: Self-supervised Taxonomy Expansion with Position-Enhanced Graph Neural Network [WWW'20]
q Octet: Online Catalog Taxonomy Enrichment with Self-Supervision [KDD'20]
q Two steps in solving the problem:
q Self-supervised term extraction: automatically extract emerging terms from a target domain
q Self-supervised term attachment: a multi-class classification that matches a new node to its potential parent
q Heterogeneous sources of information (structural, semantic, and lexical) can be used

Self-supervised Term Extraction

q Octet adapts a state-of-the-art sequence labeling method with BiLSTM-CRF + attention (Zheng et al., KDD'18)
q Self-supervision:
q Use existing nodes as the desired terms to be extracted
q No human effort needed

Self-supervised Term Attachment

q Octet combines structural, semantic, and lexical representations to learn a term-pair representation, which is fed into a two-layer network
q Structural representation: interactions among taxonomy nodes, items, and queries
q Semantic representation: word embedding-based features
q Lexical representation: surface string-level features (ends with, contains, suffix match, …)
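The lexical part can be sketched directly from the surface strings; the feature set below covers only the three examples named on the slide, and the encoding is a hypothetical simplification:

```python
# Surface string-level features for a candidate (parent, child) term pair.
def lexical_features(parent, child):
    p, c = parent.lower(), child.lower()
    return {
        "ends_with": c.endswith(p),                      # "running shoes" / "shoes"
        "contains": p in c,                              # substring containment
        "suffix_match": p.split()[-1] == c.split()[-1],  # last tokens agree
    }

print(lexical_features("shoes", "running shoes"))
print(lexical_features("food", "seafood"))
```

For ("shoes", "running shoes") all three fire; for ("food", "seafood") the character-level features fire but the token-level suffix match does not, which is why combining several surface signals helps.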

Self-supervised Term Attachment

q TaxoExpan uses a matching score for each query-anchor pair to indicate how likely the anchor concept is the parent of the query concept
q Key ideas:
q Represent the anchor concept using its ego network (egonet)
q Add position information (relative to the query concept) into this egonet

“high dependency unit” gemb Query: “high dependency unit” Query Concept ni q

g position embeddings “hospital” gemb emb “area” hq ni “room” “hospital” grandparent Concatenation “room” (0) parent (K) query ha sibling ha a h(0) h(K) representation The ego nodes b b matching degree “room” b Ego Network f(ni, ai) “hospital room” gemb of Anchor “hospital (0) (K) c hc …… hc Concept ai “court”room” g hidden layers emb (0) (K) h d hd hd G ai “intensive “operating room” “hospital room” “intensive care unit” gemb e (0) (K) care unit” “gallery” he he anchor “low dependency unit” representation “low dependency unit” “classroom” 65 Graph Propagation Module Graph Readout Module Matching Model 65 Leveraging Existing Taxonomy for Self-supervised Learning q How to learn model parameters without relying on massive human- labeled data? q An intuitive approach True example 1 Step 2: select a local 1 True position sub-graph around true 2 3 4 position 3 Existing 6 , taxonomy 1 5 6 7 7 Step False example 2 3 4 randomly 3a 1 Step 3b: select a local : sub-graph around false Query node false select 1 position a position 5 6 7 2 3 4 4 6 , Step 1: randomly select a “query False position node” in the existing taxonomy 766 66 5 6 7 Octet Framework Analysis

q Performance trade-off: how many terms can be attached if a specific precision of term attachment is required?
q Case studies: what if we relax the task to top-K prediction (instead of top-1 in Edge-F1)?
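The relaxed top-K setting can be made concrete with a tiny hit@K evaluation sketch: a prediction counts as correct if the true parent appears among the model's top-K ranked anchors (K = 1 recovers the strict setting). The toy predictions reuse case-study examples from the slides; the function name and data layout are assumptions.

```python
# Sketch of relaxed top-K evaluation for term attachment: a query is a
# "hit" at K if its true parent is within the top-K ranked candidates.
def hit_at_k(ranked_parents, true_parent, k):
    return true_parent in ranked_parents[:k]

predictions = {                      # query concept -> ranked candidate parents
    "syndactyla": ["ecology", "biology", "zoology"],
    "email hacking": ["computer security", "hacker"],
}
truth = {"syndactyla": "zoology", "email hacking": "computer security"}

for k in (1, 3):
    acc = sum(hit_at_k(predictions[q], truth[q], k) for q in truth) / len(truth)
    print(k, acc)  # hit@1 = 0.5, hit@3 = 1.0
```

As the toy numbers show, relaxing to top-K can substantially raise accuracy, which is why the case studies report where the "true" parent falls in the full ranking.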

67 TaxoExpan Framework Analysis

q Case studies on MAG-CS and MAG-Full datasets

[Case-study tables: each row lists a query concept, the top-2 predicted parents, and the "true" parent; e.g., "syndactyla" → predicted "ecology, biology", true parent "zoology"; "email hacking" → predicted "internet privacy, hacker", true parent "computer security"]

[Histograms of the rank of each query concept's "true" parent: (a) MAG-CS dataset (2,450 query concepts in total), with 37 queries (≈1.5%) ranked ≥ 1000; (b) MAG-Full dataset (37,804 query concepts in total), with 183 queries (≈0.48%) ranked ≥ 10000]

68 References: Concept Expansion + Hierarchical Topic Modeling

❑ Xin Rong, Zhe Chen, Qiaozhu Mei, and Eytan Adar. "EgoSet: Exploiting Word Ego-networks and User-generated Ontology for Multifaceted Set Expansion" (WSDM'16)
❑ Jiaming Shen, Zeqiu Wu, Dongming Lei, Jingbo Shang, Xiang Ren, and Jiawei Han. "SetExpan: Corpus-based Set Expansion via Context Feature Selection and Rank Ensemble" (ECMLPKDD'17)
❑ Jiaxin Huang, Yiqing Xie, Yu Meng, Jiaming Shen, Yunyi Zhang, and Jiawei Han. "Guiding Corpus-based Set Expansion by Auxiliary Sets Generation and Co-Expansion" (WWW'20)
❑ Yunyi Zhang, Jiaming Shen, Jingbo Shang, and Jiawei Han. "Empower Entity Set Expansion via Language Model Probing" (ACL'20)
❑ David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. "Hierarchical Topic Models and the Nested Chinese Restaurant Process" (NIPS'03)
❑ David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. "The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies" (Journal of the ACM, 2010)
❑ Wei Li and Andrew McCallum. "Pachinko Allocation: DAG-structured Mixture Models of Topic Correlations" (ICML'06)
❑ David M. Mimno, Wei Li, and Andrew McCallum. "Mixtures of Hierarchical Topics with Pachinko Allocation" (ICML'07)

69 References: Taxonomy Construction and Expansion

❑ Wei Li, David M. Blei, and Andrew McCallum. "Nonparametric Bayes Pachinko Allocation" (UAI'07)
❑ David Andrzejewski, Xiaojin Zhu, and Mark Craven. "Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors" (ICML'09)
❑ Su-Jin Shin and Il-Chul Moon. "Guided HTM: Hierarchical Topic Model with Dirichlet Forest Priors" (TKDE 2017)
❑ Chao Zhang, Fangbo Tao, Xiusi Chen, Jiaming Shen, Meng Jiang, Brian M. Sadler, Michelle Vanni, and Jiawei Han. "TaxoGen: Unsupervised Topic Taxonomy Construction by Adaptive Term Embedding and Clustering" (KDD'18)
❑ Jiaming Shen, Zeqiu Wu, Dongming Lei, Chao Zhang, Xiang Ren, Michelle T. Vanni, Brian M. Sadler, and Jiawei Han. "HiExpan: Task-Guided Taxonomy Construction by Hierarchical Tree Expansion" (KDD'18)
❑ Jiaxin Huang, Yiqing Xie, Yu Meng, Yunyi Zhang, and Jiawei Han. "CoRel: Seed-Guided Topical Taxonomy Construction by Concept Learning and Relation Transferring" (KDD'20)
❑ Jingbo Shang, Xinyang Zhang, Liyuan Liu, Sha Li, and Jiawei Han. "NetTaxo: Automated Topic Taxonomy Construction from Large-Scale Text-Rich Network" (WWW'20)
❑ Jiaming Shen, Zhihong Shen, Chenyan Xiong, Chi Wang, Kuansan Wang, and Jiawei Han. "TaxoExpan: Self-supervised Taxonomy Expansion with Position-Enhanced Graph Neural Network" (WWW'20)
❑ Yuning Mao, Tong Zhao, Andrey Kan, Chenwei Zhang, Xin Luna Dong, Christos Faloutsos, and Jiawei Han. "Octet: Online Catalog Taxonomy Enrichment with Self-Supervision" (KDD'20)

70 Q&A