Mining Text Outliers in Document Directories
ICDM 2020: 20th IEEE International Conference on Data Mining
Edouard Fouché*, Y. Meng**, F. Guo**, H. Zhuang**, K. Böhm* & J. Han**
October 19, 2020
*Karlsruhe Institute of Technology (KIT), **University of Illinois at Urbana-Champaign
KIT - The Research University in the Helmholtz Association (www.kit.edu)

Outline: Introduction, Related Work, Our Method, Experiments, Conclusions

This talk is about...
Classifying documents into directories
- For example: emails, news articles, research papers...
- Jobs
    Email 1: Re: Your application...
    Email 2: Interview scheduling...
    Email 3: Preparing your internship...
    Email 4: Self-supervised methods... [M]
- Research
    Email 5: Fw: Grant Proposal...
    Email 6: Support Vector Machines...
- Fun
    Email 7: Re: The best cat memes...
    Email 8: ICDM Registration... [O] (Conference? Travel?)
    Email 9: Get-together, RSVP...
Documents may be classified wrongly:
- Type M: Misclassification (wrong folder)
- Type O: Out-of-distribution (no adequate folder)
We see those mistakes as semantic "outliers"
We present an approach to mine both outlier types simultaneously

Why are there such outliers? (1/2)
We are drowning in data
- Document repositories have grown very large
- They tend to be highly multi-modal: many classes, folders
[Figure: two plots over the years 1991 to 2019. Left: number of articles on arXiv by field (math, physics+gr-qc+nlin+nucl+quant-ph, cond-mat, astro-ph, hep-*, cs, stat, q-bio). Right: number of CS articles on arXiv by subcategory. Data obtained from https://www.kaggle.com/Cornell-University/arxiv]

Why are there such outliers? (2/2)
Maintenance is difficult
- Human classification: sloppy, unreliable
- Handled by many/different users
- Different user → different classification
[Image: The Daily Struggle meme, Jake Clark, modified (ML/AI)]

Why are they difficult to find? (1/2)
Ambiguity
- Documents may have complex semantics
- Sometimes the correct class is not yet known, e.g., an emerging field
- Folder structures are domain/user-specific → unsupervised
[Image: Rabbit, duck, both? Rabbit and Duck, Fliegende Blätter, 23 October 1892, public domain]

Why are they difficult to find? (2/2)
Type O/M must be detected simultaneously
- Type O outliers may be detected as Type M by mistake
- The noise from Type M outliers hinders the detection of Type O
[Figure: example documents marked O and M]

Our Contributions
- We explore the problem of mining text outliers in document directories
    We are the first to distinguish between Type O/M outliers
- We propose a new approach to detect text outliers
    kj-Nearest Neighbours (kj-NN)
    Exploit similarities between documents/phrases
    Extract semantic labels and similar documents → interpretability
- We provide an extensive evaluation
    Improve the current state of the art (real-world/synthetic data)
    Interpretable results
- Code & data: https://github.com/edouardfouche/MiningTextOutliers
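
To make the Type M / Type O distinction concrete, here is a minimal sketch of a plain k-nearest-neighbour heuristic. It is not the kj-NN method from the talk, only an illustration under stated assumptions: the 3-d vectors stand in for document embeddings, the folder names are toy labels, and `flag_outliers`, `k` and `min_sim` are hypothetical names and values chosen for this example. A document whose nearest neighbours are all dissimilar is flagged as Type O; a document whose neighbours mostly sit in another folder is flagged as Type M.

```python
import math
from collections import Counter

# Toy "embeddings" and folder assignments (made up for illustration only).
VECTORS = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0],   # filed under Jobs
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.2, 0.8, 0.0],   # filed under Research
    [0.05, 0.95, 0.0],                                    # filed under Jobs, looks like Research
    [0.0, 0.0, 1.0],                                      # unlike anything else
]
FOLDERS = ["Jobs", "Jobs", "Jobs", "Research", "Research", "Research",
           "Jobs", "Research"]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def flag_outliers(vectors, folders, k=3, min_sim=0.5):
    """Return 'ok', 'M' or 'O' per document (hypothetical heuristic, not kj-NN)."""
    labels = []
    for i, vec in enumerate(vectors):
        # The k most similar other documents, as (similarity, folder) pairs.
        neighbours = sorted(
            ((cosine(vec, other), folders[j])
             for j, other in enumerate(vectors) if j != i),
            reverse=True,
        )[:k]
        mean_sim = (sum(s for s, _ in neighbours) / len(neighbours)
                    if neighbours else 0.0)
        if not neighbours or mean_sim < min_sim:
            labels.append("O")   # far from everything: no adequate folder
            continue
        majority, _ = Counter(f for _, f in neighbours).most_common(1)[0]
        # Neighbours agree on another folder: likely filed in the wrong place.
        labels.append("ok" if majority == folders[i] else "M")
    return labels


if __name__ == "__main__":
    for folder, label in zip(FOLDERS, flag_outliers(VECTORS, FOLDERS)):
        print(folder, label)
```

On this toy data the seventh document is flagged as Type M and the eighth as Type O. The sketch also hints at why the two types interact: the misfiled document appears among the nearest neighbours of every Research document, so with a smaller `k` its wrong folder label could tip their majority vote.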