Mining Text Outliers in Document Directories ICDM 2020: 20Th IEEE International Conference on Data Mining Edouard Fouche´ ∗, Y

Mining Text Outliers in Document Directories ICDM 2020: 20Th IEEE International Conference on Data Mining Edouard Fouche´ ∗, Y

Mining Text Outliers in Document Directories ICDM 2020: 20th IEEE International Conference on Data Mining Edouard Fouche´ ∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ j October 19, 2020 ∗KARLSRUHE INSTITUTE OF TECHNOLOGY (KIT), ∗∗UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN KIT – The Research University in the Helmholtz Association www.kit.edu This talk is about. Classifying documents into directories For example: emails, news articles, research papers . Jobs Research Fun Email 1: Re: Your application... Email 5: Fw: Grant Proposal... Email 7: Re: e best cat memes... Email 2: Interview scheduling... Email 6: Support Vector Machines... Email 8: ICDM Registration... Email 3: Preparing your internship... Email 9: Get-together, RSVP... Email 4: Self-supervised methods... Documents may be classified wrongly: Type M: Misclassification (wrong folder) Type O: Out-of-distribution (no adequate folder) We see those mistakes as semantic “outliers” We present an approach to mine both outliers types simultaneously Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 2/18 This talk is about. Classifying documents into directories For example: emails, news articles, research papers . Jobs Research Fun Email 1: Re: Your application... Email 5: Fw: Grant Proposal... Email 7: Re: e best cat memes... Email 2: Interview scheduling... Email 6: Support Vector Machines... Email 8: ICDM Registration... Email 3: Preparing your internship... Email 9: Get-together, RSVP... Email 4: Self-supervised methods... Documents may be classified wrongly: Type M: Misclassification (wrong folder) Type O: Out-of-distribution (no adequate folder) We see those mistakes as semantic “outliers” We present an approach to mine both outliers types simultaneously Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 2/18 This talk is about. Classifying documents into directories For example: emails, news articles, research papers . Jobs Research Fun Email 1: Re: Your application... Email 5: Fw: Grant Proposal... Email 7: Re: e best cat memes... Email 2: Interview scheduling... Email 6: Support Vector Machines... Email 8: ICDM Registration... Email 3: Preparing your internship... Email 9: Get-together, RSVP... Email 4: Self-supervised methods... M Documents may be classified wrongly: Type M: Misclassification (wrong folder) Type O: Out-of-distribution (no adequate folder) We see those mistakes as semantic “outliers” We present an approach to mine both outliers types simultaneously Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 2/18 This talk is about. Classifying documents into directories For example: emails, news articles, research papers . Conference? Jobs Research Fun Travel? Email 1: Re: Your application... Email 5: Fw: Grant Proposal... Email 7: Re: e best cat memes... Email 2: Interview scheduling... Email 6: Support Vector Machines... Email 8: ICDM Registration... O Email 3: Preparing your internship... Email 9: Get-together, RSVP... Email 4: Self-supervised methods... M Documents may be classified wrongly: Type M: Misclassification (wrong folder) Type O: Out-of-distribution (no adequate folder) We see those mistakes as semantic “outliers” We present an approach to mine both outliers types simultaneously Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 2/18 This talk is about. Classifying documents into directories For example: emails, news articles, research papers . Conference? Jobs Research Fun Travel? Email 1: Re: Your application... Email 5: Fw: Grant Proposal... Email 7: Re: e best cat memes... Email 2: Interview scheduling... Email 6: Support Vector Machines... Email 8: ICDM Registration... O Email 3: Preparing your internship... Email 9: Get-together, RSVP... Email 4: Self-supervised methods... M Documents may be classified wrongly: Type M: Misclassification (wrong folder) Type O: Out-of-distribution (no adequate folder) We see those mistakes as semantic “outliers” We present an approach to mine both outliers types simultaneously Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 2/18 This talk is about. Classifying documents into directories For example: emails, news articles, research papers . Conference? Jobs Research Fun Travel? Email 1: Re: Your application... Email 5: Fw: Grant Proposal... Email 7: Re: e best cat memes... Email 2: Interview scheduling... Email 6: Support Vector Machines... Email 8: ICDM Registration... O Email 3: Preparing your internship... Email 9: Get-together, RSVP... Email 4: Self-supervised methods... M Documents may be classified wrongly: Type M: Misclassification (wrong folder) Type O: Out-of-distribution (no adequate folder) We see those mistakes as semantic “outliers” We present an approach to mine both outliers types simultaneously Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 2/18 Why are there such outliers? (1/2) We drawning in data Document repositories have grown very large They tend to be highly multi-modal: many classes, folders Number of articles on ArXiv Number of CS articles on ArXiv 1.6e+06 math CV CY OH 2.0e+05 physics+gr-qc+nlin+nucl+quant-ph IT GT SD 1.4e+06 cond-mat 1.8e+05 LG IR MM astro-ph CL CC MA 1.2e+06 hep-* 1.5e+05 AI DM NA 1.0e+06 cs NI DB ET stat 1.2e+05 DS NE GR 8.0e+05 q-bio CR HC AR 1.0e+05 DC PL SC 6.0e+05 7.5e+04 LO CG PF RO DL MS 4.0e+05 5.0e+04 SI FL OS SE CE GL 2.0e+05 2.5e+04 SY 0.0e+00 0.0e+00 1991 1995 1999 2003 2007 2011 2015 2019 1991 1995 1999 2003 2007 2011 2015 2019 1 1 Data obtained from https://www.kaggle.com/Cornell-University/arxiv Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 3/18 Why are there such outliers? (2/2) Maintenance is difficult Human classification: sloppy, unreliable Handled by many/different users Different user ! Different classification 2 2 The Daily Struggle Meme – Jake Clark – Modified (ML/AI) Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 4/18 Why are they difficult to find? (1/2) Ambiguity Documents may have complex semantic Sometimes, the correct class is unknown yet, e.g., an emerging field Folder structures are domain/user-specific ! unsupervised Rabbit, duck, both?3 3 Rabbit and Duck – Fliegende Blatter,¨ 23 October 1892 – Public Domain Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 5/18 Why are they difficult to find? (2/2) Type O/M must be detected simultaneously Type O outliers may be detected as Type M by mistake The noise from Type M outliers hinders the detection of Type O Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 6/18 Why are they difficult to find? (2/2) Type O/M must be detected simultaneously O O Type O outliers may be detected as Type M by mistake The noise from Type M outliers hinders the detection of Type O Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 6/18 Why are they difficult to find? (2/2) Type O/M must be detected simultaneously M O M M M M O Type O outliers may be detected as Type M by mistake The noise from Type M outliers hinders the detection of Type O Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 6/18 Our Contributions We explore the problem of mining text outliers in document directories We are first to distinguish between Type O/M outliers We propose a new approach to detect text outliers kj-Nearest Neighbours (kj-NN) Exploit similarities between documents/phrases Extract semantic labels and similar documents ! interpretability We provide an extensive evaluation Improve the current state of the art (real-world/synthetic data) Interpretable results Code & data: https://github.com/edouardfouche/MiningTextOutliers Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 7/18 Our Contributions We explore the problem of mining text outliers in document directories We are first to distinguish between Type O/M outliers We propose a new approach to detect text outliers kj-Nearest Neighbours (kj-NN) Exploit similarities between documents/phrases Extract semantic labels and similar documents ! interpretability We provide an extensive evaluation Improve the current state of the art (real-world/synthetic data) Interpretable results Code & data: https://github.com/edouardfouche/MiningTextOutliers Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y. Meng∗∗, F. Guo∗∗, H. Zhuang∗∗, K. Bohm¨ ∗ & J. Han∗∗ – kj-NN October 19, 2020 7/18 Our Contributions We explore the problem of mining text outliers in document directories We are first to distinguish between Type O/M outliers We propose a new approach to detect text outliers kj-Nearest Neighbours (kj-NN) Exploit similarities between documents/phrases Extract semantic labels and similar documents ! interpretability We provide an extensive evaluation Improve the current state of the art (real-world/synthetic data) Interpretable results Code & data: https://github.com/edouardfouche/MiningTextOutliers Introduction Related Work Our Method Experiments Conclusions Edouard Fouche´∗, Y.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    49 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us