Neural Article Pair Modeling for Wikipedia Sub-article Matching? Muhao Chen1, Changping Meng2, Gang Huang3, and Carlo Zaniolo1 1 University of California, Los Angeles, CA, USA fmuhaochen,
[email protected] 2 Purdue University, West Lafayette, IN, USA
[email protected] 3 Google, Mountain View, CA, USA
[email protected] Abstract. Nowadays, editors tend to separate different subtopics of a long Wiki- pedia article into multiple sub-articles. This separation seeks to improve human readability. However, it also has a deleterious effect on many Wikipedia-based tasks that rely on the article-as-concept assumption, which requires each entity (or concept) to be described solely by one article. This underlying assumption significantly simplifies knowledge representation and extraction, and it is vi- tal to many existing technologies such as automated knowledge base construc- tion, cross-lingual knowledge alignment, semantic search and data lineage of Wikipedia entities. In this paper we provide an approach to match the scattered sub-articles back to their corresponding main-articles, with the intent of facilitat- ing automated Wikipedia curation and processing. The proposed model adopts a hierarchical learning structure that combines multiple variants of neural docu- ment pair encoders with a comprehensive set of explicit features. A large crowd- sourced dataset is created to support the evaluation and feature extraction for the task. Based on the large dataset, the proposed model achieves promising results of cross-validation and significantly outperforms previous approaches. Large-scale serving on the entire English Wikipedia also proves the practicability and scala- bility of the proposed model by effectively extracting a vast collection of newly paired main and sub-articles.