Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia

Yilun Lin, Northwestern University, Evanston, Illinois, [email protected]
Bowen Yu and Andrew Hall, University of Minnesota, Minneapolis, Minnesota, {bowen, hall}@cs.umn.edu
Brent Hecht, Northwestern University, Evanston, Illinois, [email protected]

ABSTRACT
Wikipedia-based studies and systems frequently assume that no two articles describe the same concept. However, in this paper, we show that this article-as-concept assumption is problematic due to editors’ tendency to split articles into parent articles and sub-articles when articles get too long for readers (e.g., “Portland, Oregon” and “History of Portland, Oregon” in the English Wikipedia). In this paper, we present evidence that this issue can have significant impacts on Wikipedia-based studies and systems and introduce the sub-article matching problem. The goal of the sub-article matching problem is to automatically connect sub-articles to parent articles to help Wikipedia-based studies and systems

semantic web engines (e.g., [2,49,55]) to natural language understanding systems (e.g., [16,37,52]).

A key assumption in many Wikipedia-based studies and systems is that there is a one-to-one mapping between a concept and the Wikipedia article that describes the concept. This assumption, which we call the article-as-concept assumption, supposes that the entire description of a given concept in a given Wikipedia language edition can be found in a single Wikipedia article. For example, under the article-as-concept assumption, the entirety of the English description of Portland, Oregon should be in the “Portland, Oregon” article, and that article alone.