Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia
(based on our CSCW '17 paper)
Allen Yilun Lin1, Bowen Yu2, Andrew Hall2, Brent Hecht1
1. People, Space, and Algorithms (PSA) Computing Group, Northwestern University
2. GroupLens Research, University of Minnesota, Twin Cities
Article-as-Concept Assumption
Concepts | Wikipedia article
Justin Trudeau | Justin Trudeau
Angela Merkel | Angela Merkel
Theresa May | Theresa May
Images from: https://icons8.com/icon/10904/wikipedia
What’s the problem?
Example 1: Compare multilingual Wikipedias
English wiki vs. Chinese wiki: Family information
Family info on English Wikipedia: # of words: 559; % of the article: 3.86%
Family info on Chinese Wikipedia: # of characters: 578; % of the article: 4.42%
Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://zh.wikipedia.org/wiki/%E5%94%90%E7%B4%8D%C2%B7%E5%B7%9D%E6%99%AE
Family info on English Wikipedia: # of words: 2340; % of the article: 13.91%
Family info on Chinese Wikipedia: # of characters: 578; % of the article: 4.42%
Images from: https://en.wikipedia.org/wiki/Family_of_Donald_Trump
Example 2: Artificial Intelligence using Wikipedia
Images credit: Brent Hecht
Milne-Witten Semantic Relatedness
Figure from Witten, Ian, and David Milne. "An effective, low-cost measure of semantic relatedness obtained from Wikipedia links." AAAI Workshop 2008
[Figure: inlinks and outlinks of Donald Trump and Vladimir Putin. Linked articles shown: Rex Tillerson, G20, Michael Flynn, Republican, KGB, Kremlin, US-Mexican border, 2014 Winter Olympics, Trump Organization, The Apprentice, Economy of Russia, APEC, ExxonMobil, Most powerful people, ...]
Figure inspired by Witten, Ian, and David Milne. "An effective, low-cost measure of semantic relatedness obtained from Wikipedia links." AAAI Workshop 2008
Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://en.wikipedia.org/wiki/Foreign_policy_of_the_Donald_Trump_administration
[Figure: the same link graph with three more linked articles: Annexation of Crimea by Russia, 2014 Ukrainian Revolution, Russian-US relations]
Image source: https://www.christart.com/clipart/image/off-target
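The effect on link-based relatedness can be sketched in a few lines. This is a minimal illustration, not the system described in the talk: it uses the published Milne-Witten link-overlap formula, and the assignment of the slide's link labels to each article below is made up for the example. The corpus size of 5,000,000 is a round number and also an assumption.

```python
import math


def milne_witten_relatedness(links_a, links_b, total_articles):
    """Link-overlap relatedness in [0, 1] following Milne and Witten (2008)."""
    a, b = set(links_a), set(links_b)
    shared = a & b
    if not shared:
        return 0.0
    larger, smaller = max(len(a), len(b)), min(len(a), len(b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator <= 0:
        return 0.0
    distance = (math.log(larger) - math.log(len(shared))) / denominator
    return max(0.0, 1.0 - distance)


# Hypothetical link sets: the labels come from the slide, but which article
# each label links to is an assumption made only for this illustration.
trump = {"Rex Tillerson", "Michael Flynn", "Republican", "US-Mexican border",
         "Trump Organization", "The Apprentice", "ExxonMobil",
         "Most powerful people"}
putin = {"Rex Tillerson", "G20", "KGB", "Kremlin", "2014 Winter Olympics",
         "Economy of Russia", "APEC", "Most powerful people",
         "Annexation of Crimea by Russia", "2014 Ukrainian Revolution",
         "Russian-US relations"}
foreign_policy_sub_article = {"G20", "Annexation of Crimea by Russia",
                              "2014 Ukrainian Revolution",
                              "Russian-US relations"}

TOTAL_ARTICLES = 5_000_000  # assumed corpus size

# Parent article only (the article-as-concept assumption).
before = milne_witten_relatedness(trump, putin, TOTAL_ARTICLES)
# Parent article merged with its sub-article.
after = milne_witten_relatedness(trump | foreign_policy_sub_article, putin,
                                 TOTAL_ARTICLES)
```

With these made-up sets, `after` is larger than `before`, because the links that connect the two concepts live in the sub-article rather than in the parent article.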
Concepts | Wikipedia article
Donald Trump | Donald Trump (parent article)
Donald Trump | Family of Donald Trump (sub-article)
Donald Trump | Foreign Policy of Trump Administration (sub-article)
… | ...
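The one-concept-to-many-articles view can be represented with a simple mapping. The sketch below is only an illustration; the function name `group_articles_by_concept` and the input format (a list of parent/sub-article title pairs) are assumptions, not part of the talk.

```python
def group_articles_by_concept(parent_sub_pairs):
    """Map each parent article title to the list of articles covering its concept.

    The parent article always comes first, followed by its sub-articles in
    the order they were seen.
    """
    concepts = {}
    for parent, sub in parent_sub_pairs:
        articles = concepts.setdefault(parent, [parent])
        if sub not in articles:
            articles.append(sub)
    return concepts


pairs = [
    ("Donald Trump", "Family of Donald Trump"),
    ("Donald Trump", "Foreign policy of the Donald Trump administration"),
]
concept_articles = group_articles_by_concept(pairs)
```

A downstream tool could then read every article in `concept_articles["Donald Trump"]` instead of the parent article alone.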
Outline
1. Motivation and problem
2. Step 1: Filter sub-article candidates
3. Step 2: Develop training datasets
4. Step 3: Classify sub-article relationship
5. Discussion: audience-author mismatch
6. Summary
Step 1: Filter sub-article candidates
... > 5 million articles ...
Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://en.wikipedia.org/static/images/project-logos/enwiki.png
Reduce the # of comparisons?
{{Further|Trump Family}}
{{Main article|Family of Donald Trump}}
Images from: https://en.wikipedia.org/wiki/Donald_Trump
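The candidate filter in Step 1 can be sketched as a scan of an article's wikitext for the two templates shown on the slide. This is a rough sketch under stated assumptions: the function name `subarticle_candidates` is hypothetical, only the `Further` and `Main article` templates are handled, and nested templates are ignored.

```python
import re

# Template names taken from the slide; a real filter may need more of them.
TEMPLATE_NAMES = ("main article", "further")

_TEMPLATE_RE = re.compile(
    r"\{\{\s*(" + "|".join(re.escape(name) for name in TEMPLATE_NAMES)
    + r")\s*\|([^{}]*)\}\}",
    re.IGNORECASE,
)


def subarticle_candidates(wikitext):
    """Return (template, target title) pairs found in the wikitext."""
    candidates = []
    for match in _TEMPLATE_RE.finditer(wikitext):
        template = match.group(1).lower()
        for param in match.group(2).split("|"):
            param = param.strip()
            # Skip empty parameters and named parameters such as "l1=...".
            if not param or "=" in param:
                continue
            # Drop a section anchor such as "Title#Section".
            title = param.split("#", 1)[0].strip()
            if title:
                candidates.append((template, title))
    return candidates


sample = (
    "== Family ==\n"
    "{{Further|Trump Family}}\n"
    "{{Main article|Family of Donald Trump}}\n"
    "Some section text.\n"
)
found = subarticle_candidates(sample)
```

Only the pairs returned here would be passed on to the classifier, which avoids comparing the parent article against every other article in the encyclopedia.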
Step 2: Develop training datasets
High-Interest: Donald Trump; Melania Trump; La La Land; Super Bowl Champions
Random: Bunbury, Western Australia; Jean Pasqualini; List of tallest buildings in China
Ad-Hoc: Software engineering; Geostatistics; Feminism
Step 3: Classify sub-article relationship
Relative importance/centrality
E.g. IndegreeRatio = (incoming links of parent article) / (incoming links of sub-article)
Implicit linguistic
E.g. MaxMainTFinSub = max{TF of parent article in sub-article’s summary}
Parent article: Donald Trump; sub-article candidate: Foreign policy of Donald Trump
MaxMainTFinSub{Donald Trump, Foreign policy of Donald Trump} = 2
Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://en.wikipedia.org/wiki/Foreign_policy_of_the_Donald_Trump_administration
Explicit template
E.g. MainTemplateRatio = (# langs using main template) / (# langs where the two articles coexist)
Main template used in English and French Wikipedia; the two articles coexist in 5 languages
MainTemplateRatio{Donald Trump, Foreign policy of Donald Trump} = 2/5
Images from: https://en.wikipedia.org/wiki/Donald_Trump
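The three example features can be sketched as small functions. This is an illustration under assumptions, not the released system: the function names, the per-language dictionaries, the reading of "TF" as a raw count of the parent title in each summary, and all sample inputs (link counts, the summary sentence, the language codes beyond English and French) are hypothetical.

```python
from fractions import Fraction


def indegree_ratio(parent_inlinks, sub_inlinks):
    """Incoming links of the parent divided by incoming links of the candidate."""
    if sub_inlinks == 0:
        return float("inf")
    return parent_inlinks / sub_inlinks


def max_main_tf_in_sub(parent_titles, sub_summaries):
    """Largest count of the parent title in the candidate's summary, over languages.

    Both arguments are dicts keyed by language code (an assumed input format).
    """
    best = 0
    for lang, summary in sub_summaries.items():
        title = parent_titles.get(lang)
        if title:
            best = max(best, summary.lower().count(title.lower()))
    return best


def main_template_ratio(langs_with_main_template, langs_where_both_exist):
    """Share of languages, among those with both articles, that use the main template."""
    both = set(langs_where_both_exist)
    if not both:
        return Fraction(0)
    return Fraction(len(set(langs_with_main_template) & both), len(both))


# Hypothetical inputs for the {Donald Trump, Foreign policy of Donald Trump} pair.
ratio = main_template_ratio({"en", "fr"}, {"en", "fr", "de", "es", "zh"})
tf = max_main_tf_in_sub(
    {"en": "Donald Trump"},
    {"en": "Foreign policy of the Donald Trump administration covers "
           "positions taken by Donald Trump."},
)
centrality = indegree_ratio(1000, 50)
```

With these inputs, `ratio` equals 2/5 and `tf` equals 2, matching the values on the slides; `centrality` uses made-up link counts and is only there to show the call.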
Accuracies
[Figure: accuracies, with panels labeled Donald Trump, Software Engineering, and Bunbury, AU]
Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://en.wikipedia.org/wiki/Software_engineering | https://en.wikipedia.org/wiki/Bunbury,_Western_Australia
Performance: High-Interest dataset (accuracies)
Real impact of article-as-concept assumption?
Real impact of article-as-concept problem
Discussion: audience-author mismatch
Audience | Author
Communication
Wikipedia
Images from: https://www.youtube.com/watch?v=ikkg4NobV_w | Wikimedia Commons | https://clubrunner.blob.core.windows.net/00000050227/Images/community_icon-300x300.png
Article-as-concept in Wikidata
Images from: https://en.wikipedia.org/wiki/Family_of_Donald_Trump | https://en.wikipedia.org/wiki/Donald_Trump | https://www.wikidata.org/wiki/Property:P1269
Limitations and Future Work
● Covers only sub-article relationships that are explicitly encoded by editors
  ○ In the future, search for implicit relationships
● Performs well on High-Interest articles but struggles on random articles
  ○ In the future, model random articles specifically
● Integrate our system into Wikipedia-based technologies
  ○ E.g. Milne-Witten
Summary
● Identified and problematized the article-as-concept assumption
● Described the first model to address the problem
● Discussed the audience-author mismatch in peer production
● We are releasing the system and the datasets
Images from: https://en.wikipedia.org/wiki/Thank_You | https://en.wikipedia.org/wiki/Q%26A
Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia
Allen Yilun Lin1, Bowen Yu2, Andrew Hall2, Brent Hecht1
Twitter: @bhecht, @cheetah90
Email: [email protected]
1. People, Space, and Algorithms (PSA) Computing Group, Northwestern University
2. GroupLens Research, University of Minnesota, Twin Cities