Problematizing and Addressing the Article-as-Concept Assumption in

(based on our CSCW17’ paper)

Allen Yilun Lin1, Bowen Yu2, Andrew Hall2, Brent Hecht1

1. People, Space, and Algorithms (PSA) Computing Group, Northwestern University

2. GroupLens Research, University of Minnesota, Twin Cities

1 Article-as-Concept Assumption?

2 Article-as-Concept Assumption

Concepts Wikipedia article

Justin Trudeau Justin Trudeau

Angela Merkel Angela Merkel

Theresa May Theresa May

Images from: https://icons8.com/icon/10904/wikipedia 3 What’s the problem?

4 Example 1:

Compare multilingual

5 Images from: https://en.wikipedia.org/wiki/Donald_Trump 6 English v.s. Chinese wiki

Images from: https://en.wikipedia.org/wiki/Donald_Trump 7 English wiki v.s. Chinese wiki

Family information

Images from: https://en.wikipedia.org/wiki/Donald_Trump 8 Family info on : # of words: 559 % of the article: 3.86%

Images from: https://en.wikipedia.org/wiki/Donald_Trump 9 Family info on English Wikipedia: # of words: 559 % of the article: 3.86%

Family info on Chinese Wikipedia: # of characters: 578 % of the article: 4.42%

Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://zh.wikipedia.org/wiki/%E5%94%90%E7%B4%8D%C2%B7%E5%B7%9D%E6%99%AE | 10 Family info on English Wikipedia: # of words: 559 % of the article: 3.86%

Family info on Chinese Wikipedia: # of characters: 578 % of the article: 4.42%

Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://zh.wikipedia.org/wiki/%E5%94%90%E7%B4%8D%C2%B7%E5%B7%9D%E6%99%AE | 11 Family info on English Wikipedia: # of words: 559 % of the article: 3.86%

Family info on Chinese Wikipedia: # of characters: 578 % of the article: 4.42%

Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://zh.wikipedia.org/wiki/%E5%94%90%E7%B4%8D%C2%B7%E5%B7%9D%E6%99%AE | 12 Family info on English Wikipedia: # of words: 559 % of the article: 3.86%

Family info on Chinese Wikipedia: # of characters: 578 % of the article: 4.42%

Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://zh.wikipedia.org/wiki/%E5%94%90%E7%B4%8D%C2%B7%E5%B7%9D%E6%99%AE | 13 Family info on English Wikipedia: # of words: 559 % of the article: 3.86%

Family info on Chinese Wikipedia: # of characters: 578 % of the article: 4.42%

Images from: https://en.wikipedia.org/wiki/Family_of_Donald_Trump 14 Family info on English Wikipedia: # of words: 2340 % of the article: 13.91%

Family info on Chinese Wikipedia: # of characters: 578 % of the article: 4.42%

Images from: https://en.wikipedia.org/wiki/Family_of_Donald_Trump 15 Example 2:

Artificial Intelligence using Wikipedia

16 Images Credit: Brent Hecht 17 Milne-Witten Semantic Relatedness

Figure from Witten, Ian, and David Milne. "An effective, low-cost measure of semantic relatedness obtained from Wikipedia links." AAAI Workshop 200818 Milne-Witten Semantic Relatedness

Figure from Witten, Ian, and David Milne. "An effective, low-cost measure of semantic relatedness obtained from Wikipedia links." AAAI Workshop 200819 Milne-Witten Semantic Relatedness

inlinks outlinks Donald Trump

Figure inspired from Witten, Ian, and David Milne. "An effective, low-cost measure of semantic relatedness obtained from Wikipedia links." AAAI Workshop 2008 20 Milne-Witten Semantic Relatedness

inlinks outlinks Donald Trump Vladimir Putin

Figure inspired from Witten, Ian, and David Milne. "An effective, low-cost measure of semantic relatedness obtained from Wikipedia links." AAAI Workshop 2008 21 Milne-Witten Semantic Relatedness

Rex Tillerson G20 Michael Flynn

Republican KGB

US-Mexican border Kremlin

...... inlinks outlinks Donald Trump Vladimir Putin ...... 2014 Winter Trump organization Olympics

The Apprentice Economy of Russian

ExxonMobil APEC Most powerful people

Figure inspired from Witten, Ian, and David Milne. "An effective, low-cost measure of semantic relatedness obtained from Wikipedia links." AAAI Workshop 2008 22 Images from: https://en.wikipedia.org/wiki/Donald_Trump 23 Images from: https://en.wikipedia.org/wiki/Foreign_policy_of_the_Donald_Trump_administration 24 Milne-Witten Semantic Relatedness

Rex Tillerson G20 Michael Flynn Republican KGB Kremlin US-Mexican border

...... inlinks Donald Trump Vladimir Putin outlinks ......

2014 Winter Trump organization Olympics

Economy of The Apprentice Russian Most powerful APEC ExxonMobil people

Annexation of Crimea 2014 Ukrainian Russian-US by Russia Revolution relations

Figure inspired from Witten, Ian, and David Milne. "An effective, low-cost measure of semantic relatedness obtained from Wikipedia links." AAAI Workshop 2008 25 Milne-Witten Semantic Relatedness

Rex Tillerson G20 Michael Flynn Republican KGB Kremlin US-Mexican border

...... inlinks Donald Trump Vladimir Putin outlinks ......

2014 Winter Trump organization Olympics

Economy of The Apprentice Russian Most powerful APEC ExxonMobil people

Image source: https://www.christart.com/clipart/image/off-target 26 Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia

Concepts Wikipedia article

Donald Trump Donald Trump

27 Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia

Concepts Wikipedia article

Donald Trump Donald Trump

28 Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia

Concepts Wikipedia article Family of Donald Trump

Foreign Policy of Trump Donald Trump Donald Trump Administration

… ...

29 Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia

Concepts Wikipedia article Family of Donald Trump

Foreign Policy of Trump Donald Trump Donald Trump Administration

Parent article … ...

30 Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia

Concepts Wikipedia article Family of Donald Trump

Foreign Policy of Trump Donald Trump Donald Trump Administration

Parent article … ...

Sub-article

31 Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia

Concepts Wikipedia article Family of Donald Trump

Foreign Policy of Trump Donald Trump Donald Trump Administration

Parent article … ...

Sub-article

32 Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia

Concepts Wikipedia article Family of Donald Trump

Foreign Policy of Trump Donald Trump Donald Trump Administration

Parent article … ...

Sub-article

33 Outline

1. Motivation and problem 2. Step 1: Filter sub-article candidates 3. Step 2: Develop training datasets 4. Step 3: Classify sub-article relationship 5. Discussion: audience-author mismatch 6. Summary

34 Step 1: Filter sub-article candidates

... > 5 million articles ...

Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://en.wikipedia.org/static/images/project-logos/enwiki.png 35 Reduce the # of comparisons?

36 Step 1: Filter sub-article candidates

Images from: https://en.wikipedia.org/wiki/Donald_Trump 37 Step 1: Filter sub-article candidates

{{Further|Trump Family}}

{{Main article|Family of Donald Trump}}

Images from: https://en.wikipedia.org/wiki/Donald_Trump 38 Step 1: Filter sub-article candidates

{{Further|Trump Family}}

{{Main article|Family of Donald Trump}}

Images from: https://en.wikipedia.org/wiki/Donald_Trump 39 Step 1: Filter sub-article candidates

Images from: https://en.wikipedia.org/wiki/Donald_Trump 40 Step 1: Filter sub-article candidates

Images from: https://en.wikipedia.org/wiki/Donald_Trump 41 Outline

1. Motivation and problem 2. Step 1: Filter sub-article candidates 3. Step 2: Develop training datasets 4. Step 3: Classify sub-article relationship 5. Discussion: audience-author mismatch 6. Summary

42 Step 2: Develop training datasets

Donald Trump Melania Trump High-Interest La La Land Super Bowl Champions

43 Step 2: Develop training datasets

Donald Trump Melania Trump High-Interest La La Land Super Bowl Champions

Bunbury, Western Jean Random Australia Pasqualini

List of tallest building in

44 Step 2: Develop training datasets

Donald Trump Melania Trump High-Interest La La Land Super Bowl Champions

Bunbury, Western Jean Random Australia Pasqualini

List of tallest building in China

Software engineering Ad-Hoc Geostatistics Feminism 45 Step 2: Develop training datasets

Donald Trump Melania Trump High-Interest La La Land Super Bowl Champions

Bunbury, Western Jean Random Australia Pasqualini

List of tallest building in China

Software engineering Ad-Hoc Geostatistics Feminism 46 Outline

1. Motivation and problem 2. Step 1: Filter sub-article candidates 3. Step 2: Develop training datasets 4. Step 3: Classify sub-article relationship 5. Discussion: audience-author mismatch 6. Summary

47 Step 3: Classify sub-article relationship

Relative importance/centrality

E.g. IndegreeRatio =

Incoming link of parent article Incoming link of sub-article

48 Step 3: Classify sub-article relationship

Relative importance/centrality Implicit Linguistic

E.g. IndegreeRatio = E.g. MaxMainTFinSub =

Incoming link of parent article max{TF of parent article in Incoming link of sub-article sub-article’s summary}

49 Step 3: Classify sub-article relationship Parent article Sub-article candidate

{Donald Trump, Foreign policy of Donald Trump} Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://en.wikipedia.org/wiki/Foreign_policy_of_the_Donald_Trump_administration 50 Step 3: Classify sub-article relationship Parent article Sub-article candidate

{Donald Trump, Foreign policy of Donald Trump} Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://en.wikipedia.org/wiki/Foreign_policy_of_the_Donald_Trump_administration 51 Step 3: Classify sub-article relationship Parent article Sub-article candidate

{Donald Trump, Foreign policy of Donald Trump} = 2 Images from: https://en.wikipedia.org/wiki/Donald_Trump | https://en.wikipedia.org/wiki/Foreign_policy_of_the_Donald_Trump_administration 52 Step 3: Classify sub-article relationship

Relative importance/centrality Implicit Linguistic Explicit Template

E.g. IndegreeRatio = E.g. MaxMainTFinSub = E.g. MainTemplateRatio =

Incoming link of parent article max{TF of parent article in # langs using main template Incoming link of sub-article sub-article’s summary} # langs two articles coexists

53 Step 3: Classify sub-article relationship

Main template used in English and

{Donald Trump, Foreign policy of Donald Trump} Images from: https://en.wikipedia.org/wiki/Donald_Trump 54 Step 3: Classify sub-article relationship

Main template used in English and French Wikipedia Two articles coexist in 5 languages {Donald Trump, Foreign policy of Donald Trump} Images from: https://en.wikipedia.org/wiki/Donald_Trump 55 Step 3: Classify sub-article relationship

Main template used in English and French Wikipedia Two articles coexist in 5 languages {Donald Trump, Foreign policy of Donald Trump} = 2/5 Images from: https://en.wikipedia.org/wiki/Donald_Trump 56 Accuracies

57 Accuracies

Donald Trump

Images from: https://en.wikipedia.org/wiki/Donald_Trump 58 Accuracies

Software Donald Trump Engineering

Images from: https://en.wikipedia.org/wiki/Donald_Trump Images from: https://en.wikipedia.org/wiki/Software_engineering 59 Accuracies

Software Bunbury, AU Donald Trump Engineering

Images from: https://en.wikipedia.org/wiki/Donald_Trump Images from: https://en.wikipedia.org/wiki/Software_engineering 60 Images from: https://en.wikipedia.org/wiki/Bunbury,_Western_Australia Performance: High-Interest dataset

Accuracies

61 Real impact of article-as-concept assumption?

62 Real impact of article-as-concept problem

63 Outline

1. Motivation and problem 2. Step 1: Filter sub-article candidates 3. Step 2: Develop training datasets 4. Step 3: Classify sub-article relationship 5. Discuss: audience-author mismatch 6. Summary

64 Discussion: audience-author mismatch

Audience

Communication

65 Discussion: audience-author mismatch

Audience Author

Communication

Images from: https://www.youtube.com/watch?v=ikkg4NobV_w 66 Discussion: audience-author mismatch

Audience Author

Communication

Wikipedia

Images from: https://www.youtube.com/watch?v=ikkg4NobV_w Images from: Wikimedia Commons 67 Images from: https://clubrunner.blob.core.windows.net/00000050227/Images/community_icon-300x300.png Article-as-concept in Wikidata

68 Images from: https://en.wikipedia.org/wiki/Family_of_Donald_Trump 69 Images from: https://en.wikipedia.org/wiki/Family_of_Donald_Trump 70 Images from: https://en.wikipedia.org/wiki/Donald_Trump Images from: https://www.wikidata.org/wiki/Property:P1269 71 Limitations and Future Works

● Sub-article relationships that are explicitly encoded by editors ○ In future, search for implicit relationship

72 Limitations and Future Works

● Sub-article relationships that are explicitly encoded by editors ○ In future, search for implicit relationship ● Good on High-Interest articles but struggled on random articles ○ In future, specific modelling on random articles

73 Limitations and Future Works

● Sub-article relationships that are explicitly encoded by editors ○ In future, search for implicit relationship ● Good on High-Interest articles but struggled on random articles ○ In future, specific modelling on random articles ● Integrate our system into Wikipedia-based technologies ○ E.g. MilneWitten

74 Summary

● Identified and problematized the article-as-concept assumption

75 Summary

● Identified and problematized the article-as-concept assumption ● Described the first model to address the problem

76 Summary

● Identified and problematized the article-as-concept assumption ● Described the first model to address the problem ● Discussed the audience-author mismatch in peer production

77 Summary

● Identified and problematized the article-as-concept assumption ● Described the first model to address the problem ● Discussed the audience-author mismatch in peer production ● We are releasing the system and the datasets

78 Images from: https://en.wikipedia.org/wiki/Thank_You | https://en.wikipedia.org/wiki/Q%26A Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia

Twitter: @bhecht, @cheetah90 Allen Yilun Lin1, Bowen Yu2, Andrew Hall2, Brent Hecht1 Email: [email protected] 1. People, Space, and Algorithms (PSA) Computing Group, Northwestern University

2. GroupLens Research, University of Minnesota, Twin Cities 79