S o c i e t y O n line, Part 2

Introducing New Features to : Case Studies for Web Science

Mathias Schindler, Wikimedia Deutschland

Denny Vrandecˇic´, Karlsruhe Institute of Technology

ikipedia is a free Web-based encyclopedia produced by hundreds Wof thousands of online contributors.1 Today, Wikipedia offers more Introducing new than 10 million articles and is available in more than 250 languages. For

features to Wikipedia some of these languages, Wikipedia is not just free, but the only encyclopedia

is a complex available. Currently, it is among the 10 feature implementations suggest that the most-visited websites in the world, with better we understand the complex interac- sociotechnical more than 50,000 hits per second during tion between technical progress and social peak times. adaption, the more likely the introduction process. The authors Introducing new technical features to of a new technology to Wikipedia will be Wikipedia is thus a complex sociotechni- successful. compare the Web cal process. Wikipedia runs on the open Based on previous implementations of source MediaWiki wiki engine. Because new features—including the category sys- science process Wiki­pedia’s technical implementation and tem in 2004, parser functions in 2006, and content development are both done in free, flagged revisions in 2008—this article dis- to the previous open projects that include complete change cusses the impact of introducing a further logs, we can trace and analyze the codevel- new feature: creating semantic annotations. introduction of new opment of the technical and social aspects. We show how the interaction between the (MediaWiki’s source code repository is ac- technical features and the community is an features and suggest cessible at http://svn.wikimedia.org, and the instantiation of the Web science process, database dumps of Wikipedia’s complete a model that describes the evolution of the how to use it as a log history are available at http://download. Web and its systems. wikipedia.org.) model for the future Because of Wikipedia’s open nature and Previous Wikipedia Upgrades abundant and freely available project and According to the Web science research pro- development of history data, researchers often use it to study cess, to resolve issues, engineers come up with collaborative content creation. Both avail- new ideas and then implement them. Because Wikipedia. able research and the examples of previous the Web is an inherently social infrastructure,

56 1541-1672/11/$26.00 © 2011 IEEE IEEE INTELLIGENT SYSTEMS Published by the IEEE Computer Society

IS-26-01-Vran.indd 56 25/01/11 12:06 PM an idea’s technical implementation • the linked page itself behaves like category tree extensions, which ren- will necessarily involve social aspects. a normal article and displaying the der a tree of subcategories on a cat- If done well, the idea’s sociotechnical backlinks requires a second click egory page. The introduction and implementation can resolve the origi- on the appropriate link in the tool- application of the categories were nal issue on a micro level. However, bar, and heatedly debated in some language because of the Web’s size and com- • normal articles (such as the one communities, a few of which enforced plexity, such solutions often have un- for Greece) cannot be used as cat- restrictions on the use of categories expected implications on the macro egory pages because all normal with manual effort. The German level. Engineers must in turn analyze links then also appear in the list, Wikipedia community agreed on a these implications to identify new is- not just those that were introduced moratorium on category use to first sues (such as situations that do not for categorizing. (That is, Albania define some guidelines for their ap- conform to certain values), thus start- would be in the list of pages linking plication. The community members ing the process anew. (A detailed de- to Greece because it’s a bordering manually enforced the guidelines and scription of the steps in the process country.) the moratorium. was presented in an earlier work.2) At the time, the power to develop, Tim Berners-Lee has called the com- Wikipedia often avoided such wiki- enable, and disable new features lay plexity step from the micro solution to specific idiosyncrasies. For example, basically with the software develop- the macro solution one of the magics Wikipedia omitted camel-case syn- ers. The communities around the dif- of Web science, and he called the cre- tax in favor of free links, so it is often ferent language had no ativity to get from issues to ideas the described as an atypical wiki. option but to accept these changes second magic of Web science.3 The category system was intro- and deal with the aftermath. Tech- To illustrate this process, we de- duced to MediaWiki in 2004 to solve nical details, such as the fact that re- scribe examples from the history of this problem. Each article could be directs on categories did not mean Wikipedia, tracing and analyzing put into an arbitrary number of cat- that a contributor could use the re- them from the available data about egories that are identified by freely directing category as a synonym for the Wikipedia project and its history. chosen names. Adding a page to the categorization, had enormous social category “Greece” is done by add- impact because they formed the pat- Category System ing [[Category:Greece]] to the terns of interaction with the software. Following Wikipedia’s initial growth, wiki text. Category links themselves (A more thorough analysis of the cat- it became necessary to introduce fea- are not displayed in the article text, egory system is available elsewhere.4) tures to help users explore and navi- but in separate visual elements on gate the site’s content. Early versions page rendering. The category page Parser Functions of the MediaWiki software offered is a distinct page in the newly intro- Developers use templates to include only the manually created links, a duced category namespace and thus the text of another page (usually in backlinks feature (which displays all separate from the article space. Cat- the separate template namespace) pages that link to a specific page), egory pages list all pages belonging where the template is being called. and a full-text search to help users to a specific category. The new cate- This allows for higher consistency discover content in the encyclopedia. gory system resolved all the identified within the wiki, allowing some ele- In many other wikis, the backlinks technical issues of using backlinks for ments to be written only once and feature was used to implement a ru- categorizing articles. reused in several articles. Templates dimentary tagging system—that is, The category system was activated can also feature parameters—that is, once a link was created to a page such in all language editions at the same parametrized template calls will in- as “Greece-related topic” on all rel- time without consulting the respec- sert the parameter values in the re- evant pages, the list of backlinks on tive Wikipedia communities. For placing text (see Figure 1). Such tem- the page “Greece-related topic” was a most of the languages, it was quickly plates are widely used in Wikipedia; list of all pages on that topic. This ap- applied to categorize a majority of the for example, the proach has several disadvantages: existing pages. Soon new issues were offers more than 200,000 templates. discovered, however, such as how to In 2006, some Wikipedians dis- • the links to the topic page are dis- organize categories themselves. This covered that through an intricate and played in the text like any other link, led to further new features such as complicated interplay of templating

Ja nuary/February 2011 www.computer.org/intelligent 57

IS-26-01-Vran.indd 57 25/01/11 12:06 PM S o c i e t y O n line, Part 2

Greece is {{Country | Continent=Europe | Capital=Athens}}

(a) were assigned at least one flagged re- a country in {{{Continent}}}. Its capital is {{{Capital}}}. vision. The remaining work is to keep [[Category:Country]] up with edits by unregistered or new users. Any Wikipedia author who is a (b) member of the editor group can mark a revised article as compliant with a Greece is a country in Europe. Its capital is Athens. set of conditions on content quality. [[Category:Country]] The current condition for such a flag is the “lack of obvious forms of van- (c) dalism,” a deliberately low thresh- old. Unregistered Wikipedia visi- Figure 1. Template use in Wikipedia. (a) The source of a page about Greece calling tors (the vast majority of Wikipedia the country template, (b) the country template source, and (c) the text of the page about Greece after template expansion. visitors) are shown the latest flagged revision instead of the most recent features and cascading style sheets simple writing of extension functions article version. (CSSs), they could create conditional to add arbitrary functionalities, such Both prior to implementation and wiki text—text that was displayed if as geocoding services or widgets. after the initial start of this feature, a template parameter had a specific The developers were clearly react- a lengthy debate was held within value. This included repeated calls ing to the community’s demands, be- the author community, resulting in for templates within templates, which ing forced either to fight the solution a straw poll. The preliminary result bogged down the whole system’s per- to the community’s issue or offer an was a rather strong mandate to keep formance. As a result, the develop- improved technical implementation the feature. Other language projects ers had to either disallow the spread to replace the previous practice and were given the choice to use this ex- of an obviously desired feature by achieve overall better performance. tension (newspapers are following detecting such usage and explicitly the discussion on the English Wiki- disallowing it within the software, Flagged Revisions pedia). Developers and the commu- or offer an efficient alternative. Tim Because anyone can edit the pages nity worked closely to realize this Starling offered such an alternative in a wiki-based encyclopedia, arti- solution. This feature is expected to when he introduced parser functions cle defacing can and does occur. The make a strong impression on casual (wiki text that calls functions imple- first approach to dealing with this Wikipedia readers by ensuring that mented in the underlying software) in problem was to protect articles that no spam or profanity is displayed to April 2006: tended to be defaced often (such as them. Based on the exhaustive de- the main index page and pages for bate, the feature should effectively In response to a campaign by users of George W. Bush and abortion), only confine vandalism, and as of now, the English Wikipedia to harass devel- letting users with administrator the feature seems to be functioning as opers by introducing increasingly ugly rights edit them. Naturally, this ap- intended. and inefficient meta-templates to popu- proach decreases “wikiness.” As a lar pages, I’ve caved in and written a few result, Wikipedia introduced semi- Further Examples reasonably efficient parser functions. protection, making articles editable Select other features can also pro- There are two conditional functions and only by registered users. Preliminary vide worthwhile examples for fur- a mathematical expression function.5 research indicates that semi-protection ther investigation. For example, some greatly reduces risk of vandalism but early extensions might be surprising At first, only conditional text and does not severely affect the collabora- in their scope. In one example, the the computation of simple math- tive process. Hiero­wiki extension let contribu- ematical expressions were imple- Flagged revisions enable a second tors use hieroglyphs within wiki text. mented, but even this increased the layer to store the ad hoc validation of This extension was created and advo- possibilities for wiki editors enor- pages. After a trial phase, this became cated by a small user group, but it was mously. In time, further parser func- a permanent feature in the German swiftly implemented and integrated tions were introduced, finally lead- language Wikipedia, and in Febru- because of the early state of the over- ing to a framework that allowed the ary 2009, all articles in this edition all project. This functionality was

58 www.computer.org/intelligent IEEE INTELLIGENT SYSTEMS

IS-26-01-Vran.indd 58 25/01/11 12:06 PM Greece is a [[Europe]]an country. It’s capital is [[Athens]]. [[Category:Country]] later refactored out of the core with- out affecting the usage in the Wiki­ (a) pedia content. The cite extension, created by core Greece is a [[continent::Europe]]an country. developer Magnus Manske, lets con- It’s capital is [[capital::Athens]]. tributors add citation information to [[Category:Country]] an article and render the citations ap- propriately. The cite extension intro- (b) duces new, XML-inspired syntax to the wiki text, which is inconsistent Greece is a [[Europe]]an country. with the rest of the wiki syntax. This It’s capital is [[Athens]]. [[Category:Country]] inconsistent syntax did not bother [[Continent::Europe]] the community, who quickly adopted [[Capital::Athens]] and extended the possibilities of the new extension. (c) The MediaWiki software is further extending the possibility of personal- Figure 2. Source for the Wikipedia page about Greece. (a) Plain MediaWiki, ized views for logged-in users. Within (b) Semantic MediaWiki, and (c) Semantic MediaWiki using category-style annotations. the wiki, users can add new Java­ Script and CSS commands that will be announced a For example, on the Greece page, the served with the HTML pages. Some partnership with PediaPress, a com- country template can be called using users offer JavaScript snippets that mercial partner that would create and the parameter Capital: {{Country| can be integrated into their own Java­ maintain the software to create the Capital=Athens}}. Script extension page. This frame- output. This demonstrates that mod- The template can now create a work allows for almost arbitrary ad- ules developed outside of the Media­ consistent layout for country pages, aptations of the individual Wikipedia Wiki open source developer commu- for example, by including an info- experience, for example, by adding a nity can be integrated into Wikipedia. box displaying the capital. In Se- more powerful editor or hiding parts mantic MediaWiki, the template can of the page output. The widget exten- Semantic Annotations also annotate the country article sion takes this a step further, letting The Semantic MediaWiki project with the capital, thus moving the ad- less technically inclined users access lets contributors add structured ditional annotation syntax from the some of these personalization fea- data to wiki pages and exploit these article to the template. This shifts tures as well. These extensions decou- both within the wiki through query- the burden from the casual Wiki­ ple the users’ individual experiences ing and browsing and outside of the pedia editor to people who are com- from the community and lets them wiki using the Resource Description fortable with editing the complex individually enhance Wikipedia with Framework (RDF) export of the an- template syntax. This is not a tech- new features. notation structure.6 To create the nically enforced trade-off, however; A recurring topic among devel- structured data, a MediaWiki syntax the annotation can be in either the opers and active users has been the extension is employed to allow for article or template. The community conversion of wiki text or rendered typed links and further annotations. must choose how to enforce anno- HTML output to PDF. Early at- On the page for Greece, for example, tated links. tempts were mostly based on manual the wiki text [[Capital::Athens]] Instead of annotating links, the work (such as the WikiReader proj- would be interpreted as the triple community might choose to put ect), involving OpenOffice templates Greece-Capital-Athens. the annotations at the bottom of the and their PDF output. These projects It was soon noted that these an- page, similar to how category anno- triggered several pieces of software notations make the wiki text harder tations are used today. So, articles that implemented a MediaWiki-to- to read and might confuse and dis- would still include the same annota- PDF converter or a different way to courage novice contributors. To solve tion, but with no link at that loca- create more visually appealing Wiki- this, editors often moved the anno- tion. This decouples the annotations pedia articles for print. In 2008, the tations from the text to templates. from the text (see Figure 2). Figure 2b

Ja nuary/February 2011 www.computer.org/intelligent 59

IS-26-01-Vran.indd 59 25/01/11 12:06 PM S o c i e t y O n line, Part 2

T h e A u thor s

Mathias Schindler is a project manager at Wikimedia Deutschland. His research inter- ests include authority files and bibliometrics. Schindler studied at the University of Frank- descriptive power for both the a pos- furt. Contact him at [email protected]. teriori analysis and a priori design of

Denny Vrandecˇic’ is a postdoctoral researcher at the Karlsruhe Institute of Technology, new Wikipedia features. Our current Germany. His research interests include massive collaboration and the Semantic Web. plan for the eventual implementa- Vrandecic has a PhD from the Karlsruhe Institute of Technology. Contact him at denny. tion of Semantic MediaWiki in Wiki- [email protected]. pedia is to openly discuss the differ- ent alternatives with the community and let them decide on the actual shows the links annotated in the text, but it would eventually lead to implementation. whereas Figure 2c shows all annota- larger changes in how the commu- tions at the end. The decoupling has nity applies templates. a different trade-off: Having the links Acknowledgments annotated in the text increases con- This work was partially funded by the sistency (because changing the capi- e are convinced that we can European Union under EU IST FP7 project Active (www.active-project.eu). tal of Greece from Athens to Sparta Wlearn from analyzing the would change both the text and the introduction of new features in Wiki- annotation), while having the anno- pedia. Specifically, this preliminary References tated links at the end increases read- survey offers the following lessons: 1. P. Ayers, C. Matthews, and B. Yates, ability. Again, this is a social choice, How Wikipedia Works, No Starch and the technology allows for both • The lack of some technical fea- Press, 2008. options, simplifying editing or in- tures can lead to the misuse of 2. J. Hendler et al., “Web Science: An creasing consistency. But this might other existing features to simulate Interdisciplinary Approach to Under- only be the microview of the issue; the desired functionality (as exem- standing the Web,” Comm. ACM, that is, simplified editing could in plified by the use of backlinks for vol. 51, no. 7, 2008, pp. 60–69. turn lead to increased community categories). 3. T. Berners-Lee, “The Two Magics of participation, which might eventually • Common practices within a com- Web Science,” keynote speech, 16th lead to more consistency due to an in- munity might necessitate changes Int’l World Wide Web (WWW) Conf., creased number of eyes. It is difficult in the technical platform to 2007, www.w3.org/2007/Talks/0509- to predict the sociotechnical impact meet technical constraints without www-keynote-tbl. of such a choice. changing the practice of the com- 4. J. Voss, “Collaborative Thesaurus Tag- The DBpedia project offers an munity too much (as seen for parser ging the Wikipedia Way,” tech. report alternative approach to extracting functions). abs/cs/0604036, Computing Research data from Wikipedia.7 DB­pedia • Introducing a new feature might Repository, 2006. extracts data from the existing lead to a huge perceived change in 5. T. Starling, “Mathematical Expressions templates without changing Wiki- the mission and scope of the whole and Conditional Constructs, Now Imple- pedia’s technical component at all. system (a challenge posed by the mented,” 2006; http://lists.wikimedia. Instead, it creates a strong incen- flagged revisions). org/pipermail/wikitech-l/2006-April/ tive for standardizing and con- • In a sociotechnical system, imper- 022341.html. trolling the use of templates with fections in the underlying tech- 6. M. Krötzsch et al., “Semantic Wiki­ much more scrutiny. The success nical platform can be compen- pedia,” J. Web Semantics, vol. 5, 2007, of DBpedia might lead to more sated by the community (as with pp. 251–261. pressure to change how templates the unusual syntax of the cite 7. S. Auer et al., “DBpedia: A Nucleus for are used to improve information extension). a Web of Open Data,” Proc. 6th Int’l extraction, thus pushing for more • A priori design of new features Semantic Web Conf., Springer, 2007, of a social change than the tech- must resolve the technical chal- pp. 722–735. nical change that Semantic Media­ lenges while also considering pos- Wiki requires. This approach also sible social implications. involves bigger technical changes Selected CS articles and columns by implementing Semantic Media­ Based on these findings, we have are also available for free at Wiki on the Wikipedia servers, used the Web science process model’s http://ComputingNow.computer.org.

60 www.computer.org/intelligent IEEE INTELLIGENT SYSTEMS

IS-26-01-Vran.indd 60 25/01/11 12:06 PM IS-26-01-Vran.indd 61 25/01/11 12:06 PM