Introducing new features to

– Case studies for Web Science –

Mathias Schindler Denny Vrandeciˇ c´ Wikimedia Deutschland e.V. Institute AIFB Deutsche Nationalbibliothek Universität Karlsruhe(TH) Germany Germany [email protected] [email protected]

ABSTRACT Wikipedia is a free web-based . It is written in col- laboration by hundred of thousands of contributors [2]. It runs on the Open Source MediaWiki engine. Introducing new features to Wikipedia is not just a technical question, but a complex socio- technical process. Previous introductions of new features (the cat- egory system in 2004, parser functions in 2006, and flagged revi- sions in 2008) are described and analyzed. Based on these experi- ences, the design of a new feature (creating semantic annotations) is given and discussed. The interaction between the technical fea- tures and the community is shown to be an instantiation of the Web Science research cycle, thus testing the cycle as a methodological tool for web science.

1. INTRODUCTION Wikipedia is a free web-based encyclopedia. It is written in col- laboration by hundred of thousands of contributors [2]. Today, Wikipedia is available in more than 250 languages offering more than 10 million articles. For some of these languages, Wikipedia is not only a free encyclopedia, but also the only encyclopedia avail- able in this language. Currently it is among the ten most visited Figure 1: The web science process as presented by Tim web sites in the world, sporting more than 50 thousand hits per Berners-Lee in his Keynote The two Magics of Web Science second in the peak time. at the WWW2007, Banff, Canada. The two magics are creativity and complexity, since both are not well under- The introduction of new technical features to Wikipedia has al- stood yet. Source: http://www.w3.org/2007/Talks/ ways turned out to be a complex socio-technical process. Since 0509-www-keynote-tbl/#(11) both the technical implementation and the content development of Wikipedia are done in free and open projects that both offer their 1 complete change logs, the co-development of the technical and aspects. If done well, the socio-technical implementation of the social aspects can be traced and analyzed. idea may resolve the original issue on a micro-level. But due to the size and the complexity of the web, the solution will have novel and Section 2 offers a short overview of the web science process, before unexpected implications on the macro-level. These in turn have to using it to describe the introduction of new features to Wikipedia be analyzed in order to identify new issues (i.e. situations that do in four case studies: the category system, introduced in 2004 (Sec- not conform to certain values), thus starting the web science pro- tion 3.1); parser functions, introduced in 2006 (Section 3.2); flagged cess anew. A detailed description of the steps of the process is given revisions, introduced in 2008 (Section 3.3); and semantic annota- in [4]. Figure 1 gives an graphical overview of the model [3]. tion, currently being proposed (Section 4). We close the paper in Section 5 with conclusions drawn from the case studies. The authors already identify a number of examples for the web science process. In this paper we describe further examples from 2. WEB SCIENCE PROCESS the , that can be fully traced and analyzed due The web science process [4] is a model to describe the evolution to the availability of the data about the Wikipedia project and its of the web and of systems on the web. In order to resolve issues, history. engineers creatively come up with new ideas. These are then im- plemented. Since the web is an inherently social infrastructure, the technical implementation of an idea will necessarily involve social 3. PREVIOUS EXPERIENCES 1MediaWiki’s SVN is accessible at http://svn.wikmedia. This section describes three case studies of previously introduced org, the database dumps of Wikipedia’s complete log history can features to the Wikipedia system. Section 3.4 notes some other be found at http://download.wikipedia.org examples that may grant further studies. 3.1 Category system Greece is {{Country | Continent=Europe (a) Following the initial growth of Wikipedia, features to improve ex- | Capital=Athens}} ploring and navigating the content became necessary. Early ver- a country in {{{Continent}}}. It’s sions of the MediaWiki offered basically only the man- capital is {{{Capital}}}. ually created links, a backlinks feature, and a full text search in (b) order to discover the content of the encyclopedia. In many other [[Category:Country]] the backlinks feature (that displays all pages that link to a specific page) was used to implement a rudimentary tagging sys- Greece is a country in Europe. It’s capital is Athens. tem: on creating a link to a page like "Greece-related topic" on all (c) pages that discuss topics related to Greece, the list of backlinks on the page "Greece-related topic" will be a list of all pages on that [[Category:Country]] topic. This approach has a number of disadvantages: the links to the topic page will behave like any other link (i.e. display in the Figure 2: Usage of templates. (a) Source of a page about text), the linked page itself will behave like a normal article (i.e. Greece calling the country template; (b) Source of the coun- it will be a normal page, and displaying the backlinks will require try template; (c) Text of the page about Greece after template- a second click on the appropriate link in the toolbar), and normal expansion. articles like "Greece" can not be used as category pages (since all normal links would also appear in the list, not just those that were introduced for categorizing: Albania would be in the list of pages 3.2 Parser functions linking to Greece, since it is would link to it, being a bordering Templates are used to include the text of another page (usually in country). Even though this is the usual procedure in many other the separate Template namespace) at the place were the template wikis, Wikipedia often avoided such wiki-specific idiosyncrasies, is being called. This allows for a higher consistency within the e.g. the omission of camel-case syntax in favor of free links (lead- wiki, since some elements will have to be written only once and ing to Wikipedia often being described as a rather atypical wiki). reused in several articles. Templates may also feature parameters, i.e. parametrized template calls will insert the parameter’s values The category system was introduced to MediaWiki in 2004 in or- in the replacing text. See Figure 2 for an example. Templates are der to solve that problem. Each article could be put into an ar- widely used in Wikipedia, e.g. the offers more bitrary number of categories that are identified by freely chosen than 200,000 templates. names. Adding a page to the category "Greece" is done by adding [[Category:Greece]] to the page. Category links themselves In 2006 some Wikipedians discovered that through an intricate and are not displayed in the article text (but rather in separate visual ele- complicated interplay of templating features and CSS they could ments on the rendering of a page). The category page is a page of its create conditional wiki text, i.e. text that was displayed if a tem- own in the newly introduced category namespace, thus separating plate parameter had a specific value. This included repeated calls of it cleanly from the article space. Category pages were programmed templates within templates, which bogged down the performance in such a way to display the list of all pages categorized with this of the whole system. The developers faced the choice of either dis- category. The new category system resolved all the identified tech- allowing the spreading of an obviously desired feature by detecting nical issues of using backlinks for categorizing articles. such usage and explicitly disallowing it within the software, or offer an efficient alternative. The latter was done by Tim Starling, who The category system was activated in all language editions at the in April 2006 announced the introduction of parser functions,2 i.e. same time without consulting the Wikipedia communities of the wiki text that calls functions implemented in the underlying soft- different languages. On most languages it was quickly applied to ware. categorize a majority of the existing pages. Soon new issues were discovered, namely how to organize categories themselves. This At first, only conditional text and the computation of simple math- lead to further new features like the category tree extensions, which ematical expressions was implemented, but this already increased renders a tree of subcategories on the page of a category etc. The the possibilities for wiki editors enormously. With time further introduction and application of the categories were heatedly de- parser functions were introduced, finally leading to a framework bated in some language communities. Restrictions on the usage of that allowed the simple writing of extension function to add arbi- categories were introduced and enforced solely by the community trary functionalities, like e.g. geo-coding services or widgets. This and with manual effort. On the , the commu- time the developers were clearly reacting to the demand of the com- nity decided on a moratorium on category usage, in order to first munity, being forced either to fight the solution of the issue that the define some guidelines to their application. The moratorium, and community had (i.e. conditional text), or offer an improved techni- the enforcement of the guidelines, were done fully manually by the cal implementation to replace the previous practice and achieve an community members. overall better performance.

Back then the power to develop, enable and disable new features lay basically with the software developers. The communities around 3.3 Flagged revisions the different language had no option but to accept these The most obvious issue with a wiki-based encyclopedia is the fact changes and deal with the aftermath. Technical details – like the that since anyone can edit all pages, defaced articles may be en- fact that redirects on categories did not mean that a contributor 2"In response to a campaign by users of the English Wikipedia to could use the redirecting category as a synonym for categorization harass developers by introducing increasingly ugly and inefficient – had enormous social impact, since they formed the patterns of meta-templates to popular pages, I’ve caved in and written a few interaction with the software. An analysis of the category system is reasonably efficient parser functions. There are two conditional functions and a mathematical expression function." – Tim Starling, given in [6] Wed, Apr 5, 2006 countered. The first approach towards dealing with this problem individual Wikipedia experience, e.g. by adding a more powerful was to protect some articles that tended to be defaced often (e.g. editor, or by hiding parts of the page output. The Widget extension George W. Bush, Abortion, or the Main Page). Protection means takes this to a new step, allowing less technically inclined users that the pages were only editable by users that have administra- to access some of these personalization features as well. This de- tor rights. Naturally, this approach decreases "wikiness". As a re- couples the user’s individual experience from the community and sult, "semi protection" was introduced, making articles editable by allows them to individually enhance Wikipedia with new features. all registered users. Preliminary research indicates that semi pro- tection greatly reduces risk of vandalism, especially of "drive-by" A recurring topic among developers and active users has been the vandalism; it does not, however, severely affect the collaborative conversion of wiki text or rendered HTML output to PDF. Early process. attempts were mostly based on manual work (like the WikiReader project), involving OpenOffice templates and its PDF output. These Flagged revisions enable a second layer to store the ad-hoc valida- projects triggered several pieces of software that implemented a tion of pages. After a trial phase, it is now a permanent feature in Medawiki to PDF converter or a different way to create more visu- the German language Wikipedia. In February 2009, all articles in ally appealing Wikipedia articles for print. In 2008, the Wikime- this language edition were assigned at least one flagged revision. dia Foundation announced a partnership with PediaPress GmbH, The remaining work is to keep up with edits by unregistered or where the commercial partner would create and maintain the soft- new users. Any Wikipedia author who is a member of the "editor" ware to create the output. This shows that also modules developed group can mark a revision of an article to be compliant with a set of outside of the MediaWiki Open Source developer community can conditions on content quality. The current condition for a revision become integrated into Wikipedia. to be flagged is the "lack of obvious forms of vandalism", a delib- erately low threshold. Unregistered visitors to Wikipedia (the vast 4. SEMANTIC ANNOTATIONS majority of the visits to Wikipedia) will be shown the latest flagged The Semantic MediaWiki project [5] allows to add structured data revision instead of the most recent article version. to wiki pages and exploit these both within the wiki through query- ing and browsing, and outside of the wiki by using the RDF export Both prior to implementation and after the initial start of this fea- of the annotation structure. In order to create the structured data, ture, a lengthy debate was held within the author community, re- an extension of the MediaWiki syntax is employed to allow for sulting in a straw poll. The preliminary result was a rather strong typed links and further annotation. On the page for Greece the wiki mandate to keep the feature enabled. Other language projects were text [[Capital::Athens]] would be interpreted as the triple given the choice to use this extension (the discussion on the En- "Greece - Capital - Athens". glish Wikipedia is being followed by newspapers). Developers and the community were tightly working together to realize this solu- It was soon noted that these annotations make the wiki text harder tion. This feature is expected to make a strong impression on the to read and may confuse and discourage novice contributors. To casual Wikipedia readers by ensuring that no spam or profanity is solve this, editors often moved the annotations from the text to displayed to them. Based on the exhaustive debate it is expected templates (see Section 3.2). For example, on the page of Greece that the feature will indeed achieve to effectively confine vandal- the Country template may be called, using the parameter "Capital": ism, and as of now the feature seems to function as intented. {{Country|Capital=Athens}} 3.4 Further examples This section points out to some other features that we consider The template can now create a consistent layout for country pages, worthwhile for further investigation due to their specifics differ- e.g. by including an infobox displaying the capital. In Semantic entiating them from the previous examples. MediaWiki, the template can also take care of annotating the coun- try article with the capital, thus moving the additional annotation Some early extensions may be surprising in their scope: the Hi- syntax from the article to the template. This shifts the burden from erowiki extension allowed to use Hieroglyphs within wiki text. This the casual Wikipedia editor to people who are comfortable with was created and advocated by a small user group, but due to the editing the already rather complex template syntax. Note that this early state of the overall project it was swiftly implemented and in- is not a technically enforced trade-off: the annotation may be either tegrated. Later the functionality needed to be refactored out of the in the article or in the template. It is up to the community to choose core, without affecting the usage in the Wikipedia content. and enforce how annotated links should be used.

The Cite extension created by core developer al- Instead of annotating links, the community may also choose to put lows to add citation information to an article and render them ap- the annotations at the bottom of the page, similar to the way cat- propriately. The Cite extension introduces new, XML-inspired syn- egory annotations are used today. So, articles would still include tax to the wikitext, which is rather inconsistent with the rest of the the above annotation, but no link would be displayed at that place. wiki-syntax. The community seemed to be not too bothered by that This decouples the annotations from the text. An example is given inconsistent syntax, but rather quickly adopted (and extended) the in Figure 3. We see that in option (b) the actual links in the text are possibilities of the new extension. annotated, whereas option (c) has all annotations at the end. The decoupling has a different trade-off: having the links annotated in The MediaWiki software is further extending the possibilities to en- the text increases consistency (since changing the capital of Greece able personalized for logged in users. Within the wiki, users may from Athens to Sparta would change both the text and the anno- add new Javascript and CSS commands that will be included in the tation); having the annotated links at the end increases readability. served HTML pages for the user. Some users offer Javascript snip- Again, this is a social choice – the technology allows for both op- pets that can be integrated into the user’s own Javascript extension tions: simplifying editing or increasing consistency. But this may page. This framework allows for almost arbitrary adaptations of the only be the micro-view on the issue: simplified editing could in Greece is a [[Europe]]an country. It’s changes in the technical platform in order to meet technical capital is [[Athens]]. (a) constraints without changing the practice of the community too much (as seen for parser functions, see Section 3.2) [[Category:Country]] • The introduction of a feature may lead to a huge perceived Greece is a [[Continent::Europe]]an change in the mission and scope of the whole system (a chal- country. It’s capital is lenge posed by the flagged revisions, see Section 3.3) (b) [[Capital::Athens]]. • In a socio-technical system, imperfections in the underlying [[Category:Country]] technical platform can be compensated by the community (rf. Greece is a [[Europe]]an country. It’s to the unusual syntax of the Cite extension, Section 3.4) capital is [[Athens]]. • A priori design of new features must not only regard and (c) resolve the technical challenges, but also consider possible [[Category:Country]] social implications. Our current plan for the eventual imple- [[Continent::Europe]] mentation of Semantic MediaWiki in Wikipedia is to openly [[Capital::Athens]] discuss the different alternatives with the community, and let them decide on the actual implementation. Figure 3: Source of a page about Greece in (a) plain Media- Wiki, (b) Semantic MediaWiki, (c) Semantic MediaWiki using category-style annotations. This paper has argued that Wikipedia can be used to evaluate and validate the proposed web science process model. We have used the descriptive power of the model for both the a posteriori analysis turn lead to increased community participation which may eventu- and the a priori design of new features for Wikipedia. ally lead to more consistency due to an increased number of eye- balls. It is safe to say that predicting the socio-technical impact of Acknowledgments such a choice is rather hard. Work presented in this paper has been partially funded by the Eu- ropean Union under EU IST FP7 project ACTIVE.3 An alternative approach to extract the data from Wikipedia is of- fered by the DBpedia project [1]. DBpedia extracts the data from the existing templates, not changing the technical component of 6. REFERENCES Wikipedia at all. Instead it creates a strong incentive for standardiz- [1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and ing and controlling the usage of templates with much more scrutiny. Z. Ives. DBpedia: A nucleus for a web of open data. 6th Int. With DBpedia becoming more successful this may lead to more Semantic Web Conference, Busan, Korea, Nov, 2007. pressure to change the way templates are used in order to achieve [2] P. Ayers, C. Matthews, and B. Yates. How Wikipedia works. a better information extraction – thus pushing for a much stronger No Starch Press, San Francisco, CA, Oct. 2008. change in the social aspect than the bigger technical change of Se- [3] T. Berners-Lee. The two magics of web science. Keynote at mantic MediaWiki does. This again is a trade-off: bigger technical the WWW2007 conference in Banff, Canada. changes by implementing Semantic MediaWiki on the Wikipedia [4] J. Hendler, N. Shadbolt, W. Hall, T. Berners-Lee, and servers, or eventually leading to larger changes in the way the com- D. Weitzner. Web science: an interdisciplinary approach to munity applies templates. understanding the web. Commun. ACM, 51(7):60–69, 2008. [5] M. Krötzsch, D. Vrandeciˇ c,´ M. Völkel, H. Haller, and 5. OUTLOOK AND CONCLUSIONS R. Studer. Semantic Wikipedia. Journal of Web Semantics, Due to its open and the abundant and freely available data 5:251–261, Sept. 2007. about the project and its history, Wikipedia is frequently used as an [6] J. Voss. Collaborative thesaurus tagging the Wikipedia way. object of study. However, both the conducted research and the ex- The Computing Research Repository, abs/cs/0604036, 2006. amples of feature implementations suggest that the better the com- plex interaction of technical progress and social adaption is under- stood, the more likely a new technology can be successfully intro- duced to Wikipedia. This interaction may be regarded as the third magic of Web science, adding to the two magics already described by Tim Berners-Lee (see Fig. 1).

We are convinced that we can learn a number of lessons from an- alyzing the introduction of new features in Wikipedia. The pre- liminary survey in this paper indicates for example the following lessons:

• The lack of some technical features may lead to the mis- use of other existing features in order to simulate the desired functionality (as exemplified by the use of backlinks for cat- egories, Section 3.1) • Common usage practices of the community may necessitate 3http://www.active-project.eu)