Revision Toolkit: Efficiently Accessing Wikipedia’s Edit History

Oliver Ferschke, Torsten Zesch, and Iryna Gurevych
Ubiquitous Knowledge Processing Lab
Computer Science Department, Technische Universität Darmstadt
Hochschulstrasse 10, D-64289 Darmstadt, Germany
http://www.ukp.tu-darmstadt.de

Abstract

We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia's edit history.

1 Introduction

In the last decade, the free encyclopedia Wikipedia has become one of the most valuable and comprehensive knowledge sources in Natural Language Processing. It has been used for numerous NLP tasks, e.g. word sense disambiguation, semantic relatedness measures, or text categorization. A detailed survey on usages of Wikipedia in NLP can be found in (Medelyan et al., 2009).

The majority of Wikipedia-based NLP algorithms works on single snapshots of Wikipedia, which are published by the Wikimedia Foundation as XML dumps at irregular intervals.1 Such a snapshot only represents the state of Wikipedia at a certain fixed point in time, while Wikipedia actually is a dynamic resource that is constantly changed by its millions of editors. This rapid change is bound to have an influence on the performance of NLP algorithms using Wikipedia data. However, the exact consequences are largely unknown, as only very few papers have systematically analyzed this influence (Zesch and Gurevych, 2010). This is mainly due to older snapshots becoming unavailable, as there is no official backup. As a consequence, older experimental results cannot be reproduced anymore.

In this paper, we present a toolkit that solves both issues by reconstructing a certain past state of Wikipedia from its edit history, which is offered by the Wikimedia Foundation in form of a dump. Section 3 gives a more detailed overview of the reconstruction process.

Besides reconstructing past states of Wikipedia, the revision history data also constitutes a novel knowledge source for NLP algorithms. The sequence of article edits can be used as training data for data-driven NLP algorithms, such as vandalism detection (Chin et al., 2010), text summarization (Nelken and Yamangil, 2008), sentence compression (Yamangil and Nelken, 2008), unsupervised extraction of lexical simplifications (Yatskar et al., 2010), the expansion of textual entailment corpora (Zanzotto and Pennacchiotti, 2010), or assessing the trustworthiness of Wikipedia articles (Zeng et al., 2006).

1 http://download.wikimedia.org/


However, efficient access to this new resource has been limited by the immense size of the data. The revisions for all articles in the current English Wikipedia sum up to over 5 terabytes of text. Consequently, most of the above mentioned previous work only regarded small samples of the available data. However, using more data usually leads to better results, or, as Church and Mercer (1993) put it, "more data are better data". Thus, in Section 4, we present a tool to efficiently access Wikipedia's edit history. It provides an easy-to-use API for programmatically accessing the revision data and reduces the required storage space to less than 2% of its original size. Both tools are publicly available on Google Code (http://jwpl.googlecode.com) as open source under the LGPL v3.

2 Related Work

To our knowledge, there are currently only two alternatives to programmatically access Wikipedia's revision history.

One possibility is to manually parse the original XML revision dump. However, due to the huge size of these dumps, efficient random access is infeasible with this approach.

Another possibility is using the MediaWiki API2, a web service which directly accesses live data from the Wikipedia servers. However, using a web service entails that the desired revision of every single article has to be requested from the service, transferred over the Internet and then stored locally in an appropriate format. Access to all revisions of all Wikipedia articles for a large-scale analysis is infeasible with this method because it is strongly constricted by the data transfer speed over the Internet. Even though it is possible to bypass this bottleneck by setting up a local Wikipedia mirror, the MediaWiki API can only provide full text revisions, which results in very large amounts of data to be transferred.

Better suited for tasks of this kind are APIs that utilize databases for storing and accessing the Wikipedia data. However, current database-driven Wikipedia APIs do not support access to article revisions. That is why we decided to extend an established API with the ability to efficiently access Wikipedia's edit history. Two established Wikipedia APIs have been considered for this purpose.

Wikipedia Miner3 (Milne and Witten, 2009) is an open source toolkit which provides access to Wikipedia with the help of a preprocessed database. It represents articles, categories and redirects as Java classes and provides access to the article content either as MediaWiki markup or as plain text. The toolkit mainly focuses on Wikipedia's structure, the contained concepts, and semantic relations, but it makes little use of the textual content within the articles. Even though it was developed to work language independently, it focuses mainly on the English Wikipedia.

Another open source API for accessing Wikipedia data from a preprocessed database is JWPL4 (Zesch et al., 2008). Like Wikipedia Miner, it also represents the content and structure of Wikipedia as Java objects. In addition to that, JWPL contains a MediaWiki markup parser to further analyze the article contents to make available fine-grained information like e.g. article sections, info-boxes, or first paragraphs. Furthermore, it was explicitly designed to work with all language versions of Wikipedia.

We have chosen to extend JWPL with our revision toolkit, as it has better support for accessing article contents, natively supports multiple languages, and seems to have a larger and more active developer community. In the following section, we present the parts of the toolkit which reconstruct past states of Wikipedia, while in Section 4, we describe tools allowing to efficiently access Wikipedia's edit history.

3 Reconstructing Past States of Wikipedia

Access to arbitrary past states of Wikipedia is required to (i) evaluate the performance of Wikipedia-based NLP algorithms over time, and (ii) to reproduce Wikipedia-based research results. For this reason, we have developed a tool called TimeMachine, which addresses both of these issues by making use of the revision dump provided by the Wikimedia Foundation. By iterating over all articles in the revision dump and extracting the desired revision of each article, it is possible to recover the state of Wikipedia at an earlier point in time.

2 http://www.mediawiki.org/wiki/API
3 http://wikipedia-miner.sourceforge.net
4 http://jwpl.googlecode.com

Property | Description | Example Value
language | The Wikipedia language version | english
mainCategory | Title of the main category of the Wikipedia language version used | Categories
disambiguationCategory | Title of the disambiguation category of the Wikipedia language version used | Disambiguation
fromTimestamp | Timestamp of the first snapshot to be extracted | 20090101130000
toTimestamp | Timestamp of the last snapshot to be extracted | 20091231130000
each | Interval between snapshots in days | 30
removeInputFilesAfterProcessing | Remove source files [true/false] | false
metaHistoryFile | Path to the revision dump | PATH/pages-meta-history..bz2
pageLinksFile | Path to the page-to-page link records | PATH/pagelinks.sql.gz
categoryLinksFile | Path to the category membership records | PATH/categorylinks.sql.gz
outputDirectory | Output directory | PATH/outdir/

Table 1: Configuration of the TimeMachine
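For illustration, a configuration using the example values from Table 1 might look roughly as follows. This is only a sketch: a simple key=value syntax is assumed here and may differ from the actual format of the TimeMachine configuration file; the property names and values are those listed in Table 1.

language=english
mainCategory=Categories
disambiguationCategory=Disambiguation
fromTimestamp=20090101130000
toTimestamp=20091231130000
each=30
removeInputFilesAfterProcessing=false
metaHistoryFile=PATH/pages-meta-history..bz2
pageLinksFile=PATH/pagelinks.sql.gz
categoryLinksFile=PATH/categorylinks.sql.gz
outputDirectory=PATH/outdir/

Setting fromTimestamp and toTimestamp to the same value would instead produce a single snapshot, as described below.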

The TimeMachine is controlled by a single configuration file, which allows (i) to restore individual Wikipedia snapshots or (ii) to generate whole snapshot series. Table 1 gives an overview of the configuration parameters. The first three properties set the environment for the specific language version of Wikipedia. The two timestamps define the start and end time of the snapshot series, while the interval between the snapshots in the series is set by the parameter each. In the example, the TimeMachine recovers 13 snapshots between Jan 01, 2009 at 01.00 p.m. and Dec 31, 2009 at 01.00 p.m. at an interval of 30 days. In order to recover a single snapshot, the two timestamps simply have to be set to the same value, while the parameter 'each' has no effect. The option removeInputFilesAfterProcessing specifies whether to delete the source files after processing has finished. The final four properties define the paths to the source files and the output directory.

The output of the TimeMachine is a set of eleven text files for each snapshot, which can directly be imported into an empty JWPL database. It can be accessed with the JWPL API in the same way as snapshots created using JWPL itself.

Issue of Deleted Articles  The past snapshot of Wikipedia created by our toolkit is identical to the state of Wikipedia at that time with the exception of articles that have been deleted meanwhile. Articles might be deleted only by administrators if they are subject to copyright violations, vandalism, spam or other conditions that violate Wikipedia policies. As a consequence, they are removed from the public view along with all their revision information, which makes it impossible to recover them from any future publicly available dump.5 Even though about five thousand pages are deleted every day, only a small percentage of those pages actually corresponds to meaningful articles. Most of the affected pages are newly created duplicates of already existing articles or spam articles.

5 http://en.wikipedia.org/wiki/Wikipedia:DEL

4 Efficient Access to Revisions

Even though article revisions are available from the official Wikipedia revision dumps, accessing this information on a large scale is still a difficult task. This is due to two main problems. First, the revision dump contains all revisions as full text. This results in a massive amount of data and makes structured access very hard. Second, there is no efficient API available so far for accessing article revisions on a large scale.

Thus, we have developed a tool called RevisionMachine, which solves these issues. First, we describe our solution to the storage problem. Second, we present several use cases of the RevisionMachine, and show how the API simplifies experimental setups.

4.1 Revision Storage

As each revision of a Wikipedia article stores the full article text, the revision history obviously contains a lot of redundant data. The RevisionMachine makes use of this fact and utilizes a dedicated storage format which stores a revision only by means of the changes that have been made to the previous revision. For this purpose, we have tested existing libraries, like Javaxdelta6 or java-diff7, which calculate the differences between two texts. However, both their runtime and the size of the resulting output were not feasible for the given size of the data. Therefore, we have developed our own diff algorithm, which is based on a longest common substring search and constitutes the foundation for our revision storage format.

The processing of two subsequent revisions can be divided into four steps:

• First, the RevisionMachine searches for all common substrings with a user-defined minimal length.

• Then, the revisions are divided into blocks of equal length. Corresponding blocks of both revisions are then compared. If a block is contained in one of the common substrings, it can be marked as unchanged. Otherwise, we have to categorize the kind of change that occurred in this block. We differentiate between five possible actions: Insert, Delete, Replace, Cut and Paste8. This information is stored in each block and is later on used to encode the revision.

• In the next step, the current revision is represented by means of a sequence of actions performed on the previous revision. For example, in the adjacent revision pair
  r1: This is the very first sentence!
  r2: This is the second sentence
r2 can be encoded as
  REPLACE 12 10 'second'
  DELETE 31 1

• Finally, the string representation of this action sequence is compressed and stored in the database.

With this approach, we reduce the demand for disk space for a recent English Wikipedia dump containing all article revisions from 5470 GB to only 96 GB, i.e. by 98%. The compressed data is stored in a MySQL database, which provides sophisticated indexing mechanisms for high-performance access to the data.

Obviously, storing only the changes instead of the full text of each revision trades speed for space. Accessing a certain revision now requires reconstructing the text of the revision from a list of changes. As articles often have several thousand revisions, this might take too long. Thus, in order to speed up the recovery of the revision text, every n-th revision is stored as a full revision. A low value of n decreases the time needed to access a certain revision, but increases the demand for storage space. We have found n = 1000 to yield a good trade-off9. This parameter, among a few other possibilities to fine-tune the process, can be set in a graphical user interface provided with the RevisionMachine.

6 http://javaxdelta.sourceforge.net/
7 http://www.incava.org/projects/java/java-diff
8 Cut and Paste operations always occur pairwise. In addition to the other operations, they can make use of an additional temporary storage register to save the text that is being moved.
9 If hard disk space is no limiting factor, the parameter can be set to 1 to avoid the compression of the revisions and maximize the performance.
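To make the action-sequence encoding above more concrete, the following sketch reconstructs r2 from r1 by applying the REPLACE and DELETE actions of the example. It is not the actual RevisionMachine implementation; the class and method names are made up for illustration, and it assumes that offsets refer to character positions in the previous revision.

import java.util.Arrays;
import java.util.List;

// Minimal illustration of reconstructing a revision from its predecessor
// plus a list of edit actions (sorted by offset, non-overlapping).
public class DiffSketch {

    // A single edit action: replace 'length' characters at 'offset' in the
    // previous revision with 'text' (an empty text models a DELETE).
    static class Action {
        final int offset;
        final int length;
        final String text;

        Action(int offset, int length, String text) {
            this.offset = offset;
            this.length = length;
            this.text = text;
        }
    }

    static String apply(String previous, List<Action> actions) {
        StringBuilder result = new StringBuilder();
        int pos = 0;
        for (Action a : actions) {
            result.append(previous, pos, a.offset);   // copy unchanged part
            result.append(a.text);                    // insert replacement text
            pos = a.offset + a.length;                // skip the replaced part
        }
        result.append(previous.substring(pos));       // copy trailing unchanged part
        return result.toString();
    }

    public static void main(String[] args) {
        String r1 = "This is the very first sentence!";
        // REPLACE 12 10 'second' and DELETE 31 1, as in the example above
        List<Action> actions = Arrays.asList(
                new Action(12, 10, "second"),
                new Action(31, 1, ""));
        System.out.println(apply(r1, actions));       // This is the second sentence
    }
}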
4.2 Revision Access

After the converted revisions have been stored in the revision database, it can either be used stand-alone or combined with the JWPL data and accessed via the standard JWPL API. The latter option makes it possible to combine the possibilities of the RevisionMachine with other components like the JWPL parser for the MediaWiki syntax.

In order to set up the RevisionMachine, it is only necessary to provide the configuration details for the database connection (see Listing 1). Upon first access, the database user has to have write permission on the database, as indexes have to be created. For later use, read permission is sufficient. Access to the RevisionMachine is achieved via two API objects. The RevisionIterator allows to iterate over all revisions in Wikipedia. The RevisionAPI grants access to the revisions of individual articles. In addition to that, the Wikipedia object provides access to JWPL functionalities.

// Set up database connection
DatabaseConfiguration db = new DatabaseConfiguration();
db.setDatabase("dbname");
db.setHost("hostname");
db.setUser("username");
db.setPassword("pwd");
db.setLanguage(Language.english);
// Create API objects
Wikipedia wiki = WikiConnectionUtils.getWikipediaConnection(db);
RevisionIterator revIt = new RevisionIterator(db);
RevisionApi revApi = new RevisionApi(db);

Listing 1: Setting up the RevisionMachine

In the following, we describe three use cases of the RevisionMachine API, which demonstrate how it is easily integrated into experimental setups.

Processing all article revisions in Wikipedia  The first use case focuses on the utilization of the complete set of article revisions in a Wikipedia snapshot. Listing 2 shows how to iterate over all revisions. Thereby, the iterator ensures that successive revisions always correspond to adjacent revisions of a single article in chronological order. The start of a new article can easily be detected by checking the timestamp and the article id. This approach is especially useful for applications in statistical natural language processing, where large amounts of training data are a vital asset.

Processing revisions of individual articles  The second use case shows how the RevisionMachine can be used to access the edit history of a specific article. The example in Listing 3 illustrates how all revisions for the article Automobile can be retrieved by first performing a page query with the JWPL API and then retrieving all revision timestamps for this page, which can finally be used to access the revision objects.

Accessing the meta data of a revision  The third use case illustrates the access to the meta data of individual revisions. The meta data includes the name or IP of the contributor, the additional user comment for the revision and a flag that identifies a revision as minor or major. Listing 4 shows how the number of edits and unique contributors can be used to indicate the level of edit activity for an article.

5 Conclusions

In this paper, we presented an open-source toolkit which extends JWPL, an API for accessing Wikipedia, with the ability to reconstruct past states of Wikipedia, and to efficiently access the edit history of Wikipedia articles.

Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia, and is also a requirement for the creation of time-based series of Wikipedia snapshots and for assessing the influence of Wikipedia growth on NLP algorithms. Furthermore, Wikipedia's edit history has been shown to be a valuable knowledge source for NLP, which is hard to access because of the lack of efficient tools for managing the huge amount of revision data. By utilizing a dedicated storage format for the revisions, our toolkit massively decreases the amount of data to be stored. At the same time, it provides an easy-to-use interface to access the revision data.

We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia's edit history. The toolkit will be made available as part of JWPL, and can be obtained from the project's website at Google Code (http://jwpl.googlecode.com).

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806, and by the Hessian research excellence program "Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz" (LOEWE) as part of the research center "Digital Humanities". We would also like to thank Simon Kulessa for designing and implementing the foundations of the RevisionMachine.

// Iterate over all revisions of all articles
while (revIt.hasNext()) {
    Revision rev = revIt.next();
    rev.getTimestamp();
    rev.getArticleID();
    // process revision...
}

Listing 2: Iteration over all revisions of all articles
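The article-boundary check mentioned above could be spelled out as in the following fragment. This is only a sketch building on Listing 2; it assumes that getArticleID() returns an int identifier, as the listing suggests.

// Sketch: detecting the start of a new article while iterating over all revisions.
// Assumes revisions of one article arrive consecutively and in chronological order.
int currentArticle = -1;
while (revIt.hasNext()) {
    Revision rev = revIt.next();
    if (rev.getArticleID() != currentArticle) {
        // a new article starts here
        currentArticle = rev.getArticleID();
    }
    // process revision...
}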

// Get article with title "Automobile"
Page article = wiki.getPage("Automobile");
int id = article.getPageId();
// Get all revisions for the article
Collection<Timestamp> revisionTimeStamps = revApi.getRevisionTimestamps(id);
for (Timestamp t : revisionTimeStamps) {
    Revision rev = revApi.getRevision(id, t);
    // process revision...
}

Listing 3: Accessing the revisions of a specific article
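Building on Listing 3, the loop can, for instance, be restricted to a time span of interest. The following sketch assumes the timestamps are java.sql.Timestamp objects (whose valueOf, before and after methods are standard Java), which is not spelled out in the listing itself.

// Sketch: process only the revisions of the article made during 2009
Timestamp from = Timestamp.valueOf("2009-01-01 00:00:00");
Timestamp to = Timestamp.valueOf("2009-12-31 23:59:59");
for (Timestamp t : revApi.getRevisionTimestamps(id)) {
    if (!t.before(from) && !t.after(to)) {
        Revision rev = revApi.getRevision(id, t);
        // process a 2009 revision...
    }
}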

// Meta data provided by the RevisionAPI
StringBuffer s = new StringBuffer();
s.append("The article has " + revApi.getNumberOfRevisions(pageId) + " revisions.\n");
s.append("It has " + revApi.getNumberOfUniqueContributors(pageId) + " unique contributors.\n");
s.append(revApi.getNumberOfUniqueContributors(pageId, true) + " are registered users.\n");
// Meta data provided by the Revision object
s.append((rev.isMinor() ? "Minor" : "Major") + " revision by: " + rev.getContributorID());
s.append("\nComment: " + rev.getComment());

Listing 4: Accessing the meta data of a revision
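As a further illustration, the next sketch combines the calls from Listings 3 and 4 to give a rough indicator of edit activity by counting minor and major revisions of a page. Only methods already shown above are used; the int page id is assumed to be obtained as in Listing 3.

// Sketch: counting minor and major revisions of a page as a simple activity indicator
int minor = 0;
int major = 0;
for (Timestamp t : revApi.getRevisionTimestamps(pageId)) {
    Revision rev = revApi.getRevision(pageId, t);
    if (rev.isMinor()) {
        minor++;
    } else {
        major++;
    }
}
System.out.println(minor + " minor and " + major + " major revisions");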

References

Si-Chi Chin, W. Nick Street, Padmini Srinivasan, and David Eichmann. 2010. Detecting wikipedia vandalism with active learning and statistical language models. In Proceedings of the 4th Workshop on Information Credibility, WICOW '10, pages 3–10.

Kenneth W. Church and Robert L. Mercer. 1993. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1):1–24.

Olena Medelyan, David Milne, Catherine Legg, and Ian H. Witten. 2009. Mining meaning from wikipedia. Int. J. Hum.-Comput. Stud., 67:716–754, September.

D. Milne and I. H. Witten. 2009. An open-source toolkit for mining Wikipedia. In Proc. New Zealand Computer Science Research Student Conf., volume 9.

Rani Nelken and Elif Yamangil. 2008. Mining wikipedia's article revision history for training computational linguistics algorithms. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy (WikiAI), WikiAI08.

Elif Yamangil and Rani Nelken. 2008. Mining wikipedia revision histories for improving sentence compression. In Proceedings of ACL-08: HLT, Short Papers, pages 137–140, Columbus, Ohio, June. Association for Computational Linguistics.

Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2010. For the sake of simplicity: unsupervised extraction of lexical simplifications from wikipedia. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 365–368.

Fabio Massimo Zanzotto and Marco Pennacchiotti. 2010. Expanding textual entailment corpora from wikipedia using co-training. In Proceedings of the COLING-Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources.

Honglei Zeng, Maher Alhossaini, Li Ding, Richard Fikes, and Deborah L. McGuinness. 2006. Computing trust from revision history. In Proceedings of the 2006 International Conference on Privacy, Security and Trust.

Torsten Zesch and Iryna Gurevych. 2010. The more the better? Assessing the influence of wikipedia's growth on semantic relatedness measures. In Proceedings of the Conference on Language Resources and Evaluation (LREC), Valletta, Malta.

Torsten Zesch, Christof Mueller, and Iryna Gurevych. 2008. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In Proceedings of the Conference on Language Resources and Evaluation (LREC).
