A Free, Balanced, Multilayer English Web Corpus

Total Pages: 16

File Type: pdf, Size: 1020 KB

A Free, Balanced, Multilayer English Web Corpus

Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 5267–5275, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

AMALGUM – A Free, Balanced, Multilayer English Web Corpus

Luke Gessler, Siyao Peng, Yang Liu, Yilun Zhu, Shabnam Behzad, Amir Zeldes
Corpling Lab, Georgetown University
{lg876, sp1184, yl879, yz565, sb1796, [email protected]

Abstract

We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources, the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a "better than NLP" benchmark and evaluate the accuracy of the resulting resource.

Keywords: web corpora, annotation, discourse, coreference, NER, RST

1. Introduction

Corpus development for academic research and practical NLP applications often meets with a dilemma: development of high-quality data sets with rich annotation schemes (e.g. MASC (Ide et al., 2010)) requires substantial manual curation and analysis effort and does not scale up easily; on the other hand, easy "opportunistic" data collection (Kennedy, 1998), where as much available material is collected from the Internet as possible, usually means little or no manual annotation or inspection of corpus contents for genre balance or other distributional properties, and at most the addition of relatively reliable automatic annotations such as part-of-speech tags and lemmatization (e.g. the WaCky corpora (Baroni et al., 2009)), or sometimes out-of-the-box automatic dependency parsing (see the COW corpora (Schäfer, 2015)). This divide leads to a number of problems:

1. The study of systematic genre variation is limited to surface textual properties and cannot use complex annotations, unless it is confined to very limited data
2. For large resources, NLP quality is low, since opportunistically acquired content often deviates substantially from the language found in the training data used by out-of-the-box tools
3. Large open data sets for training NLP tools on complex phenomena (e.g. coreference resolution, discourse parsing) are unavailable¹

¹ The largest datasets for these tasks (e.g. RST Discourse Treebank (Carlson et al., 2003), OntoNotes (Weischedel et al., 2012)) contain between 200K and 1.6 million words and are only available for purchase from the LDC.

In this paper, we attempt to walk a middle path, combining some of the best features of corpus data harvested from the web – size, open licenses, lexical diversity – and data collected using a carefully defined sampling frame (cf. Hundt (2008)), which allows for more interpretable inferences as well as the application of tools specially prepared for the target domain(s). In addition, we focus on making discourse-level annotations available, including complete nested entity recognition, coreference resolution, and discourse parsing, while striving for accuracy that is as high as possible. Some of the applications we envision for this corpus include:

• Corpus linguistic studies on genre variation and discourse using detailed annotation layers
• Active learning – by permuting different subsets of the automatically annotated corpus as training data, we can evaluate performance on gold-standard development data and automatically select those documents whose annotations improve performance on a given task
• Cross-tagger validation – we can select the subset of sentences which are tagged or parsed identically by multiple taggers/models trained on different datasets and assume that these are highly likely to be correct (cf. Derczynski et al. (2012)), or use other similar approaches such as tri-training (cf. Zhou and Li (2005)); a minimal sketch follows this list
• Human-in-the-loop/crowdsourcing – outputting sentences with low NLP tool confidence for expert correction or crowd-worker annotations, or using a hybrid annotation setup in which several models are used, and the tags that the models agree on are only reviewed by annotators while the rest are annotated from scratch (Berzak et al., 2016)
• Pre-training – the error-prone but large automatically produced corpus can be used for pre-training a model which can later be fine-tuned on a smaller gold dataset
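To make the cross-tagger validation idea concrete, the selection step can be sketched in a few lines of Python. This is a minimal sketch: the tagger objects and their .tag() method are illustrative assumptions, not tools shipped with AMALGUM.

```python
# Keep only sentences for which two POS taggers trained on different
# datasets agree on every token. The tagger objects and their .tag()
# method are hypothetical stand-ins for any two independently trained
# taggers.

def taggings_agree(tokens, tagger_a, tagger_b):
    """True if both taggers assign the same tag to every token."""
    tags_a = tagger_a.tag(tokens)  # assumed to return e.g. [("The", "DT"), ("dog", "NN")]
    tags_b = tagger_b.tag(tokens)
    return all(a == b for a, b in zip(tags_a, tags_b))

def consensus_subset(sentences, tagger_a, tagger_b):
    """Select the subset of sentences tagged identically by both taggers."""
    return [s for s in sentences if taggings_agree(s, tagger_a, tagger_b)]
```

The same agreement filter extends naturally to parses by comparing head/label pairs instead of POS tags.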
Testing these applications is outside the scope of the current paper, though several papers suggest that large-scale 'silver quality' data can be better for training tools than smaller gold-standard resources for POS tagging (Schulz and Ketschik, 2019), parsing (Toutanova, 2005) using information from parse banks (Charniak et al., 2000), discourse relation classification (Marcu and Echihabi, 2002), and other tasks. Moreover, due to its genre diversity, this corpus can be useful for genre studies as well as automatic text type identification in Web corpora (e.g. Asheghi et al. (2014), Rehm et al. (2008), and Dalan and Sharoff (2016)). We plan to pursue several of these approaches in future work (see Section 5).

Our main contributions in this paper consist in (1) presenting a genre-balanced, richly annotated web corpus which is more than double (and in some cases 20 times) the size of the largest similar gold-annotated resources; (2) evaluating tailored NLP tools which have been trained specifically for this task and comparing them to off-the-shelf tools; and (3) demonstrating the added value of rich annotations and ensembling of multiple concurrent tools, which improve the accuracy of each task by incorporating information from other tasks and tool outputs.

In the remainder of this paper, we will describe and evaluate the quality of the dataset: In the next section, we present the data collected for our corpus, AMALGUM (A Machine-Annotated Lookalike of GUM), which covers eight English web genres for which we can obtain gold-standard training data: news, interviews, travel guides, how-to guides, academic papers, fiction, biographies, and forum discussions. Section 2 describes the data and how it was collected, Section 3 presents annotations added to the corpus and the process of using information across annotation layers to improve NLP tool output, Section 4 evaluates the quality of the resulting data, and Section 5 concludes with future plans and applications for the corpus. Our corpus is available at https://github.com/gucorpling/amalgum.

2. Data

2.1. Corpus Composition

In order to ensure relatively high-quality NLP output and an open license, we base our dataset on the composition of an existing, smaller web corpus of English, called GUM (the Georgetown University Multilayer corpus (Zeldes, 2017)). The GUM corpus contains data from the same genres mentioned above, currently amounting to approximately 130,000 tokens. We use the term genre somewhat loosely here to describe any recurring combination of features which characterize groups of texts that are created under similar extralinguistic conditions and with comparable communicative intent (cf. Biber and Conrad (2009)). The corpus is manually annotated with a large number […] coreference in OntoNotes) and 4–20 times the size of the largest English discourse treebanks (around 1M tokens for the shallow-parsed PDTB (Prasad et al., 2019), and 200K tokens for RST-DT with full document trees (Carlson et al., 2003), both of which are restricted to WSJ text). The close match in content to the manually annotated GUM means that we expect AMALGUM to be particularly useful for applications such as active learning when combining resources from the gold-standard GUM data and the automatically prepared data presented here.

2.2. Document Filtering

Documents were scraped from sites that provide data under open licenses (in most cases, Creative Commons licenses). The one exception is text from Reddit forum discussions, which cannot be distributed directly. To work around this, we distribute only stand-off annotations of the Reddit text, along with a script that recovers the plain text using the Reddit API and then combines it with the annotations.
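The recovery workflow might look roughly like the following sketch, assuming the praw client library and a hypothetical stand-off format of (start, end, label) character spans keyed by submission ID; the actual script distributed with the corpus may differ in both respects.

```python
# A minimal sketch of stand-off recovery for the Reddit documents.
# The stand-off span format is a hypothetical example, not the
# corpus's actual distribution format.
import praw

def recover_submission_text(submission_id, client_id, client_secret):
    """Fetch the title and body of a Reddit submission via the API."""
    reddit = praw.Reddit(client_id=client_id,
                         client_secret=client_secret,
                         user_agent="amalgum-recovery-sketch/0.1")
    post = reddit.submission(id=submission_id)
    return post.title + "\n\n" + post.selftext

def attach_annotations(text, spans):
    """Re-join stand-off (start, end, label) spans with recovered text."""
    return [(text[start:end], label) for start, end, label in spans]
```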
To work around this, Annotated Lookalike of GUM), which covers eight English we distribute only stand-off annotations of the Reddit text, web genres for which we can obtain gold standard train- along with a script that recovers the plain text using the ing data: news, interviews, travel guides, how-to guides, Reddit API and then combines it with the annotations. academic papers, fiction, biographies, and forum discus- In order to ensure high quality, we applied a number of sions. Section 2 describes the data and how it was collected, heuristics during scraping to rule out documents that are Section 3 presents annotations added to the corpus and the likely to differ strongly from the original GUM data: for fic- process of using information across annotation layers to im- tion, we required the keyword “fiction” to appear in Project prove NLP tool output, Section 4 evaluates the quality of Gutenberg metadata, and disqualified documents that con- the resulting data, and Section 5 concludes with future plans tained archaic forms (e.g. “thou”) or unrestored hyphenation and applications for the corpus. Our corpus is available at (e.g. tokens like “disre-”, from broken “disregard”) by us- https://github.com/gucorpling/amalgum. ing a stop list of frequent items from sample documents marked for exclusion by a human. For Reddit, we ruled out 2. Data documents containing too many links or e-mail addresses, 2.1. Corpus Composition which were often just lists of links. In all wiki-based genres, In order to ensure relatively high-quality NLP output and references, tables of contents, empty sections, and all other an open license, we base our dataset on the composition of portions that contained non-primary or boilerplate text were an existing, smaller web corpus of English, called GUM removed. (Georgetown University Multilayer corpus (Zeldes, 2017)). The GUM corpus contains data from the same genres men- 2.3. Document Extent tioned above, currently amounting to approximately 130,000 Since the corpus described here is focused on discourse-level tokens. We use the term genre somewhat loosely here to annotations, such as coreference resolution and discourse describe any recurring combination of features which char- parses, we elected to only use documents similar in size to acterize groups of texts that are created under similar ex- the documents in the benchmark GUM data, which gener- tralinguistic conditions and with comparable communicative ally has texts of about 500–1,000 words. This restriction intent (cf. (Biber and Conrad, 2009)). The corpus is manu- was also necessary due to the disparity across genres, and ally annotated with a large number
Recommended publications
  • The Culture of Wikipedia
Good Faith Collaboration: The Culture of Wikipedia. Joseph Michael Reagle Jr. Foreword by Lawrence Lessig. The MIT Press, Cambridge, MA. Web edition, Copyright © 2011 by Joseph Michael Reagle Jr., CC-NC-SA 3.0. Purchase at Amazon.com | Barnes and Noble | IndieBound | MIT Press.

Wikipedia's style of collaborative production has been lauded, lambasted, and satirized. Despite unease over its implications for the character (and quality) of knowledge, Wikipedia has brought us closer than ever to a realization of the centuries-old pursuit of a universal encyclopedia. Good Faith Collaboration: The Culture of Wikipedia is a rich ethnographic portrayal of Wikipedia's historical roots, collaborative culture, and much-debated legacy.

"Reagle offers a compelling case that Wikipedia's most fascinating and unprecedented aspect isn't the encyclopedia itself — rather, it's the collaborative culture that underpins it: brawling, self-reflexive, funny, serious, and full-tilt committed to the project, even if it means setting aside personal differences. Reagle's position as a scholar and a member of the community makes him uniquely situated to describe this culture." —Cory Doctorow, Boing Boing

"Reagle provides ample data regarding the everyday practices and cultural norms of the community which collaborates to produce Wikipedia. His rich research and nuanced appreciation of the complexities of cultural digital media research are well presented."

Contents: Foreword; Preface to the Web Edition; Praise for Good Faith Collaboration; Preface; Extended Table of Contents; 1. Nazis and Norms; 2. The Pursuit of the Universal Encyclopedia; 3. Good Faith Collaboration; 4. The Puzzle of Openness; […]
  • Public Wikis: Sharing Knowledge Over the Internet
Technical Communication Wiki: Chapter 2. This page last changed on Apr 15, 2009 by kjp15.

Public Wikis: Sharing Knowledge over the Internet

What is a public wiki? A public wiki is a wiki that anyone can edit: it is open to the general public on the Internet, with or without a free login. Anyone is welcome to add information to the wiki, but contributors are asked to stay within the subject area, policies, and standards of the particular wiki. Content that does not relate to the subject or is not of good quality can and will be deleted by other users. Also, if a contributor makes a mistake or a typo, other users will eventually correct the error. Open collaboration by anyone means that many articles can be written quickly and simultaneously. And while there may be some vandalism or edit wars, the quality of the articles will generally improve over time after many editors have contributed.

The biggest public wiki in the world, with over 75,000 contributors, is Wikipedia, "the free encyclopedia that anyone can edit" [1]. Wikipedia has over 2.8 million articles in the English version, which includes over 16 million wiki pages, and is one of the top ten most popular websites in the United States [2]. There are versions of Wikipedia in ten different languages, and articles in more than 260 languages. Wikipedia is operated by the non-profit Wikimedia Foundation [3], which pays a small group of employees and depends mainly on donations. There are also over 1,500 volunteer administrators with special powers to keep an eye on editors and make sure they conform to Wikipedia's guidelines and policies.
  • Wikimania 2006: The 2nd International Wikimedia Foundation Conference
Wikimania 2006: The 2nd International Wikimedia Foundation Conference. Cambridge, Massachusetts, USA, August 4–6, 2006. http://wikimania2006.wikimedia.org/

To our Sponsors: Thank you! (Patrons, Benefactors, Friends, Supporters)

Conference Welcome

Welcome to the second annual international Wikimedia conference! The Wikimedia Foundation is excited to host the conference -- and proud that it is now officially an annual event! The past year has been an exciting one for the Foundation, filled with challenges and record growth across the projects. Questions raised by these challenges and changes will be well debated, and in some cases answered, during the conference.

On behalf of the many people who have made Wikimania 2006 a reality, we are pleased to welcome Wikimaniacs to Cambridge, Massachusetts, and to Harvard University. We hope you enjoy your stay here. We would like to extend a special welcome to our international […] lawyers, technologists), and we hope that this diversity will help each of you to consider free knowledge, open information, and collaborative peer production in all of their forms.

One thing worth emphasizing: this conference is not for spectators. Participate actively in discussions you go to; ask questions; leave comments on the conference wiki; give an impromptu lightning presentation; share crazy ideas with your fellow conferees. And have fun! Whether it's competing in the world Calvinball championship or debating whether consensus really scales, we want you to have a good time. Please talk to a volunteer or information desk staffer if you have questions or need help during the conference.
  • Working with MediaWiki (Yaron Koren)
Working with MediaWiki, by Yaron Koren. Published by WikiWorks Press. Copyright © 2012 by Yaron Koren, except where otherwise noted. Chapter 17, "Semantic Forms", includes significant content from the Semantic Forms homepage (https://www.mediawiki.org/wiki/Extension:Semantic_Forms), available under the Creative Commons BY-SA 3.0 license. All rights reserved. Library of Congress Control Number: 2012952489. ISBN: 978-0615720302. First edition, second printing: 2014. Ordering information for this book can be found at http://workingwithmediawiki.com. All printing of this book is handled by CreateSpace (https://createspace.com), a subsidiary of Amazon.com. Cover design by Grace Cheong (http://gracecheong.com).

Contents

1. About MediaWiki: History of MediaWiki; Community and support; Available hosts
2. Setting up MediaWiki: The MediaWiki environment; Download; Installing; Setting the logo; Changing the URL structure; Updating MediaWiki
3. Editing in MediaWiki: Tabs; Creating and editing pages; Page history; Page diffs; Undoing; Blocking and rollbacks; Deleting revisions; Moving pages; Deleting pages; Edit conflicts
4. MediaWiki syntax: Wikitext; Interwiki links; Including HTML; Templates; Parser and tag functions; Variables; Behavior switches
5. Content organization: Categories; Namespaces; Redirects; Subpages and super-pages; Special pages
6. Communication: Talk pages; LiquidThreads; Echo & Flow; Handling reader comments; Chat; Emailing users
7. Images and files: Uploading; Displaying images; Image galleries; […]
  • How to Download Google Chrome
How to Download Google Chrome (Web Browser)

Article provided by wikiHow, a wiki building the world's largest, highest quality how-to manual. Content on wikiHow can be shared under a Creative Commons License. Written by: Tech Tested. https://www.wikihow.com/Download-and-Install-Google-Chrome

Google Chrome is a lightweight browser that is free to download for Windows, Mac OS X, Linux, Android, and iOS. Follow this guide to get it downloaded and installed on your system of choice. This is recommended for using the online virtual classroom platform, 'Blackboard Collaborate'.

Downloading Chrome for PC/Mac/Linux

1. Go to the Google Chrome website (https://www.google.com/chrome/). You can use any web browser to download Google Chrome. If you haven't installed a browser, you can use your operating system's preinstalled web browser (Internet Explorer for Windows and Safari for Mac OS X).
2. Click "Download Chrome". Make sure to click the download option corresponding to your operating system (Mac or Windows). This will open the Terms of Service window.
3. Determine if you want Chrome as your default browser. If you set it as the default browser, it will open whenever a link for a web page is clicked in another program, such as email. You can opt to send usage data back to Google by checking the box labeled "Help make Google Chrome better…" This will send back crash reports. […]
  • A Resource and Analyses on Edits in Instructional Texts
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 5721–5729, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

wikiHowToImprove: A Resource and Analyses on Edits in Instructional Texts

Talita Rani Anthonio*, Irshad Ahmad Bhat*, Michael Roth
University of Stuttgart, Institute for Natural Language Processing
{anthonta, bhatid, [email protected]

Abstract

Instructional texts, such as articles in wikiHow, describe the actions necessary to accomplish a certain goal. In wikiHow and other resources, such instructions are subject to revision edits on a regular basis. Do these edits improve instructions only in terms of style and correctness, or do they provide clarifications necessary to follow the instructions and to accomplish the goal? We describe a resource and first studies towards answering this question. Specifically, we create wikiHowToImprove, a collection of revision histories for about 2.7 million sentences from about 246,000 wikiHow articles. We describe human annotation studies on categorizing a subset of sentence-level edits and provide baseline models for the task of automatically distinguishing "older" from "newer" versions of a sentence.

Keywords: Corpus creation, Semantics, Other

1. Introduction

The ever-increasing size of the World Wide Web has made it possible to find instructional texts, or how-to guides, on practically any topic or activity. wikiHow is an online plat[form] […] Example sentences (with timestamps):
1. Cut strips of paper and then write … nouns on them (…)
2. Put the pieces of paper into a hat or bag. (…)
3. Have the youngest player choose the first piece of paper.
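As a rough illustration of the "older" vs. "newer" task setup described in this abstract, adjacent revisions of a sentence could be paired as follows. This is a hypothetical sketch, since the excerpt does not specify the resource's actual data format or baselines.

```python
# Pair adjacent revisions of one sentence into (older, newer) examples;
# the list-of-strings input format is an assumption for illustration.
def version_pairs(revisions):
    """Given one sentence's revisions ordered oldest to newest,
    return (older, newer) pairs from adjacent versions."""
    return [(old, new)
            for old, new in zip(revisions, revisions[1:])
            if old != new]  # skip revisions that left the sentence unchanged

pairs = version_pairs([
    "Put the pieces of paper into a hat.",
    "Put the pieces of paper into a hat or bag.",
])
```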
  • What Is Creative Commons?
Creative Commons for Educators and Librarians. ALA Editions, Chicago, 2020. ALA Editions purchases fund advocacy, awareness, and accreditation programs for library professionals worldwide. Available in print from the ALA Store and distributors.

© 2020 Creative Commons. Unless otherwise noted, Creative Commons for Educators and Librarians is licensed Creative Commons Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0). This book is published under a CC BY license, which means that you can copy, redistribute, remix, transform, and build upon the content for any purpose, even commercially, as long as you give appropriate credit, provide a link to the license, indicate if changes were made, and do not impose additional terms or conditions on others that prohibit them from exercising the rights granted by that license, including any effective technological measures. A digital version of this book is available for download at https://certificates.creativecommons.org/about/certificate-resources-cc-by.

Extensive effort has gone into ensuring the accuracy and reliability of the information in this book; however, neither the publisher nor the author make any warranty, express or implied, with respect to its content. Additionally, this book contains third-party content, such as illustrations, photographs and graphs, that may not be owned by Creative Commons but that is licensed by the third party under a Creative Commons license. Neither the publisher nor Creative Commons warrant that re-use of that content will not infringe the rights of third parties. Reasonable efforts have been made to identify that content when possible.
  • What Is Social Media?
What Is Social Media? An e-book by Antony Mayfield from iCrossing. V1.4, updated 01.08.08. Image: "weather project bw 01" by Nick Winchester (www.sxc.hu/profile/nickwinch). icrossing.co.uk/ebooks

Contents: Introduction; What Is Social Media?; The New Means of Production and Distribution; How Social Media Works; How Social Networks Work; How Blogs Work; How Wikis Work; How Podcasts Work; How Forums Work; How Content Communities Work; How Micro-Blogging Works; How Second Life Works; About iCrossing; About the Author; Creative Commons Copyright; Glossary; Useful Websites

"Social computing is not a fad. Nor is it something that will pass you or your company by. Gradually, social computing will impact almost every role, at every kind of company, in all parts of the world." – Forrester Research, Social Computing: How Networks Erode Institutional Power, And What To Do About It

Introduction

Thanks for downloading this e-book. It's written as a short, sweet summary of the phenomenon called social media. It's an unashamedly straightforward work, intended to give you a brief overview of the story so far, maybe fill in a few gaps and act as a reference guide. It's intended for anyone, but will be most useful to people working in media, marketing and communications.
  • Guide to Copyright and CC for Legal Services
The Legal Services Organization's Guide to Copyright Law (& How to Make it Work For You), by Liz Leman and Brian Rowe, Legal Services National Technology Assistance Project (www.LSNTAP.org).

Table of Contents: Chapter 1: Copyright basics; Chapter 2: Creative Commons; Chapter 3: How to find and use content; Chapter 4: Resources

Chapter 1: Copyright basics

What copyright is. In its most basic definition, copyright law is a set of automatically conferred rights designed to reserve a creator's rights to his or her own original works. Of course, there are a number of words in that sentence which are open to interpretation. Let's take a look at what the Copyright Act (1976) says about each.

Rights. There are two types of "rights" to define and consider here: economic and moral. If a creator has economic rights to his or her work, they have the right to get paid when other people utilize it in some way (see below for specific uses that are and are not allowed). If a creator has moral rights to his or her work, they have the right to allow or disallow use of their work as they see fit. This moral right is very weak under US copyright law; almost all rights granted are economic. This distinction between economic and moral rights is important in US copyright law because of the way the Copyright Act lays out a creator's rights. The Act gives creators the right (in […]
  • The Sad, Sad Story of Social Wikipedia: A Long and Drawn-Out Tale of Change, More Change, and No Change Whatsoever
The Sad, Sad Story of Social Wikipedia: A long and drawn-out tale of change, more change, and no change whatsoever.

Necessary disclaimer: Dates may be horribly wrong; assumptions of cause and effect may be terribly faulty; and general conclusions may be utterly arbitrary. This is all pieced together rather after the fact by someone who wasn't really there for any of it. Please don't eat me.

The Before Times
● 1999: Nupedia
  ○ Precursor to Wikipedia, not really the same thing at all
  ○ Died horribly
● 2001: Wikipedia created using UseModWiki
  ○ Page-based storage engine written in Perl
  ○ Discussions were inline or on the pages themselves

2001–2002: Phase 2 and Phase 3
● Transition to a new PHP platform
  ○ Phase 2: PHP attempt #1
  ○ Phase 3: PHP attempt #2 - MediaWiki
● It looks basically the same.

2002: MediaWiki. But what made it different?
● Wikitext
● Organisational tools
  ○ Namespaces
  ○ Watchlists, RecentChanges, Logs, and other special pages
  ○ Email notifications
● Discussion is now no longer in the articles themselves – separate namespaces are created as talk pages.

Talk page formatting
● Discussions are separated with a header or a line between each
● Comments are signed and replies indented
● All formatting done manually

Talk page editing (example wikitext):
== Top of talk page ==
Why do the instructions say to put this on the top of the talk page? I would never look at the top of my talk page for a message (until I can't find the new message at the bottom). -- <font color="#000080">Mufka</font> [[User:Mufka|<sup>(u)</sup>]] [[User talk:Mufka|<sup>(t)</sup>]] [[Special:Contributions/Mufka|<sup>(c)</sup>]] 12:24, 21 October 2008 (UTC)
:Well I guess that's been resolved.
  • The GUM Corpus: Creating Multilayer Resources in the Classroom
The GUM Corpus: Creating Multilayer Resources in the Classroom

Amir Zeldes
Georgetown University
[email protected]

Abstract

This paper presents the methodology, design principles and detailed evaluation of a new freely available multilayer corpus, collected and edited via classroom annotation using collaborative software. After briefly discussing corpus design for open, extensible corpora, five classroom annotation projects are presented, covering structural markup in TEI XML, multiple part-of-speech tagging, constituent and dependency parsing, information-structural and coreference annotation, and Rhetorical Structure Theory analysis. Layers are inspected for annotation quality, and together they coalesce to form a richly annotated corpus that can be used to study the interactions between different levels of linguistic description. The evaluation gives an indication of the expected quality of a corpus created by students with relatively little training. A multifactorial example study on lexical NP coreference likelihood is also presented, which illustrates some applications of the corpus. The results of this project show that high-quality, richly annotated resources can be created effectively as part of a linguistics curriculum, opening new possibilities not just for research, but also for corpora in linguistics pedagogy.

1 Introduction

Among the trends in corpus linguistics in recent years, there are at least two developments that have promoted an explosion of complex language data becoming readily available: the advent of the age of multilayer annotations and the expansion of the base of corpus creators and related software to allow for collaborative, distributed annotation across space and time. Corpora have grown progressively more complex and multifactorial, going beyond tagged, or even syntactically annotated, treebanks to encompass multiple, simultaneous levels of analysis.
  • Wikimedia Foundation
Wikimedia Foundation Annual Report 2007–2008

[About the Wikimedia Foundation] We are the non-profit, 501(c)3 charitable foundation that operates Wikipedia and other free knowledge projects. The Foundation was established by Jimmy Wales in 2003, two years after creating Wikipedia, to build a long-term future for free knowledge projects on the internet. The Foundation, now based in San Francisco, California, maintains the technical infrastructure, software, and servers that allow millions of people every day to freely use Wikipedia and its sister projects. The Foundation also manages programs and partnerships that extend our basic mission: to expand free knowledge around the world. With just over 20 paid staff, the Foundation is the official representative for its projects, offering legal, administrative, and financial support. The Foundation is also responsible for conducting a range of fundraising activities, including the annual community giving campaign, to finance our annual plan and operations overhead.

Front cover: Wikimania 2007 volunteers, photo by Everlong; CC-BY-SA from Wikimedia Commons.

[Mission Statement] Empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally. wikimediafoundation.org

This annual report covers the Foundation's 2007/2008 fiscal year: July 1, 2007 through June 30, 2008.

{Letter from the Executive Director} The Wikimedia Foundation is the organization behind Wikipedia, the world's largest and most popular encyclopedia, and the only website in the global top 5 that's run by a non-profit.