Author and Language Profiling of Short Texts
16 pages, PDF, 1020 KB
Recommended publications
-
The Culture of Wikipedia
Good Faith Collaboration: The Culture of Wikipedia. Joseph Michael Reagle Jr. Foreword by Lawrence Lessig. The MIT Press, Cambridge, MA. Web edition, Copyright © 2011 by Joseph Michael Reagle Jr., CC-NC-SA 3.0. Contents: Foreword; Preface to the Web Edition; Preface; 1. Nazis and Norms; 2. The Pursuit of the Universal Encyclopedia; 3. Good Faith Collaboration; 4. The Puzzle of Openness.

Wikipedia's style of collaborative production has been lauded, lambasted, and satirized. Despite unease over its implications for the character (and quality) of knowledge, Wikipedia has brought us closer than ever to a realization of the centuries-old pursuit of a universal encyclopedia. Good Faith Collaboration: The Culture of Wikipedia is a rich ethnographic portrayal of Wikipedia's historical roots, collaborative culture, and much-debated legacy.

"Reagle offers a compelling case that Wikipedia's most fascinating and unprecedented aspect isn't the encyclopedia itself — rather, it's the collaborative culture that underpins it: brawling, self-reflexive, funny, serious, and full-tilt committed to the project, even if it means setting aside personal differences. Reagle's position as a scholar and a member of the community makes him uniquely situated to describe this culture." —Cory Doctorow, Boing Boing

"Reagle provides ample data regarding the everyday practices and cultural norms of the community which collaborates to produce Wikipedia. His rich research and nuanced appreciation of the complexities of cultural digital media research are well presented."
-
Wikipedia Data Analysis
Wikipedia data analysis. Introduction and practical examples. Felipe Ortega. June 29, 2012.

Abstract: This document offers a gentle and accessible introduction to conducting quantitative analysis with Wikipedia data. The large size of many Wikipedia communities, the fact that (for many languages) they already account for more than a decade of activity, and the precise level of detail of the records of this activity represent an unparalleled opportunity for researchers to conduct interesting studies in a wide variety of scientific disciplines.

The focus of this introductory guide is on data retrieval and data preparation. Many research works have explained in detail numerous examples of Wikipedia data analysis, including numerical and graphical results. However, in many cases very little attention is paid to explaining the data sources that were used in these studies, how these data were prepared before the analysis, and additional information that is also available to extend the analysis in future studies. This is rather unfortunate since, in general, data retrieval and data preparation usually consume no less than 75% of the whole time devoted to quantitative analysis, especially with very large datasets.

As a result, the main goal of this document is to fill this gap, providing detailed descriptions of the available data sources in Wikipedia, and practical methods and tools to prepare these data for analysis. This is the focus of the first part, dealing with data retrieval and preparation; it also presents an overview of useful existing tools and frameworks that facilitate this process. The second part describes a general methodology for quantitative analysis with Wikipedia data, and open-source tools that can serve as building blocks for researchers implementing their own analysis process.
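For a taste of the data-preparation work the guide covers, the sketch below streams page titles and latest-revision timestamps out of a Wikipedia XML dump using only the Python standard library. The dump filename and the export-schema namespace are assumptions; both vary by wiki and dump version (dumps are published at https://dumps.wikimedia.org/).

import bz2
import xml.etree.ElementTree as ET

DUMP = "simplewiki-latest-pages-articles.xml.bz2"  # assumed local dump file
NS = "{http://www.mediawiki.org/xml/export-0.11/}"  # assumed; check your dump's <mediawiki> tag

def iter_pages(path):
    """Stream (title, revision timestamp) pairs without loading the dump into memory."""
    with bz2.open(path, "rb") as dump:
        for _event, elem in ET.iterparse(dump):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                timestamp = elem.findtext(NS + "revision/" + NS + "timestamp")
                yield title, timestamp
                elem.clear()  # free the subtree we just processed

if __name__ == "__main__":
    for i, (title, timestamp) in enumerate(iter_pages(DUMP)):
        print(title, timestamp)
        if i >= 9:  # the first ten pages are enough for a smoke test
            break
-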
Approaches to NLP-Supported Information Quality Management in Wikipedia
The Quality of Content in Open Online Collaboration Platforms: Approaches to NLP-Supported Information Quality Management in Wikipedia. Dissertation approved by the Department of Computer Science of the Technische Universität Darmstadt for the degree of Doktor-Ingenieur (Dr.-Ing.), submitted by Oliver Ferschke, M.A., born in Würzburg. Submitted: May 15, 2014; defended: July 15, 2014. Examiners: Prof. Dr. Iryna Gurevych (Darmstadt), Prof. Dr. Hinrich Schütze (Munich), Assoc. Prof. Carolyn Penstein Rosé, PhD (Pittsburgh). Darmstadt 2014, D17. Please cite this document as URN: urn:nbn:de:tuda-tuprints-40929, URL: http://tuprints.ulb.tu-darmstadt.de/4092. This document is provided by tuprints, the e-publishing service of the TU Darmstadt (http://tuprints.ulb.tu-darmstadt.de, [email protected]). Published under the Creative Commons license Attribution – Non Commercial – No Derivative Works 2.0 Germany (http://creativecommons.org/licenses/by-nc-nd/2.0/de/deed.en).

Abstract: Over the past decade, the paradigm of the World Wide Web has shifted from static web pages towards participatory and collaborative content production. The main properties of this user-generated content are a low publication threshold and little or no editorial control. While this has improved the variety and timeliness of the available information, it causes an even higher variance in quality than the already heterogeneous quality of traditional web content. Wikipedia is the prime example of a successful, large-scale, collaboratively created resource that reflects the spirit of the open collaborative content creation paradigm. Even though recent studies have confirmed that the overall quality of Wikipedia is high, there is still a wide gap that must be bridged before Wikipedia reaches the state of a reliable, citable source.
-
A Taxonomy of Knowledge Gaps for Wikimedia Projects (Second Draft)
A Taxonomy of Knowledge Gaps for Wikimedia Projects (Second Draft). Miriam Redi, Martin Gerlach, Isaac Johnson, Jonathan Morgan, and Leila Zia (Wikimedia Foundation). arXiv:2008.12314v2 [cs.CY], 29 Jan 2021.

Executive summary: In January 2019, prompted by the Wikimedia Movement's 2030 strategic direction [108], the Research team at the Wikimedia Foundation identified the need to develop a knowledge gaps index: a composite index to support decision makers across the Wikimedia movement by providing a framework to encourage structured and targeted brainstorming discussions, and data on the state of the knowledge gaps across the Wikimedia projects that can inform decision making and assist with measuring the long-term impact of large-scale initiatives in the Movement. After its first release in July 2020, the Research team has developed the second complete draft of a taxonomy of knowledge gaps for the Wikimedia projects, as the first step towards building the knowledge gaps index. We studied more than 250 references by scholars, researchers, practitioners, community members and affiliates, exposing evidence of knowledge gaps in readership, contributorship, and content of Wikimedia projects. We elaborated the findings and compiled the taxonomy of knowledge gaps in this paper, where we describe, group and classify knowledge gaps into a structured framework. The taxonomy that you will learn more about in the rest of this work will serve as a basis to operationalize and quantify knowledge equity, one of the two 2030 strategic directions, through the knowledge gaps index.
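A composite index is easy to make concrete. The sketch below computes a hypothetical knowledge-gaps index as a weighted mean of per-dimension gap scores; the dimensions follow the readership/contributorship/content grouping mentioned above, but the scores, weights, and scale are purely illustrative assumptions, not the Foundation's actual metrics.

# Hypothetical sketch of a composite knowledge-gaps index: each gap
# dimension gets a score in [0, 1] (1 = gap fully closed) and a weight,
# and the index is their weighted mean. All values here are illustrative.

GAP_SCORES = {  # assumed per-dimension scores for one project
    "readership": 0.62,
    "contributorship": 0.41,
    "content": 0.55,
}
WEIGHTS = {  # assumed equal importance; any non-negative weighting works
    "readership": 1 / 3,
    "contributorship": 1 / 3,
    "content": 1 / 3,
}

def knowledge_gap_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension gap scores, normalized by total weight."""
    total = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total

print(f"index = {knowledge_gap_index(GAP_SCORES, WEIGHTS):.3f}")  # prints: index = 0.527
-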
Book of Abstracts
Table of contents (excerpt): Session O1 - Machine Translation & Evaluation; Session O2 - Semantics & Lexicon (1); Session O3 - Corpus Annotation & Tagging; Session O4 - Dialogue; Session P1 - Anaphora, Coreference; Session P2 - Collaborative Resource Construction & Crowdsourcing; Session P3 - Information Extraction, Information Retrieval, Text Analytics (1); Session P4 - Infrastructural Issues/Large Projects (1); Session P5 - Knowledge Discovery/Representation; Session P6 - Opinion Mining / Sentiment Analysis (1); Session P7 - Social Media Processing (1); Session O5 - Language Resource Policies & Management; Session O6 - Emotion & Sentiment (1); Session O7 - Knowledge Discovery & Evaluation (1); Session O8 - Corpus Creation, Use & Evaluation (1); Session P8 - Character Recognition and Annotation; Session P9 - Conversational Systems/Dialogue/Chatbots/Human-Robot Interaction (1); Session P10 - Digital Humanities; Session P11 - Lexicon (1); Session P12 - Machine Translation, SpeechToSpeech Translation (1); Session P13 - Semantics (1); Session P14 - Word Sense Disambiguation; Session O9 - Bio-medical Corpora; Session O10 - MultiWord Expressions; Session O11 - Time & Space; Session O12 - Computer Assisted Language Learning; Session P15 - Annotation Methods and Tools; Session P16 - Corpus Creation, Annotation, Use (1); Session P17 - Emotion Recognition/Generation; Session P18 - Ethics and Legal Issues; Session P19 - LR Infrastructures and Architectures; Session I-O1: Industry Track - Industrial systems; Session O13 - Paraphrase & Semantics; Session O14 - Emotion & Sentiment (2); Session O15 - Semantics & Lexicon (2); Session O16 - Bilingual Speech Corpora & Code-Switching; Session P20 - Bibliometrics, Scientometrics, Infometrics; Session P21 - Discourse Annotation, Representation and Processing (1); Session P22 - Evaluation Methodologies ...
-
The Writing Process in Online Mass Collaboration NLP-Supported Approaches to Analyzing Collaborative Revision and User Interaction
The Writing Process in Online Mass Collaboration: NLP-Supported Approaches to Analyzing Collaborative Revision and User Interaction. Dissertation approved by the Department of Computer Science of the Technische Universität Darmstadt for the degree of Dr.-Ing., submitted by Johannes Daxenberger, M.A., born in Rosenheim. Submitted: May 28, 2015; defended: July 21, 2015. Examiners: Prof. Dr. Iryna Gurevych (Darmstadt), Prof. Dr. Karsten Weihe (Darmstadt), Assoc. Prof. Ofer Arazy, Ph.D. (Alberta). Darmstadt 2016, D17. Please cite this document as URN: urn:nbn:de:tuda-tuprints-52259, URL: http://tuprints.ulb.tu-darmstadt.de/5225. This document is provided by tuprints, the e-publishing service of the TU Darmstadt (http://tuprints.ulb.tu-darmstadt.de, [email protected]). Published under the Creative Commons license Attribution – Non Commercial – No Derivative Works 3.0 Germany (http://creativecommons.org/licenses/by-nc-nd/3.0/de/deed.en).

Abstract: In the past 15 years, the rapid development of web technologies has created novel ways of collaborative editing. Open online platforms have attracted millions of users from all over the world. The open encyclopedia Wikipedia, started in 2001, has become a very prominent example of a largely successful platform for collaborative editing and knowledge creation. The wiki model has enabled collaboration at a new scale, with more than 30,000 monthly active users on the English Wikipedia. Traditional writing research deals with questions concerning revision and the writing process itself. The analysis of collaborative writing additionally raises questions about the interaction of the involved authors. Interaction takes place when authors write on the same document (indirect interaction), or when they coordinate the collaborative writing process by means of communication (direct interaction).
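Analyzing collaborative revision starts from the difference between two versions of a document. As a minimal illustration of that first step, the sketch below uses Python's standard difflib to extract the tokens an edit inserted and deleted; it is a toy, not the thesis's actual pipeline.

# Diff two versions of an article sentence and collect the tokens the
# edit inserted and deleted. Standard library only; example texts are
# made up for illustration.
import difflib

old = "Wikipedia is a free encyclopedia edited by volunteers."
new = "Wikipedia is a free online encyclopedia written and edited by volunteers."

old_tokens, new_tokens = old.split(), new.split()
matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)

inserted, deleted = [], []
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op in ("replace", "delete"):
        deleted.extend(old_tokens[i1:i2])
    if op in ("replace", "insert"):
        inserted.extend(new_tokens[j1:j2])

print("inserted:", inserted)  # ['online', 'written', 'and']
print("deleted:", deleted)    # []
-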
Simple English Wikipedia
Simple English Wikipedia. From Wikipedia, the free encyclopedia. The Simple English Wikipedia is a Wikipedia encyclopedia written in basic English.[1] Articles in the Simple English Wikipedia use shorter sentences and simpler words and grammar than the English Wikipedia. The Simple English Wikipedia is also for people with different needs. Some examples of people who use Simple English Wikipedia: students; children; adults who might find it hard to learn or read; people who are learning English. Other people use the Simple English Wikipedia because the simple language helps them understand hard ideas or topics they do not know about.

When the Simple English Wikipedia began in 2003, the ordinary English Wikipedia already had 150,000 articles, and seven other Wikipedias in other languages had over 15,000 articles. Since the other Wikipedias already have so many articles, most Simple English articles take articles from other Wikipedias and make them simple; they are usually not new articles. Right now, the Simple English Wikipedia has 119,975 articles. As of May 2016, it was the 51st-largest Wikipedia.[2] This makes Simple English articles a good way to understand difficult articles from the ordinary English Wikipedia. If someone cannot understand an idea in complex English, they can read the Simple English article. Many articles are shorter than the same articles in the English Wikipedia.[3] Technical subjects use some terms which are not simple; editors try to explain these terms in a simple way. Related pages: Wikipedia:Simple English Wikipedia; Simple English Wiktionary. References: 1. Tim Dowling (January 14, 2008).
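Article counts like the one quoted above change constantly; the current figure can be checked against the standard MediaWiki siteinfo API, as in this minimal sketch (the User-Agent string is an arbitrary placeholder):

# Fetch the live article count of the Simple English Wikipedia from the
# standard MediaWiki API (siteinfo/statistics is part of core MediaWiki).
import json
import urllib.request

URL = (
    "https://simple.wikipedia.org/w/api.php"
    "?action=query&meta=siteinfo&siprop=statistics&format=json"
)
req = urllib.request.Request(URL, headers={"User-Agent": "wiki-stats-sketch/0.1"})

with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)["query"]["statistics"]

print("articles:", stats["articles"])        # 119,975 when the text above was written
print("active users:", stats["activeusers"])
-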
“The Sum of All Human Knowledge”: A Systematic Review of Scholarly Research on the Content of Wikipedia
“The Sum of All Human Knowledge”: A Systematic Review of Scholarly Research on the Content of Wikipedia. Mesgari, Mostafa; Okoli, Chitu; Mehdi, Mohamad; Nielsen, Finn Årup; Lanamäki, Arto. Published in: Journal of the American Society for Information Science and Technology, 66(2), 219-245 (2015). DOI: 10.1002/asi.23172. Citation (APA): Mesgari, M., Okoli, C., Mehdi, M., Nielsen, F. Å., & Lanamäki, A. (2015). “The Sum of All Human Knowledge”: A Systematic Review of Scholarly Research on the Content of Wikipedia. Journal of the American Society for Information Science and Technology, 66(2), 219-245. https://doi.org/10.1002/asi.23172
-
Wikipedia
Wikipedia. From Wikipedia, the free encyclopedia. This article is about the Internet encyclopedia; for other uses, see Wikipedia (disambiguation). For Wikipedia's non-encyclopedic visitor introduction, see Wikipedia:About.

Logo: a globe featuring glyphs from several writing systems, most of them meaning the letter W or the sound "wi". Screenshot: main page of the English Wikipedia. Web address: wikipedia.org. Slogan: The free encyclopedia that anyone can edit. Commercial: No. Type of site: Internet encyclopedia. Registration: Optional.[notes 1] Available in 287 editions.[1] Users: 73,251 active editors (May 2014),[2] 23,074,027 total accounts. Content license: CC Attribution / Share-Alike 3.0; most text also dual-licensed under the GFDL; media licensing varies. Owner: Wikimedia Foundation. Created by: Jimmy Wales, Larry Sanger.[3] Launched: January 15, 2001; 13 years ago. Alexa rank: 6 (September 2014).[4] Current status: Active.

Wikipedia (/ˌwɪkɪˈpiːdiə/ or /ˌwɪkiˈpiːdiə/, wik-i-PEE-dee-ə) is a free-access, free-content Internet encyclopedia, supported and hosted by the non-profit Wikimedia Foundation. Anyone who can access the site[5] can edit almost any of its articles. Wikipedia is the sixth-most popular website[4] and constitutes the Internet's largest and most popular general reference work.[6][7][8] As of February 2014, it had 18 billion page views and nearly 500 million unique visitors each month.[9] Wikipedia has more than 22 million accounts, of which over 73,000 were active editors globally as of May 2014.[2] Jimmy Wales and Larry Sanger launched Wikipedia on January 15, 2001.
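Traffic figures like the monthly page views cited above can be recomputed from the public Wikimedia Analytics REST API. A minimal sketch follows; the month queried is illustrative, the User-Agent string is a placeholder, and the API only serves data from 2015 onwards.

# Fetch monthly page-view totals for the English Wikipedia from the
# public Wikimedia Analytics REST API.
import json
import urllib.request

URL = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/"
    "en.wikipedia/all-access/all-agents/monthly/2024010100/2024020100"
)
req = urllib.request.Request(URL, headers={"User-Agent": "pageview-sketch/0.1"})

with urllib.request.urlopen(req) as resp:
    items = json.load(resp)["items"]

for item in items:
    print(item["timestamp"], f'{item["views"]:,} views')
-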
Towards an On-Demand Simple Portuguese Wikipedia
Towards an on-demand Simple Portuguese Wikipedia. Arnaldo Candido Junior (Institute of Mathematics and Computer Sciences, University of São Paulo, arnaldoc at icmc.usp.br), Ann Copestake (Computer Laboratory, University of Cambridge, Ann.Copestake at cl.cam.ac.uk), Lucia Specia (Research Group in Computational Linguistics, University of Wolverhampton, l.specia at wlv.ac.uk), Sandra Maria Aluísio (Institute of Mathematics and Computer Sciences, University of São Paulo, sandra at icmc.usp.br).

Abstract: The Simple English Wikipedia provides a simplified version of Wikipedia's English articles for readers with special needs. However, there are fewer efforts to make information in Wikipedia in other languages accessible to a large audience. This work proposes the use of a syntactic simplification engine with high-precision rules to automatically generate a Simple Portuguese Wikipedia on demand, based on user interactions with the main Portuguese Wikipedia. Our estimates indicated that a human can simplify about 28,000 occurrences of analysed patterns per million words, while our system can correctly simplify 22,200 occurrences, with an estimated f-measure of 77.2%.

From the introduction: ... Language (the set of guidelines to write simplified texts) are related to both syntax and lexical levels: use short sentences; avoid hidden verbs; use active voice; use concrete, short, simple words. A number of resources, such as lists of common words, are available for the English language to help users write in Simple English. These include lexical resources like the MRC Psycholinguistic Database, which helps identify difficult words using psycholinguistic measures. However, such resources do not exist for Portuguese. An exception is a small list of simple words compiled as part of the PorSimples project (Aluisio et al., 2008). Although the guidelines from Plain Language can in principle be applied to many languages and text genres, for Portuguese there are very few efforts using Plain Language to make ...
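For a flavor of what one rule in a rule-based syntactic simplification engine can look like, here is a toy Python illustration that splits a sentence at a coordinating conjunction. It is an assumption-laden sketch for English, not the authors' high-precision Portuguese engine.

# Toy rule-based syntactic simplification: split "X, and Y" / "X, but Y"
# into two shorter sentences. Illustrative only; real engines use parse
# trees and many more constraints before a rule is allowed to fire.
import re

def split_conjunction(sentence: str) -> list[str]:
    """Split a sentence at ', and' or ', but' into two sentences when the pattern matches."""
    match = re.match(r"^(.+?), (?:and|but) (.+)$", sentence)
    if not match:
        return [sentence]  # rule does not fire; leave the sentence unchanged
    first, second = match.group(1), match.group(2)
    second = second[0].upper() + second[1:]  # re-capitalize the second half
    return [first.rstrip(".") + ".", second]

print(split_conjunction(
    "The article was long, but most readers only wanted the summary."
))
# ['The article was long.', 'Most readers only wanted the summary.']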