Text Corpora and the Challenge of Newly Written Languages
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Compilation of a Swiss German Dialect Corpus and Its Application to Pos Tagging
Compilation of a Swiss German Dialect Corpus and its Application to PoS Tagging Nora Hollenstein Noemi¨ Aepli University of Zurich University of Zurich [email protected] [email protected] Abstract Swiss German is a dialect continuum whose dialects are very different from Standard German, the official language of the German part of Switzerland. However, dealing with Swiss German in natural language processing, usually the detour through Standard German is taken. As writing in Swiss German has become more and more popular in recent years, we would like to provide data to serve as a stepping stone to automatically process the dialects. We compiled NOAH’s Corpus of Swiss German Dialects consisting of various text genres, manually annotated with Part-of- Speech tags. Furthermore, we applied this corpus as training set to a statistical Part-of-Speech tagger and achieved an accuracy of 90.62%. 1 Introduction Swiss German is not an official language of Switzerland, rather it includes dialects of Standard German, which is one of the four official languages. However, it is different from Standard German in terms of phonetics, lexicon, morphology and syntax. Swiss German is not dividable into a few dialects, in fact it is a dialect continuum with a huge variety. Swiss German is not only a spoken dialect but increasingly used in written form, especially in less formal text types. Often, Swiss German speakers write text messages, emails and blogs in Swiss German. However, in recent years it has become more and more popular and authors are publishing in their own dialect. Nonetheless, there is neither a writing standard nor an official orthography, which increases the variations dramatically due to the fact that people write as they please with their own style. -
Librarians As Wikimedia Movement Organizers in Spain: an Interpretive Inquiry Exploring Activities and Motivations
Librarians as Wikimedia Movement Organizers in Spain: An interpretive inquiry exploring activities and motivations by Laurie Bridges and Clara Llebot Laurie Bridges Oregon State University https://orcid.org/0000-0002-2765-5440 Clara Llebot Oregon State University https://orcid.org/0000-0003-3211-7396 Citation: Bridges, L., & Llebot, C. (2021). Librarians as Wikimedia Movement Organizers in Spain: An interpretive inQuiry exploring activities and motivations. First Monday, 26(6/7). https://doi.org/10.5210/fm.v26i3.11482 Abstract How do librarians in Spain engage with Wikipedia (and Wikidata, Wikisource, and other Wikipedia sister projects) as Wikimedia Movement Organizers? And, what motivates them to do so? This article reports on findings from 14 interviews with 18 librarians. The librarians interviewed were multilingual and contributed to Wikimedia projects in Castilian (commonly referred to as Spanish), Catalan, BasQue, English, and other European languages. They reported planning and running Wikipedia events, developing partnerships with local Wikimedia chapters, motivating citizens to upload photos to Wikimedia Commons, identifying gaps in Wikipedia content and filling those gaps, transcribing historic documents and adding them to Wikisource, and contributing data to Wikidata. Most were motivated by their desire to preserve and promote regional languages and culture, and a commitment to open access and open education. Introduction This research started with an informal conversation in 2018 about the popularity of Catalan Wikipedia, Viquipèdia, between two library coworkers in the United States, the authors of this article. Our conversation began with a sense of wonder about Catalan Wikipedia, which is ranked twentieth by number of articles, out of 300 different language Wikipedias (Meta contributors, 2020). -
Combating Attacks and Abuse in Large Online Communities
University of California Santa Barbara Combating Attacks and Abuse in Large Online Communities Adissertationsubmittedinpartialsatisfaction of the requirements for the degree Doctor of Philosophy in Computer Science by Gang Wang Committee in charge: Professor Ben Y. Zhao, Chair Professor Haitao Zheng Professor Christopher Kruegel September 2016 The Dissertation of Gang Wang is approved. Professor Christopher Kruegel Professor Haitao Zheng Professor Ben Y. Zhao, Committee Chair July 2016 Combating Attacks and Abuse in Large Online Communities Copyright c 2016 ⃝ by Gang Wang iii Acknowledgements I would like to thank my advisors Ben Y.Zhao and Haitao Zheng formentoringmethrough- out the PhD program. They were always there for me, giving me timely and helpful advice in almost all aspects of both research and life. I also want to thank my PhD committee member Christopher Kruegel for his guidance in my research projects and job hunting. Finally, I want to thank my mentors in previous internships: Jay Stokes, Cormac Herley, Weidong Cui and Helen Wang from Microsoft Research, and Vicente Silveira fromLinkedIn.Specialthanksto Janet Kayfetz, who has helped me greatly with my writing and presentation skills. Iamverymuchthankfultomycollaboratorsfortheirhardwork, without which none of this research would have been possible. First and foremost, to the members of SAND Lab at UC Santa Barbara: Christo Wilson, Bolun Wang, Tianyi Wang, Manish Mohanlal, Xiaohan Zhao, Zengbin Zhang, Xia Zhou, Ana Nika, Xinyi Zhang, Shiliang Tang, Alessandra Sala, Yibo Zhu, Lin Zhou, Weile Zhang, Konark Gill, Divya Sambasivan, Xiaoxiao Yu, Troy Stein- bauer, Tristan Konolige, Yu Su and Yuanyang Zhang. Second, totheemployeesatMicrosoft: Jack Stokes, Cormac Herley and David Felstead. Third, to Miriam Metzger from the Depart- ment of Communications at UC Santa Barbara, and Sarita Y. -
High Converting Email Marketing Templates
High Converting Email Marketing Templates Unwithering Rochester madder: he promulging his layings biographically and debatingly. Regan never foins any Jefferson geometrize stirringly, is Apostolos long and pinnatipartite enough? Isodimorphous Dylan always bloused his caddice if Bear is Iroquoian or motorises later. Brooks sports and the middleware can adjust your website very possibly be wonderful blog, email marketing templates here to apply different characteristics and Email apps you use everyday to automate your business and be more productive. For best results, go beyond just open and click rates to understand exactly how your newsletters impact your customer acquisition. Another favorite amongst most email designers, there had been a dip in the emails featuring illustrations. Turnover is vanity; profit is sanity. These people love your brand, so they would happy to contribute. Outlook, Gmail, Yahoo is not supporting custom fonts and ignore media query, it will be replaced by standard web font. Once Jack explained your background in greater detail, I immediately realized you would be the perfect person to speak with. Search for a phrase that your audience cares about. Drip marketing is easy to grasp and implement. These techniques include segmentation, mapping out email flows for your primary segments, and consolidating as much as possible into one app or platform. Typically, newsletters are used as a way to nurture customer relationships and educate your readership as opposed to selling a product. What problem do you solve? SPF authentication detects malicious emails from false email addresses. My sellers love the idea and the look you produce. Keep things fresh for your audience. -
The Scam Filter That Fights Back Final Report
The scam filter that fights back Final Report by Aurél Bánsági (4251342) Ruben Bes (4227492) Zilla Garama (4217675) Louis Gosschalk (4214528) Project Duration: April 18, 2016 – June 17, 2016 Project Owner: David Beijer, Peeeks Project Coach: Geert-Jan Houben, TU Delft Preface This report concludes the project comprising of the software development of a tool that makes replying to scam mails fast and easy. This project was part of the course TI3806 Bachelor Project in the fourth quarter of the acadamic year of 2015-2016 and was commissioned by David Beijer, CEO of Peeeks. The aim of this report is documenting the complete development process and presenting the results to the Bachelor Project Committee. During the past ten weeks, research was conducted and assessments were made to come to a design and was then followed by an implementation. We would like to thank our coach, Geert-Jan Houben, for his advice and supervision during the project. Aurél Bánsági (4251342) Ruben Bes (4227492) Zilla Garama (4217675) Louis Gosschalk (4214528) June 29, 2016 i Abstract Filtering out scam mails and deleting them does not hurt a scammer in any way, but wasting their time by sending fake replies does. As the amount of replies grows, the amount of time a scammer has to spend responding to e-mails increases. This bachelor project has created a revolutionary system that recommends replies that can be sent to answer a scam mail. This makes it very easy for users to waste a scammer’s time and destroy their business model. The project has a unique adaptation of the Elo rating system, which is a very popular algorithm used by systems ranging from chess to League of Legends. -
Social Turing Tests: Crowdsourcing Sybil Detection
Social Turing Tests: Crowdsourcing Sybil Detection Gang Wang, Manish Mohanlal, Christo Wilson, Xiao Wang‡, Miriam Metzger†, Haitao Zheng and Ben Y. Zhao Department of Computer Science, U. C. Santa Barbara, CA USA †Department of Communications, U. C. Santa Barbara, CA USA ‡Renren Inc., Beijing, China Abstract The research community has produced a substantial number of techniques for automated detection of Sybils [4, As popular tools for spreading spam and malware, Sybils 32,33]. However, with the exception of SybilRank [3], few (or fake accounts) pose a serious threat to online communi- have been successfully deployed. The majority of these ties such as Online Social Networks (OSNs). Today, sophis- techniques rely on the assumption that Sybil accounts have ticated attackers are creating realistic Sybils that effectively difficulty friending legitimate users, and thus tend to form befriend legitimate users, rendering most automated Sybil their own communities, making them visible to community detection techniques ineffective. In this paper, we explore detection techniques applied to the social graph [29]. the feasibility of a crowdsourced Sybil detection system for Unfortunately, the success of these detection schemes is OSNs. We conduct a large user study on the ability of hu- likely to decrease over time as Sybils adopt more sophis- mans to detect today’s Sybil accounts, using a large cor- ticated strategies to ensnare legitimate users. First, early pus of ground-truth Sybil accounts from the Facebook and user studies on OSNs such as Facebook show that users are Renren networks. We analyze detection accuracy by both often careless about who they accept friendship requests “experts” and “turkers” under a variety of conditions, and from [2]. -
Semantic Approaches in Natural Language Processing
Cover:Layout 1 02.08.2010 13:38 Seite 1 10th Conference on Natural Language Processing (KONVENS) g n i s Semantic Approaches s e c o in Natural Language Processing r P e g a u Proceedings of the Conference g n a on Natural Language Processing 2010 L l a r u t a This book contains state-of-the-art contributions to the 10th N n Edited by conference on Natural Language Processing, KONVENS 2010 i (Konferenz zur Verarbeitung natürlicher Sprache), with a focus s e Manfred Pinkal on semantic processing. h c a o Ines Rehbein The KONVENS in general aims at offering a broad perspective r on current research and developments within the interdiscipli - p p nary field of natural language processing. The central theme A Sabine Schulte im Walde c draws specific attention towards addressing linguistic aspects i t of meaning, covering deep as well as shallow approaches to se - n Angelika Storrer mantic processing. The contributions address both knowledge- a m based and data-driven methods for modelling and acquiring e semantic information, and discuss the role of semantic infor - S mation in applications of language technology. The articles demonstrate the importance of semantic proces - sing, and present novel and creative approaches to natural language processing in general. Some contributions put their focus on developing and improving NLP systems for tasks like Named Entity Recognition or Word Sense Disambiguation, or focus on semantic knowledge acquisition and exploitation with respect to collaboratively built ressources, or harvesting se - mantic information in virtual games. Others are set within the context of real-world applications, such as Authoring Aids, Text Summarisation and Information Retrieval. -
Read a Sample
Intelligent Computing for Interactive System Design provides a comprehensive resource on what has become the dominant paradigm in designing novel interaction methods, involving gestures, speech, text, touch and brain-controlled interaction, embedded in innovative and emerging human-computer interfaces. These interfaces support ubiquitous interaction with applications and services running on smartphones, wearables, in-vehicle systems, virtual and augmented reality, robotic systems, the Internet of Things (IoT), and many other domains that are now highly competitive, both in commercial and in research contexts. This book presents the crucial theoretical foundations needed by any student, researcher, or practitioner working on novel interface design, with chapters on statistical methods, digital signal processing (DSP), and machine learning (ML). These foundations are followed by chapters that discuss case studies on smart cities, brain-computer interfaces, probabilistic mobile text entry, secure gestures, personal context from mobile phones, adaptive touch interfaces, and automotive user interfaces. The case studies chapters also highlight an in-depth look at the practical application of DSP and ML methods used for processing of touch, gesture, biometric, or embedded sensor inputs. A common theme throughout the case studies is ubiquitous support for humans in their daily professional or personal activities. In addition, the book provides walk-through examples of different DSP and ML techniques and their use in interactive systems. Common terms are defined, and information on practical resources is provided (e.g., software tools, data resources) for hands-on project work to develop and evaluate multimodal and multi-sensor systems. In a series of in-chapter commentary boxes, an expert on the legal and ethical issues explores the emergent deep concerns of the professional community, on how DSP and ML should be adopted and used in socially appropriate ways, to most effectively advance human performance during ubiquitous interaction with omnipresent computers. -
Crowdsourcing Real-Time Viral Disease and Pest Information: A
The Sixth AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2018) Crowdsourcing Real-Time Viral Disease and Pest Information: A Case of Nation-Wide Cassava Disease Surveillance in a Developing Country Daniel Mutembesa,1 Christopher Omongo,2 Ernest Mwebaze1 1Makerere University P.O Box 7062, Kampala, Uganda. 2National Crops Resources Research Institute P.O Box 7084. Kampala, Uganda. ([email protected], [email protected], [email protected]) Abstract Because they have to work within limited budgets, this sur- vey may be delayed or fewer regions may be sampled in In most developing countries, a huge proportion of the na- a particular year. These expert surveys whilst providing an tional food basket is supported by small subsistence agricul- annual snapshot of the health of cassava across the country, tural systems. A major challenge to these systems is disease are limited in their ability to provide real-time actionable and pest attacks which have a devastating effect on the small- holder farmers that depend on these systems for their liveli- surveillance data. hoods. A key component of any proposed solution is a good In many other fields, citizen science has proven to be a disease surveillance network. However, current surveillance good intervention to supplement long-term focused systems efforts are unable to provide sufficient data for monitoring such as the surveillance system employed in Uganda (Sil- such phenomena over a vast geographic area efficiently and vertown 2009). Citizen science and more specifically crowd- effectively due to limited resources, both human and finan- sourcing can be employed to supplement such a system by cial. -
Reciprocal Enrichment Between Basque Wikipedia and Machine Translation
Reciprocal Enrichment Between Basque Wikipedia and Machine Translation Inaki˜ Alegria, Unai Cabezon, Unai Fernandez de Betono,˜ Gorka Labaka, Aingeru Mayor, Kepa Sarasola and Arkaitz Zubiaga Abstract In this chapter, we define a collaboration framework that enables Wikipe- dia editors to generate new articles while they help development of Machine Trans- lation (MT) systems by providing post-edition logs. This collaboration framework was tested with editors of Basque Wikipedia. Their post-editing of Computer Sci- ence articles has been used to improve the output of a Spanish to Basque MT system called Matxin. For the collaboration between editors and researchers, we selected a set of 100 articles from the Spanish Wikipedia. These articles would then be used as the source texts to be translated into Basque using the MT engine. A group of volunteers from Basque Wikipedia reviewed and corrected the raw MT translations. This collaboration ultimately produced two main benefits: (i) the change logs that would potentially help improve the MT engine by using an automated statistical post-editing system , and (ii) the growth of Basque Wikipedia. The results show that this process can improve the accuracy of an Rule Based MT (RBMT) system in nearly 10% benefiting from the post-edition of 50,000 words in the Computer Inaki˜ Alegria Ixa Group, University of the Basque Country UPV/EHU, e-mail: [email protected] Unai Cabezon Ixa Group, University of the Basque Country, e-mail: [email protected] Unai Fernandez de Betono˜ Basque Wikipedia and University of the Basque Country, e-mail: [email protected] Gorka Labaka Ixa Group, University of the Basque Country, e-mail: [email protected] Aingeru Mayor Ixa Group, University of the Basque Country, e-mail: [email protected] Kepa Sarasola Ixa Group, University of the Basque CountryU, e-mail: [email protected] Arkaitz Zubiaga Basque Wikipedia and Queens College, CUNY, CS Department, Blender Lab, New York, e-mail: [email protected] 1 2 Alegria et al. -
Same Gap, Different Experiences
SAME GAP, DIFFERENT EXPERIENCES An Exploration of the Similarities and Differences Between the Gender Gap in the Indian Wikipedia Editor Community and the Gap in the General Editor Community By Eva Jadine Lannon 997233970 An Undergraduate Thesis Presented to Professor Leslie Chan, PhD. & Professor Kenneth MacDonald, PhD. in the Partial Fulfillment of the Requirements for the Degree of Honours Bachelor of Arts in International Development Studies (Co-op) University of Toronto at Scarborough Ontario, Canada June 2014 ACKNOWLEDGEMENTS I would like to express my deepest gratitude to all of those people who provided support and guidance at various stages of my undergraduate research, and particularly to those individuals who took the time to talk me down off the ledge when I was certain I was going to quit. To my friends, and especially Jennifer Naidoo, who listened to my grievances at all hours of the day and night, no matter how spurious. To my partner, Karim Zidan El-Sayed, who edited every word (multiple times) with spectacular patience. And finally, to my research supervisors: Prof. Leslie Chan, you opened all the doors; Prof. Kenneth MacDonald, you laid the foundations. Without the support, patience, and understanding that both of you provided, it would have never been completed. Thank you. i EXECUTIVE SUMMARY According to the second official Wikipedia Editor Survey conducted in December of 2011, female- identified editors comprise only 8.5% of contributors to Wikipedia's contributor population (Glott & Ghosh, 2010). This significant lack of women and women's voices in the Wikipedia community has led to systemic bias towards male histories and culturally “masculine” knowledge (Lam et al., 2011; Gardner, 2011; Reagle & Rhue, 2011; Wikimedia Meta-Wiki, 2013), and an editing environment that is often hostile and unwelcoming to women editors (Gardner, 2011; Lam et al., 2011; Wikimedia Meta- Wiki, 2013). -
Analyzing Wikidata Transclusion on English Wikipedia
Analyzing Wikidata Transclusion on English Wikipedia Isaac Johnson Wikimedia Foundation [email protected] Abstract. Wikidata is steadily becoming more central to Wikipedia, not just in maintaining interlanguage links, but in automated popula- tion of content within the articles themselves. It is not well understood, however, how widespread this transclusion of Wikidata content is within Wikipedia. This work presents a taxonomy of Wikidata transclusion from the perspective of its potential impact on readers and an associated in- depth analysis of Wikidata transclusion within English Wikipedia. It finds that Wikidata transclusion that impacts the content of Wikipedia articles happens at a much lower rate (5%) than previous statistics had suggested (61%). Recommendations are made for how to adjust current tracking mechanisms of Wikidata transclusion to better support metrics and patrollers in their evaluation of Wikidata transclusion. Keywords: Wikidata · Wikipedia · Patrolling 1 Introduction Wikidata is steadily becoming more central to Wikipedia, not just in maintaining interlanguage links, but in automated population of content within the articles themselves. This transclusion of Wikidata content within Wikipedia can help to reduce maintenance of certain facts and links by shifting the burden to main- tain up-to-date, referenced material from each individual Wikipedia to a single repository, Wikidata. Current best estimates suggest that, as of August 2020, 62% of Wikipedia ar- ticles across all languages transclude Wikidata content. This statistic ranges from Arabic Wikipedia (arwiki) and Basque Wikipedia (euwiki), where nearly 100% of articles transclude Wikidata content in some form, to Japanese Wikipedia (jawiki) at 38% of articles and many small wikis that lack any Wikidata tran- sclusion.