Language Independent Named Entity Recognition


Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science by Research in Computer Science

by

MAHATHI BHAGAVATULA
201007004
[email protected]

SEARCH INFORMATION EXTRACTION AND RETRIEVAL LAB
International Institute of Information Technology
Hyderabad - 500 032, INDIA
DECEMBER 2012

Copyright © Mahathi Bhagavatula, 2012
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Language Independent Named Entity Recognition” by Mahathi Bhagavatula, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date
Adviser: Prof. Vasudeva Varma

To my mother Anantha Lakshmi, father Kutumbarao and all my dear ones

Acknowledgments

First of all, I would like to thank my advisor, Prof. Vasudeva Varma, for everything he has done for me: for the freedom he gave me in pursuing my research, and for the support he gave me at every stage where I was deviating from my research work. His regular suggestions have been of great value. It was a pleasure and a joy to work with him. His constant guidance and motivation throughout the course were invaluable and kept me going in research.

I would also like to take this opportunity to thank my parents, B. Kutumba Rao and B. Anantha Lakshmi, for their continuous encouragement and support during the course, and for the freedom they have given me throughout my research. I would like to thank my brother Yashaswi and my sister Ramayendu for their encouragement throughout the course.

I sincerely thank my lab mate Santosh GSK, without whom it would have been difficult to get through my thesis so early. I thank him for his moral support on dull days and for the knowledge he has shared with me throughout my research.
I would also like to thank my friends Ruchi, Deepthi, Swagathika, Vikram, Jatin, Nikhil and Sushma for all the motivation and encouragement they have given me throughout my course. I would like to extend my gratitude to my other lab mates Kiran, Sudheer, Srikanth and Aditya, who guided me at various stages.

Abstract

The role of the Internet in personal, economic and political advancement is growing at a fast pace. By the turn of the century, the data on the web will reach petabytes or exabytes, or may even scale to unimaginable quantities. Extracting precise and structured information from such large amounts of unstructured or semi-structured data is a major concern on the web, and is known as Information Extraction.

Named entity recognition (NER), also known as entity identification and entity extraction, is one of the important subtasks of information extraction. It seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations and locations, monetary values, percentages, and expressions of time. NER has many applications in NLP, for example in data classification, question answering, cross-language information access, machine translation and query processing.

Recognition of Named Entities (NEs) in English has reached accuracies nearing 98%. For English, many cues reveal the structure of the language (one important cue for identifying NEs is capitalization), and these cues account for the high accuracies. Indian languages offer no such cues, and moreover each Indian language differs from the others in grammatical structure. Hence, developing a language-independent NER system is a challenging task.

Previous work either builds NER systems with language-dependent tools such as POS taggers, dictionaries, chunk taggers and gazetteer lists, or relies on linguistic experts to manually tag the training and testing data or to write rules for recognizing NEs.
Language-independent approaches include supervised machine learning techniques such as CRF, HMM, MEMM and SVM. These techniques need large amounts of manually tagged data, which is again a point of concern. Other approaches exploit external knowledge such as Wikipedia, but they do not utilize Wikipedia completely. Hence, the main objective of this work is to build a language-independent NER system without any manual intervention and without using any language-dependent tools.

The approach uses language-independent methods to identify, extract and recognize NEs. NEs are identified using an external knowledge source, namely Wikipedia. More specifically, English Wikipedia is used as an aid to derive the NEs of Indian languages. The hierarchical structure of Wikipedia is explored and its documents are divided into specific domains. For each domain, the corresponding English and Indian-language documents are clustered. The English documents are tagged using the Stanford NER tagger and the non-NEs are removed. Using term co-occurrences between the tagged English words and the untagged Indian-language words, NEs in the Indian language are mapped to their English counterparts. The tag of each English NE is then copied to the mapped Indian-language NE, so the Indian-language data is tagged.

The tagged data generated in the previous step is used to recognize NEs in sets of monolingual Indian-language documents. In this step, a set of features is generated from the words of these documents, and these features are used to recognize NEs in a new document. For each document, the tagged words are first extracted using the data from the previous step.
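The identification step described above, in which the tag of an English NE is copied to a co-occurring Indian-language term, might be sketched as follows. This is a minimal illustration and not the thesis's actual procedure: the function name `project_tags`, the Dice coefficient as the association measure, the `min_dice` cutoff and the transliterated toy tokens are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def project_tags(linked_docs, english_tags, min_dice=0.8):
    """Copy English NE tags onto target-language terms via co-occurrence.

    linked_docs: list of (english_tokens, target_tokens) pairs, one pair per
        pair of linked documents (hypothetical input format).
    english_tags: dict mapping an English NE to its tag.
    min_dice: assumed cutoff on the Dice coefficient.
    """
    en_df = Counter()    # number of document pairs containing each English NE
    tgt_df = Counter()   # number of document pairs containing each target term
    cooc = Counter()     # number of document pairs containing both

    for en_tokens, tgt_tokens in linked_docs:
        en_nes = {t for t in en_tokens if t in english_tags}
        tgt_terms = set(tgt_tokens)
        en_df.update(en_nes)
        tgt_df.update(tgt_terms)
        cooc.update(product(en_nes, tgt_terms))

    projected = {}
    for en_ne in en_df:
        best_term, best_score = None, 0.0
        for (en, tgt), count in cooc.items():
            if en != en_ne:
                continue
            # Dice coefficient: high only when the two terms mostly appear together.
            score = 2 * count / (en_df[en] + tgt_df[tgt])
            if score > best_score:
                best_term, best_score = tgt, score
        if best_term is not None and best_score >= min_dice:
            projected[best_term] = english_tags[en_ne]
    return projected


# Toy data: transliterated placeholder tokens, not real corpus content.
LINKED_DOCS = [
    (["Gandhi", "visited", "Delhi"], ["gandhi", "dilli", "gaye"]),
    (["Gandhi", "wrote", "letters"], ["gandhi", "ne", "patra", "likhe"]),
    (["Delhi", "is", "large"], ["dilli", "bada", "hai"]),
]
ENGLISH_TAGS = {"Gandhi": "PERSON", "Delhi": "LOCATION"}

print(project_tags(LINKED_DOCS, ENGLISH_TAGS))
```

A normalized measure such as Dice is used here rather than raw counts so that frequent function words, which co-occur with almost everything, do not win the mapping.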
From the remaining words of the document, a Naive Bayes classifier is built, which uses these words to generate a set of features for each class (the features here are simply the important words of a particular class in that document). The importance of these features is calculated statistically using different classification metrics. Given a new document, the presence of these features is computed along with their scores. If the score exceeds a threshold, NEs are taken to be present in the document. The size of the document is then decreased and the process is repeated until the NE is found. Hence, the monolingual Indian-language document is tagged.

The approach to identifying and recognizing NEs is language independent and can be extended to any language, since it uses no language-dependent tools and involves no linguistic experts. The work was carried out for Hindi, Marathi and Telugu. PERSON, LOCATION and ORGANIZATION were the NE tags used throughout the identification and recognition process.

Wikipedia is used as the dataset for identifying NEs. Around 305,574 English documents, 100,000 Hindi documents, 83,000 Marathi documents and 85,000 Telugu documents are used to generate the results. The results are evaluated on 2,328 Hindi, 1,658 Marathi and 2,200 Telugu manually tagged Wikipedia documents. The F-measure scores are 80.42 for Hindi, 81.25 for Marathi and 79.98 for Telugu.

The dataset for recognition of NEs is a set of 33,435 Hindi documents from the FIRE corpus and 46,892 Telugu documents crawled from the web. The F-measure scores are 81.8 for Hindi and 81.6 for Telugu, evaluated on 9,000 Hindi and 12,000 Telugu manually tagged documents. The baseline systems have F-measure scores of nearly 56.81 for Hindi and 44.91 for Telugu.

These results are quite encouraging, and they outperform the baseline systems.
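The recognition step described above (class-specific feature words, a statistical importance score, and a threshold test on a new document) might be sketched as follows. The abstract does not name the metrics or the threshold, so this sketch substitutes a smoothed log-odds weight, a Naive-Bayes-style statistic; the function names, the threshold value and the English placeholder words are assumptions, and the final narrowing of the document to locate the NE is not shown.

```python
import math
from collections import Counter


def train_feature_scores(docs):
    """Learn a weight for every context word and every NE class.

    docs: list of (context_tokens, ne_class) pairs, where context_tokens are
        the words left in a document after its tagged NEs are removed
        (hypothetical input format).
    Returns {ne_class: {word: smoothed log-odds weight}}.
    """
    per_class = {}
    for tokens, ne_class in docs:
        per_class.setdefault(ne_class, Counter()).update(tokens)

    total = Counter()
    for counts in per_class.values():
        total.update(counts)
    vocab_size = len(total)
    grand_total = sum(total.values())

    scores = {}
    for ne_class, counts in per_class.items():
        in_total = sum(counts.values())
        out_total = grand_total - in_total
        scores[ne_class] = {
            # Add-one smoothed log P(word | class) - log P(word | other classes).
            word: math.log((counts[word] + 1) / (in_total + vocab_size))
            - math.log((total[word] - counts[word] + 1) / (out_total + vocab_size))
            for word in total
        }
    return scores


def detect_classes(tokens, scores, threshold=1.0):
    """Return the classes whose summed feature score exceeds the threshold."""
    found = {}
    for ne_class, weights in scores.items():
        score = sum(weights.get(token, 0.0) for token in tokens)
        if score > threshold:
            found[ne_class] = score
    return found


# Toy training data: English placeholders standing in for Indian-language words.
TRAINING_DOCS = [
    (["born", "died", "wife"], "PERSON"),
    (["born", "actor"], "PERSON"),
    (["river", "capital", "district"], "LOCATION"),
    (["capital", "city"], "LOCATION"),
]
SCORES = train_feature_scores(TRAINING_DOCS)

print(detect_classes(["born", "actor", "film"], SCORES))
print(detect_classes(["capital", "river"], SCORES))
```

Words unseen in training (such as "film" above) contribute nothing to the score, which keeps the threshold test driven only by the learned feature words.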
The approach is also language independent, unlike the baseline systems, which depend on language resources at some point in their process. Despite being language independent, the approach reaches these accuracies, which makes the system successful.

Contents

1 Introduction
  1.1 Language Independent Named Entity Recognition
  1.2 Problem Definition
    1.2.1 Motivation
    1.2.2 Problem Statement
    1.2.3 Challenges
      1.2.3.1 Variation in NEs
      1.2.3.2 Spell variations in NEs
      1.2.3.3 Disambiguation in the forms of NE
      1.2.3.4 Ambiguity with common noun
  1.3 Overview of proposed solutions
    1.3.1 Named Entity Identification
    1.3.2 Named Entity Recognition
  1.4 Contributions
  1.5 Thesis Organization
2 Related Work
  2.1 Language-Dependent Approaches
    2.1.1 Rule-Based approaches
    2.1.2 Approaches making use of Dictionaries and gazetteer lists
    2.1.3 Advantages
    2.1.4 Disadvantages
  2.2 Semi-Language-Dependent Approaches
    2.2.1 Hidden Markov Models (HMMs)
    2.2.2 Maximum Entropy Markov Models (MEMMs)
    2.2.3 Conditional Random Fields (CRF)
    2.2.4 Support Vector Machine (SVM)
    2.2.5 Decision Tree (DT)
    2.2.6 Hybrid of above approaches
    2.2.7 Advantages
    2.2.8 Disadvantages
  2.3 Language-Independent Approaches
    2.3.1 Approaches using Wikipedia
    2.3.2 Advantages
    2.3.3 Disadvantages
3 Named Entity Identification
  3.1 Role of Wikipedia in Identification of Named Entities
    3.1.1 Limitations of Previous Approaches
    3.1.2 Enhancements of this Approach
    3.1.3 Structure of Wikipedia
      3.1.3.1 Category links
      3.1.3.2 Inter-Language links
      3.1.3.3 Subtitles of the document
      3.1.3.4 Abstract
      3.1.3.5 Infobox
  3.2 Overview of the Approach
  3.3 Clustering of Similar documents
    3.3.1 Hierarchical Clustering without using Category Information of Wikipedia