From Paper Book to a Digital One on Wikisource

[[User:Xelgen]]
Aleksey Chalabyan
Armenian Wikipedia (hy.wikipedia.org)
Armenian Wikisource (hy.wikisource.org)

Wikisource
● Launched in 2003
● 69 languages
● Over 4 million pieces

Image requirements
● 300 DPI or more
● As few geometrical/optical distortions as possible
● Evenly lit
● Color or grayscale

Flatbed
Image by Fir0002, CC-BY-SA, from Wikimedia Commons

ADF (Auto Document Feeder)

Document feeder scanner
Image by [email protected], CC-BY-SA, from Wikimedia Commons

Camera (or phone camera)
Image by Plasmarelais, CC-BY-SA, from Wikimedia Commons

Hand scanner
Images by GBPublic_PR and Zoliverz, CC-BY-SA, Wikimedia Commons

DIY Book Scanner (http://diybookscanner.org)
Image by daniel reetz, from http://diybookscanner.org

Planetary document scanner
Image by JamesMoorey, CC-BY-SA, from Wikimedia Commons

Prism book scanner (http://prismscanner.org)

Professional book scanners
Image by Marie-Lan Nguyen and Ra Boe, CC-BY-SA, from Wikimedia Commons

Scanner type | Quality | Time and effort per page | Damage to book | Price | Availability
Flatbed | High | A lot | Somewhat | 50-100$ | Easy to find
ADF on flatbed/MFD | High | Very low | Close to irreversible | 250-400$ | Not hard to find
Document scanner (feeder) | High | Extremely low | Close to irreversible | 300-450$ | Need to order one
Camera/Smartphone | Low | Significant | None | 150$+ | You probably have one
Hand scanner | Low | Too much | Almost none | 50-80$ | Need to order one
DIY Book scanner | High | Very low | Almost none | 300-500$ | Not hard to build it yourself
Planetary document scanner | Medium (?) | (?) | None | 800$+ | Order
Linear book scanner | High | Very low | Somewhat | ~1500$ | Hard to build one, store and maintain
Pro book scanner | High | Very low | Usually none | 10 000$+ | Very hard to get

Taking book apart
Image by Xelgen, CC-BY-SA

Scan Tailor (http://scantailor.org)
● Fix rotation
● Split pages
● Deskew
● Autoselect content
● Set up margins

OCR (Optical Character Recognition)
● ABBYY FineReader
● CuneiForm
● Tesseract

Watch out before OCR

Wikipedia vs. Wikisource

Wikisource Index page

1. Find a book which is free (or make it free)
2. Prepare your book for scanning
3. Scan it*
4. Rename files if needed*
5. Crop and straighten images with ScanTailor*
6. Additional corrections with any image batch editor (e.g. ImageMagick or XnView)*
7. OCR*
8. Analyze and fix common mistakes in OCR software*
9. Export it as DjVu*
10. Upload to Commons or Wikisource
11. Create Index page on Commons
12. Start proofreading and encourage others to*
* Double check your results

Thank you! Questions?
[[User:Xelgen]]
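
Step 4 of the workflow above ("Rename files if needed") can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the talk: the function names, the "page" prefix and the four-digit padding are all assumptions. The idea is to give every scan a zero-padded sequential name, so that later tools see the pages in reading order (plain alphabetical sorting would put scan10 before scan2).

```python
import re
from pathlib import Path


def natural_key(name):
    """Split a name into text and integer chunks so 'scan10' sorts after 'scan2'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def plan_renames(filenames, prefix="page", width=4):
    """Return (old, new) pairs; the prefix and width are hypothetical defaults."""
    ordered = sorted(filenames, key=natural_key)
    return [(old, f"{prefix}_{index:0{width}d}{Path(old).suffix.lower()}")
            for index, old in enumerate(ordered, start=1)]


def apply_renames(folder, pairs):
    """Rename files inside folder, refusing to overwrite an existing file."""
    folder = Path(folder)
    for old, new in pairs:
        target = folder / new
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        (folder / old).rename(target)


if __name__ == "__main__":
    # Dry run: print the planned renames without touching any files.
    for old, new in plan_renames(["scan10.TIF", "scan2.tif", "scan1.tif"]):
        print(old, "->", new)
```

Printing the plan first and calling apply_renames only after checking it fits the "double check your results" note attached to this step.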