Ferenda Documentation Release 0.3.1.Dev1

Total Page:16

File Type:pdf, Size:1020Kb

Ferenda Documentation Release 0.3.1.Dev1 Ferenda Documentation Release 0.3.1.dev1 Staffan Malmgren Nov 22, 2018 Contents 1 Introduction to Ferenda 3 1.1 Example.................................................3 1.2 Prerequisites...............................................5 1.3 Installing.................................................5 1.4 Features..................................................6 1.5 Next step.................................................6 2 First steps 7 2.1 Creating a Document repository class..................................8 2.2 Using ferenda-build.py and registering docrepo classes.........................8 2.3 Downloading...............................................9 2.4 Parsing.................................................. 10 2.5 Republishing the parsed content..................................... 12 3 Creating your own document repositories 17 3.1 Writing your own download implementation............................. 17 3.2 Writing your own parse implementation............................... 21 3.3 Calling relate() ........................................... 27 3.4 Calling makeresources() ...................................... 27 3.5 Customizing generate() ....................................... 28 3.6 Customizing toc() ........................................... 33 3.7 Customizing news() .......................................... 34 3.8 Customizing frontpage() ...................................... 34 3.9 Next steps................................................ 35 4 Key concepts 37 4.1 Project.................................................. 37 4.2 Configuration............................................... 37 4.3 DocumentRepository........................................... 40 4.4 Document................................................ 40 4.5 Identifiers................................................. 40 4.6 DocumentEntry.............................................. 41 4.7 File storage................................................ 41 4.8 Resources and the loadpath....................................... 43 5 Parsing and representing document metadata 45 5.1 Document URI.............................................. 45 i 5.2 Adding metadata using the RDFLib API................................ 45 5.3 A simpler way of adding metadata.................................... 46 5.4 Vocabularies............................................... 46 5.5 Serialization of metadata......................................... 47 5.6 Metadata about parts of the document.................................. 47 6 Building structured documents 49 6.1 Creating your own element classes................................... 50 6.2 Mixin classes............................................... 50 6.3 Rendering to XHTML.......................................... 50 6.4 Convenience methods.......................................... 51 7 Parsing document structure 53 7.1 FSMParser................................................ 53 7.2 A simple example............................................ 54 7.3 Writing complex parsers......................................... 56 7.4 Tips for debugging your parser...................................... 57 8 Citation parsing 59 8.1 The built-in solution........................................... 59 8.2 Extending the built-in support...................................... 60 8.3 Rolling your own............................................. 62 9 Reading files in various formats 63 9.1 Reading plain text files.......................................... 63 9.2 Microsoft Word documents....................................... 63 9.3 PDF documents............................................. 63 10 Grouping documents with facets 65 10.1 Applying facets.............................................. 65 10.2 Selectors and identificators........................................ 65 10.3 Contexts where facets are used...................................... 66 10.4 Grouping a document in several groups................................. 66 10.5 Combining facets from different docrepos................................ 67 11 Customizing the table(s) of content 69 11.1 Defining facets for grouping and sorting................................. 69 11.2 Getting information about all documents................................ 70 11.3 Making the TOC pages.......................................... 70 11.4 The first page............................................... 71 12 Customizing the news feeds 73 13 The WSGI app 75 13.1 Running the web application....................................... 75 13.2 URLs for retrieving resources...................................... 76 14 The ReST API for querying 79 14.1 Free text queries............................................. 79 14.2 Result lists................................................ 79 14.3 Parameters................................................ 80 14.4 Paging.................................................. 80 14.5 Statistics................................................. 81 14.6 Ranges.................................................. 81 14.7 Support resources............................................ 81 ii 14.8 Legacy mode............................................... 82 15 Setting up external databases 83 15.1 Triple stores............................................... 83 15.2 Fulltext search engines.......................................... 85 16 Testing your docrepo 87 16.1 Extra assert methods........................................... 87 16.2 Creating parametric tests......................................... 87 16.3 RepoTester................................................ 88 16.4 Download tests.............................................. 88 16.5 Distill and parse tests........................................... 89 16.6 Py23DocChecker............................................. 89 16.7 testparser................................................. 89 17 Advanced topics 91 17.1 Composite docrepos........................................... 91 17.2 Patch files................................................. 93 17.3 External annotations........................................... 93 17.4 Keyword hubs.............................................. 93 17.5 Custom common data.......................................... 93 17.6 Custom ontologies............................................ 94 17.7 Parallel processing............................................ 94 18 API reference 97 18.1 Classes.................................................. 97 18.2 Modules................................................. 152 18.3 Decorators................................................ 163 18.4 Errors................................................... 164 18.5 Document repositories.......................................... 166 18.6 Changes................................................. 175 19 Indices and tables 183 Python Module Index 185 iii iv Ferenda Documentation, Release 0.3.1.dev1 Ferenda is a python library and framework for transforming unstructured document collections into structured Linked Data. It helps with downloading documents, parsing them to add explicit semantic structure and RDF-based metadata, finding relationships between documents, and republishing the results. Contents 1 Ferenda Documentation, Release 0.3.1.dev1 2 Contents CHAPTER 1 Introduction to Ferenda Ferenda is a python library and framework for transforming unstructured document collections into structured Linked Data. It helps with downloading documents, parsing them to add explicit semantic structure and RDF-based metadata, finding relationships between documents, and republishing the results. It uses the XHTML and RDFa standards for representing semantic structure, and republishes content using Linked Data principles and a REST-based API. Ferenda works best for large document collections that have some degree of internal standardization, such as the laws of a particular country, technical standards, or reports published in a series. It is particularly useful for collections that contains explicit references between documents, within or across collections. It is designed to make it easy to get started with basic downloading, parsing and republishing of documents, and then to improve each step incrementally. 1.1 Example Ferenda can be used either as a library or as a command-line tool. This code uses the Ferenda API to create a website containing all(*) RFCs and W3C recommended standards. from ferenda.sources.tech import RFC, W3Standards from ferenda.manager import makeresources, frontpage, runserver, setup_logger from ferenda.errors import DocumentRemovedError, ParseError, FSMStateError config={'datadir':'netstandards/exampledata', 'loglevel':'DEBUG', 'force':False, 'storetype':'SQLITE', 'storelocation':'netstandards/exampledata/netstandards.sqlite', 'storerepository':'netstandards', 'downloadmax': 50 # remove this to download everything } setup_logger(level='DEBUG') (continues on next page) 3 Ferenda Documentation, Release 0.3.1.dev1 (continued from previous page) # Set up two document repositories docrepos= (RFC( **config), W3Standards(**config)) for docrepo in docrepos: # Download a bunch of documents docrepo.download() # Parse all downloaded documents for basefile in docrepo.store.list_basefiles_for("parse"): try: docrepo.parse(basefile) except ParseError as e: pass # or handle this in an appropriate way # Index the text content and metadata of all parsed documents for basefile in docrepo.store.list_basefiles_for("relate"): docrepo.relate(basefile, docrepos) # Prepare various assets for web site navigation makeresources(docrepos,
Recommended publications
  • Beautiful Soup Documentation Release 4.4.0
    Beautiful Soup Documentation Release 4.4.0 Leonard Richardson Dec 24, 2019 Contents 1 Getting help 3 2 Quick Start 5 3 Installing Beautiful Soup 9 3.1 Problems after installation........................................9 3.2 Installing a parser............................................ 10 4 Making the soup 13 5 Kinds of objects 15 5.1 Tag .................................................... 15 5.2 NavigableString .......................................... 17 5.3 BeautifulSoup ............................................ 18 5.4 Comments and other special strings................................... 18 6 Navigating the tree 21 6.1 Going down............................................... 21 6.2 Going up................................................. 24 6.3 Going sideways.............................................. 25 6.4 Going back and forth........................................... 27 7 Searching the tree 29 7.1 Kinds of filters.............................................. 29 7.2 find_all() .............................................. 32 7.3 Calling a tag is like calling find_all() ............................... 36 7.4 find() ................................................. 36 7.5 find_parents() and find_parent() .............................. 37 7.6 find_next_siblings() and find_next_sibling() .................... 37 7.7 find_previous_siblings() and find_previous_sibling() .............. 38 7.8 find_all_next() and find_next() ............................... 38 7.9 find_all_previous() and find_previous() .......................
    [Show full text]
  • Introduction to HTML5
    "Charting the Course ... ... to Your Success!" Introduction to HTML5 Course Summary Description HTML5 is not merely an improvement on previous versions, but instead a complete re-engineering of browser-based markup. It transforms HTML from a document description language to an effective client platform for hosting web applications. For the first time developers have native support for creating charts and diagrams, playing audio and video, caching data locally and validating user input. When combined with related standards like CSS3, Web Sockets and Web Workers it is possible to build ‘Rich Web Applications’ that meet modern usability requirements without resorting to proprietary technologies such as Flash and Silverlight. This course enables experienced developers to make use of all the features arriving in HTML5 and related specifications. During the course delegates incrementally build a user interface for a sample web application, making use of all the new features as they are taught. By default the course uses the Dojo Framework to simplify client-side JavaScript and delegates are presented with server-side code written in Spring MVC 3. Other technology combinations are possible if required. Topics Review of the Evolution of HTML Playing Audio and Video Better Support for Data Entry Hosting Clients in HTML5 Support for Drawing Images and Standards Related to HTML5 Diagrams Prerequisites Students should have experience of web application development in a modern environment such as JEE, ASP .NET, Ruby on Rails or Django. They must be very familiar with HTML4 and/or XHTML and the fundamentals of programming in JavaScript. If this is not the case then an additional ‘primer’ day can be added to the delivery.
    [Show full text]
  • Linux Toolsettoolset (Part(Part A)A)
    [Software[Software Development]Development] LinuxLinux ToolSetToolSet (part(part A)A) Davide Balzarotti Eurecom – Sophia Antipolis, France 1 If Unix can Do It, You Shouldn't As a general rule: Don't implement your own code unless you have a solid reason why common tools won't solve the problem Modern Linux distributions are capable of performing a lot of tasks Most of the hard problems have already been solved (by very smart people) for you... … you just need to know where to look … and choose the right tool for your job 2 Different Tools for Different Jobs Locate the data Extract the data Filter the data Modify the data 3 Getting Help Any Linux distribution comes with thousands of files in /usr/bin Knowing each command is difficult Knowing each option of each command is practically impossible The most useful skill is to know where to look for help Google knows all the answers (if you know how to pose the right questions) help ± documentation for the shell builtin commands cmd --help ± most of the programs give you some help if you ask 4 The Manual Unix man pages were introduced in 1971 to have an online documentation of all the system commands The pages are organized in 8 sections: 1. general commands 5. file formats 2. system calls 6. games 3. C functions 7. misc 4. special files 8. admin commands CAL(1) BSD General Commands Manual CAL(1) NAME cal, ncal - displays a calendar and the date of easter SYNOPSIS cal [-3jmy] [[month] year] ncal [-jJpwy] [-s country_code] [[month] year] ncal [-Jeo] [year] DESCRIPTION The cal utility displays a simple calendar in traditional format and ncal offers an alternative layout, more options and the date of easter 5 Other forms of Help info ± access the Info pages Official documentation of the GNU project Simple hypertext apropos ± searches for keywords in the header lines of all the manpages balzarot:~> apropos calendar cal (1) - displays a calendar and the date of easter calendar (1) - reminder service gcal (1) - a program for calculating and printing calendars.
    [Show full text]
  • BSCW Administrator Documentation Release 7.2.1
    BSCW Administrator Documentation Release 7.2.1 OrbiTeam Software Oct 10, 2019 CONTENTS 1 How to read this Manual1 2 Installation of the BSCW server3 2.1 General Requirements........................................3 2.2 Security considerations........................................4 2.3 EU - General Data Protection Regulation..............................4 2.4 Upgrading to BSCW 7.2.1......................................5 2.4.1 Upgrading on Unix..................................... 13 2.4.2 Upgrading on Windows................................... 17 3 Installation procedure for Unix 19 3.1 System requirements......................................... 19 3.2 Installation.............................................. 20 3.3 Software for BSCW Preview..................................... 25 3.4 Configuration............................................. 29 3.4.1 Apache HTTP Server Configuration............................ 29 3.4.2 BSCW instance configuration............................... 33 3.4.3 Administrator account................................... 35 3.4.4 De-Installation....................................... 35 3.5 Database Server Startup, Garbage Collection and Backup..................... 36 3.5.1 BSCW Startup....................................... 36 3.5.2 Garbage Collection..................................... 37 3.5.3 Backup........................................... 37 3.6 Folder Mail Delivery......................................... 38 3.6.1 BSCW mail delivery agent (MDA)............................. 38 3.6.2 Local Mail Transfer Agent
    [Show full text]
  • Doctype Switching in Modern Browsers
    Thomas Vervik, July 2007 Doctype switching in modern browsers Summary: Some modern browsers have two rendering modes. Quirk mode renders an HTML document like older browsers used to do it, e.g. Netscape 4, Internet Explorer 4 and 5. Standard mode renders a page according to W3C recommendations. Depending on the document type declaration present in the HTML document, the browser will switch into either quirk mode, almost standard or standard mode. If there is no document type declaration present, the browser will switch into quirk mode. This paragraph summaries this article. I will explain how the main browsers on the marked today determine which rendering mode to use when rendering the (x)html documents they receive. I have tested nearly all my assertions in Internet Explorer 6, Firefix 2 and Opera 9.02. The validation is done at the official W3 validation page http://validator.w3.org. Some of my assertions are tested using pages on the net. This is done when testing the media types ‘text/html’ and ‘application/xhtml+xml’with html and xhtml with both legal and illegal syntax. My previous article was full of vague assertions and even things that were directly wrong. This should not be the case in this article where nearly all the assertions are tested. One section I should be humble about is the ‘Doctype dissection’. Finding good sources decribing these in more detail than pages and books just briefly describing their syntax proved hard, but I have done my best and have also described in the text which section I’m certain about and the one I am more uncertain about.
    [Show full text]
  • Second Exam December 19, 2007 Student ID: 9999 Exam: 2711 CS-081/Vickery Page 1 of 5
    Perfect Student Second Exam December 19, 2007 Student ID: 9999 Exam: 2711 CS-081/Vickery Page 1 of 5 NOTE: It is my policy to give a failing grade in the course to any student who either gives or receives aid on any exam or quiz. INSTRUCTIONS: Circle the letter of the one best answer for each question. 1. Which of the following is a significant difference between Windows and Unix for web developers? A. The Apache web server works only on Windows, not on Unix. B. Dreamweaver cannot be used to develop web sites that will be hosted on Unix. C. File and directory names are case-sensitive on Unix, but not on Windows; you have to do a case- sensitive link check if you develop pages on Windows but they might be copied to a Unix-hosted server. D. Firefox works only on Unix, not Windows. E. Windows is just another name for Unix; there is no difference between them at all. 2. What is the purpose of the css directory of a web site? A. It holds JavaScript programs. B. It holds XHTML Validation code. C. It holds background images. D. It holds stylesheets. E. It is needed for compatibility with Apache and PHP. 3. What is the purpose of the images directory of a web site? A. To hold the <img> tags for the site. B. To hold the stylesheets for images. C. To differentiate between JPEG and PNG images. D. To hold the imaginary components of the DOCTYPE. E. To hold photographic and graphical images used in the site.
    [Show full text]
  • Lxmldoc-4.5.0.Pdf
    lxml 2020-01-29 Contents Contents 2 I lxml 14 1 lxml 15 Introduction................................................. 15 Documentation............................................... 15 Download.................................................. 16 Mailing list................................................. 17 Bug tracker................................................. 17 License................................................... 17 Old Versions................................................. 17 2 Why lxml? 18 Motto.................................................... 18 Aims..................................................... 18 3 Installing lxml 20 Where to get it................................................ 20 Requirements................................................ 20 Installation................................................. 21 MS Windows............................................. 21 Linux................................................. 21 MacOS-X............................................... 21 Building lxml from dev sources....................................... 22 Using lxml with python-libxml2...................................... 22 Source builds on MS Windows....................................... 22 Source builds on MacOS-X......................................... 22 4 Benchmarks and Speed 23 General notes................................................ 23 How to read the timings........................................... 24 Parsing and Serialising........................................... 24 The ElementTree
    [Show full text]
  • Linux Floobydust.Pdf
    Linux Floobydust Copyright (c) 2011 Dimitri Marinakis Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License". Table of Contents Introduction History Acknowledgements References Disclaimer Linux applications GNOME Desktop Autostart Firefox - disable uninitiated and unintended requests Allow desktop access to other local users Decode/encode base64 files Wireless adapters – ndiswrapper Camera access Midnight Commander file manager – zip file issue Find files modified today Formatted table of contents Floppy disk directory listing CD disk directory listing CD Directory tree Change directory – file mode Change file date and time PCTel modem Remove control characters from files Poor man's concordance Linux Floobydust.111215 Page 1 of 48 Count lines and words Convert doc files to txt Remove zero bytes from the input file Remove empty paragraphs Find ASCII files Convert filenames from one encoding to another Change filenames to lower case Change filenames to upper case Crypto file system Compare directories Compare directory structures (file systems) Video conversions Use a computer as a gateway to the Internet Deploy a system on a bunch of new computers Search files for a pattern Inter-process communications (IPC) Image and pdf conversions Infra-red port (Toshiba Satellite 1800) GPRS info for various networks, Linux Print using gs Completely overwrite and delete files Completely overwrite 720kb floppy Completely overwrite 1.4Mb floppy Completely overwrite device USB etc.
    [Show full text]
  • VISA/APCO/STAC 2P61 WEBSITE CREATION Fall Term 2012 ______
    VISA/APCO/STAC 2P61 WEBSITE CREATION Fall Term 2012 __________________________________________________________________________________ GETTING STARTED WITH XHTML AND CSS Do you have the book recommended for this course? HTML and CSS Web Standards Solutions A Web Standardistasʼ Approach Christopher Murphy and Nicklas Persson If you donʼt have the recommended book please find another. You will need something to refer to. If not the book…here are some websites where you can find good tutorials for writing standards based XHTML and CSS http://www.w3.org/MarkUp/Guide/ http://www.w3.org/MarkUp/2004/xhtml-faq http://www.htmldog.com http://xhtml.com/en/xhtml/reference/ http://www.webheadstart.org/xhtml/basics/index.html http://daringfireball.net/projects/markdown/ http://www.alistapart.com/ HTML Hypertext markup language Create with range of tools – plain text editor (with formatting turned off) Will be read by variety of devices Designed to be read by web browsers Non proprietary Open source Free Structures text to hyperlinks W3C 1995 first specs written. Current is 4.0 written in 1999 (HTML 5 in the works) XHTML XML (no defined tags like HTML – has defined structure) Extensive Markup Language Is HTML reformatted in XML HTML with strict rules of XML Varieties of XHTML – 1.0 Strict, (we use this one) 1.0 Transitional, 1.0 Frameset 1 The value of Standards • Avoid (as much as possible) sites that wonʼt display as written • More accessible sites • XHTML and CSS written to strict standards • Enables site to perform predictably on any standards compliant browser or OS • Improves Development time • Ease of updating • Search engine ranking XHTML/CSS Separates Content from Visual presentation/design Avoids Tag Soup of HTML where tags were used to control both how content is structured and how it looks.
    [Show full text]
  • Browser Mode and Document Mode
    Browser Mode And Document Mode When Hamish void his creatorship rebroadcasts not tarnal enough, is Ali angriest? Measlier and andtempestuous Burman Davon Rayner never transvalue slogs out-of-doorsher air kayo plumbwhen Titusor idolizes bushwhacks neutrally, his is corpus. Elton quadricipital? Batty Change to other one of the available browser modes available. How can test this kind of web pages? Although the Quirks mode is primarily about CSS, so that the website will not break. Add the header to all resources. Configure it as an HTTP response header as below. It is possible to bulk add from a file in the Site list manager. For the best experience, the only other potential cause for this would be Compatibility View Settings. The result is a cloudy atmospheric look. If the Compatibility View Settings checkboxes are checked, requests, California. They have the tools needed to list which sites can be whitelisted to run in compatibility mode. Determines how the Spread control renders content for a specific browser mode. Modern versions of IE always have the Trident version number in the User Agent string. Are you facing any issues? Some engines had modes that are not relevant to Web content. Or better yet, it would be nice to use a method that does not depend on the navigator. Since different browsers support different objects, instead of resolving domain names to the web hosts, techniques and components interact. Suppose the DTD is copied to example. Thanks for the feedback Jamie. When adding URL, those sites may not work correctly anymore, removing the code should be enough to stop sending it.
    [Show full text]
  • Beautiful Soup Documentation Release 4.2.0
    Beautiful Soup Documentation Release 4.2.0 Leonard Richardson February 26, 2014 CONTENTS 1 Getting help 3 2 Quick Start 5 3 Installing Beautiful Soup 9 3.1 Problems after installation........................................9 3.2 Installing a parser............................................ 10 4 Making the soup 13 5 Kinds of objects 15 5.1 Tag .................................................... 15 5.2 NavigableString .......................................... 17 5.3 BeautifulSoup ............................................ 17 5.4 Comments and other special strings................................... 17 6 Navigating the tree 19 6.1 Going down............................................... 19 6.2 Going up................................................. 22 6.3 Going sideways.............................................. 23 6.4 Going back and forth........................................... 25 7 Searching the tree 27 7.1 Kinds of filters.............................................. 27 7.2 find_all() .............................................. 29 7.3 Calling a tag is like calling find_all() ............................... 33 7.4 find() ................................................. 33 7.5 find_parents() and find_parent() .............................. 34 7.6 find_next_siblings() and find_next_sibling() .................... 34 7.7 find_previous_siblings() and find_previous_sibling() .............. 35 7.8 find_all_next() and find_next() ............................... 35 7.9 find_all_previous() and find_previous() .......................
    [Show full text]
  • Free Software to Edit Pdf Documents
    Free Software To Edit Pdf Documents SearchingThornie often and entoils epidemic proportionally Godwin misspeaks when reigning so whistlingly Demetre that clearcole Saunders laigh busses and literalized his tames. her sternums. discriminatively.Unmetaphysical Baxter emblematizes her dumbwaiter so probably that Donald vitriolizing very If each item, sign your convenience of to free software Unfortunately does it goes to do a perfect solution designed programs will run an agent you would with the options such as per page. An interpreter-in-one free online PDF editor that unite not require subscriptions or installations DeftPDF is fault free online tool that makes editing and converting easy in. Pdf software remains private, free for documentation easily upload fonts, text and change. Including the ability to edit protect convert annotate password protect manage sign PDF documents with opportunity This software includes a somewhat trial. PDF-XChange Editor. Pdf documents like to free version allows you! Which cool the route free PDF editor software? Wondering how children edit PDF files Look snow further than DocFly Easily edit tool on PDF documents with him free online PDF editor No extra stuff to download. Prices are packed with more features available to flipbook and use inside the help of mobile. PDF Buddy Online PDF Editor. Best Free PDF Editor for Windows in 2021. Convert PDF Quickly As you would expect something useful PDF editing software also offers a built-in PDF converting feature It supports converting PDF documents to. In art case with brief introduction to possess of those Top free paid PDF editors with their. There are free document without requiring the documents into beautiful tunes sung by.
    [Show full text]