Ferenda Documentation Release 0.3.1.Dev1
Total Page:16
File Type:pdf, Size:1020Kb
Ferenda Documentation Release 0.3.1.dev1 Staffan Malmgren Nov 22, 2018 Contents 1 Introduction to Ferenda 3 1.1 Example.................................................3 1.2 Prerequisites...............................................5 1.3 Installing.................................................5 1.4 Features..................................................6 1.5 Next step.................................................6 2 First steps 7 2.1 Creating a Document repository class..................................8 2.2 Using ferenda-build.py and registering docrepo classes.........................8 2.3 Downloading...............................................9 2.4 Parsing.................................................. 10 2.5 Republishing the parsed content..................................... 12 3 Creating your own document repositories 17 3.1 Writing your own download implementation............................. 17 3.2 Writing your own parse implementation............................... 21 3.3 Calling relate() ........................................... 27 3.4 Calling makeresources() ...................................... 27 3.5 Customizing generate() ....................................... 28 3.6 Customizing toc() ........................................... 33 3.7 Customizing news() .......................................... 34 3.8 Customizing frontpage() ...................................... 34 3.9 Next steps................................................ 35 4 Key concepts 37 4.1 Project.................................................. 37 4.2 Configuration............................................... 37 4.3 DocumentRepository........................................... 40 4.4 Document................................................ 40 4.5 Identifiers................................................. 40 4.6 DocumentEntry.............................................. 41 4.7 File storage................................................ 41 4.8 Resources and the loadpath....................................... 43 5 Parsing and representing document metadata 45 5.1 Document URI.............................................. 45 i 5.2 Adding metadata using the RDFLib API................................ 45 5.3 A simpler way of adding metadata.................................... 46 5.4 Vocabularies............................................... 46 5.5 Serialization of metadata......................................... 47 5.6 Metadata about parts of the document.................................. 47 6 Building structured documents 49 6.1 Creating your own element classes................................... 50 6.2 Mixin classes............................................... 50 6.3 Rendering to XHTML.......................................... 50 6.4 Convenience methods.......................................... 51 7 Parsing document structure 53 7.1 FSMParser................................................ 53 7.2 A simple example............................................ 54 7.3 Writing complex parsers......................................... 56 7.4 Tips for debugging your parser...................................... 57 8 Citation parsing 59 8.1 The built-in solution........................................... 59 8.2 Extending the built-in support...................................... 60 8.3 Rolling your own............................................. 62 9 Reading files in various formats 63 9.1 Reading plain text files.......................................... 63 9.2 Microsoft Word documents....................................... 63 9.3 PDF documents............................................. 63 10 Grouping documents with facets 65 10.1 Applying facets.............................................. 65 10.2 Selectors and identificators........................................ 65 10.3 Contexts where facets are used...................................... 66 10.4 Grouping a document in several groups................................. 66 10.5 Combining facets from different docrepos................................ 67 11 Customizing the table(s) of content 69 11.1 Defining facets for grouping and sorting................................. 69 11.2 Getting information about all documents................................ 70 11.3 Making the TOC pages.......................................... 70 11.4 The first page............................................... 71 12 Customizing the news feeds 73 13 The WSGI app 75 13.1 Running the web application....................................... 75 13.2 URLs for retrieving resources...................................... 76 14 The ReST API for querying 79 14.1 Free text queries............................................. 79 14.2 Result lists................................................ 79 14.3 Parameters................................................ 80 14.4 Paging.................................................. 80 14.5 Statistics................................................. 81 14.6 Ranges.................................................. 81 14.7 Support resources............................................ 81 ii 14.8 Legacy mode............................................... 82 15 Setting up external databases 83 15.1 Triple stores............................................... 83 15.2 Fulltext search engines.......................................... 85 16 Testing your docrepo 87 16.1 Extra assert methods........................................... 87 16.2 Creating parametric tests......................................... 87 16.3 RepoTester................................................ 88 16.4 Download tests.............................................. 88 16.5 Distill and parse tests........................................... 89 16.6 Py23DocChecker............................................. 89 16.7 testparser................................................. 89 17 Advanced topics 91 17.1 Composite docrepos........................................... 91 17.2 Patch files................................................. 93 17.3 External annotations........................................... 93 17.4 Keyword hubs.............................................. 93 17.5 Custom common data.......................................... 93 17.6 Custom ontologies............................................ 94 17.7 Parallel processing............................................ 94 18 API reference 97 18.1 Classes.................................................. 97 18.2 Modules................................................. 152 18.3 Decorators................................................ 163 18.4 Errors................................................... 164 18.5 Document repositories.......................................... 166 18.6 Changes................................................. 175 19 Indices and tables 183 Python Module Index 185 iii iv Ferenda Documentation, Release 0.3.1.dev1 Ferenda is a python library and framework for transforming unstructured document collections into structured Linked Data. It helps with downloading documents, parsing them to add explicit semantic structure and RDF-based metadata, finding relationships between documents, and republishing the results. Contents 1 Ferenda Documentation, Release 0.3.1.dev1 2 Contents CHAPTER 1 Introduction to Ferenda Ferenda is a python library and framework for transforming unstructured document collections into structured Linked Data. It helps with downloading documents, parsing them to add explicit semantic structure and RDF-based metadata, finding relationships between documents, and republishing the results. It uses the XHTML and RDFa standards for representing semantic structure, and republishes content using Linked Data principles and a REST-based API. Ferenda works best for large document collections that have some degree of internal standardization, such as the laws of a particular country, technical standards, or reports published in a series. It is particularly useful for collections that contains explicit references between documents, within or across collections. It is designed to make it easy to get started with basic downloading, parsing and republishing of documents, and then to improve each step incrementally. 1.1 Example Ferenda can be used either as a library or as a command-line tool. This code uses the Ferenda API to create a website containing all(*) RFCs and W3C recommended standards. from ferenda.sources.tech import RFC, W3Standards from ferenda.manager import makeresources, frontpage, runserver, setup_logger from ferenda.errors import DocumentRemovedError, ParseError, FSMStateError config={'datadir':'netstandards/exampledata', 'loglevel':'DEBUG', 'force':False, 'storetype':'SQLITE', 'storelocation':'netstandards/exampledata/netstandards.sqlite', 'storerepository':'netstandards', 'downloadmax': 50 # remove this to download everything } setup_logger(level='DEBUG') (continues on next page) 3 Ferenda Documentation, Release 0.3.1.dev1 (continued from previous page) # Set up two document repositories docrepos= (RFC( **config), W3Standards(**config)) for docrepo in docrepos: # Download a bunch of documents docrepo.download() # Parse all downloaded documents for basefile in docrepo.store.list_basefiles_for("parse"): try: docrepo.parse(basefile) except ParseError as e: pass # or handle this in an appropriate way # Index the text content and metadata of all parsed documents for basefile in docrepo.store.list_basefiles_for("relate"): docrepo.relate(basefile, docrepos) # Prepare various assets for web site navigation makeresources(docrepos,