arXiv:1809.06935v1 [cs.DL] 18 Sep 2018 eeec htfiei ilete aet s eaiepath relative a use to have like either will it <../data/survey.csv> file that reference hs ehooisrueteht/tp rgnULo a of URL origin http/https the reuse technologies these The file. ZIP relative a from within URIs resources absolute minting of for paths scheme URI app the file lik format file compressed this a .zip of in purpose packaged the typically metadata, For data. an meteorological article for general more used the also as is well as data, an astronomy datasets [ of resources transferring digital and other preservation for important ( instance for , internal resolve and generate i article this for Object a Research well as The at framewo available archive, Objects. programmatic file Research within a resources for archived inside resolve bundled to resources hypermedia reference data or linked consume to identifiers independent hsapoc a bnoe nfvu ftecombination the of and not favour track, did in Recommendation scheme abandoned of W3C application was URL the approach app corresponding this on the the further However progress file. from ZIP resolved like paths namespace own with their get would application each where iEclCoE BioExcel [ Data Linked COMBINE com sys Data information Distributed systems, retrieval systems, Content-based management Hypertext Identity pression, identifiers, sistent H2020-EINFRA-2015-1-675728 7 08IE.Ti okhsbe oea ato the of part as done been has work This IEEE. 2018 © a rgnlyitne o akgdwbapplications, web packaged for intended originally was ] The n hleg ihrgrst embedding to regards with challenge One ne Terms Index Abstract rhv omt like formats Archive e p Manifest App Web or eerhOjc Bundle Object Research h rhv n akg ac)UIscheme URI (arcp) Package and Archive The ..gz Teac R ceei nrdcdfrlocation- for introduced is scheme URI arcp —The archive rjc uddb h uoenCommision European the by funded project a , http://s11.no/2018/arcp.html#ro rhvs[ archives . Uiomrsuc oaos eatcWb Per- Web, Semantic locators, resource —Uniform nsc rhvsi o oreliably to how is archives such in sacleto fdt lswt related with files data of collection a is https://orcid.org/0000-0001-9842-9718 .B I. 3 [ 8 o ytm biology, systems for ] a oti an contain may 2 and ] BagIt h nvriyo Manchester of University The .Mr pcfi xmlsinclude examples specific More ]. ACKGROUND ). colo optrScience Computer of School 1 st rsm r-xsigWbURL Web pre-existing some or u nodrt correctly to order in but — evc Workers Service [ ta Soiland-Reyes Stian [ acetr UK Manchester, 1 6 aebe eonzdas recognized been have ] format ] odsrb h CSV the describe to suggested p R scheme URL app D Turtle RDF HDF5 CDF [ 9 .Together ]. [ 5 re-using which ] [ 4 for ] tems, and file rks . d e s s - itself. saigi edd h otfolder root The needed. if escaping /my%20project/about/intro.doc .Ietfirstructure address. Identifier IP A. public no lapt with a node on linked cloud occur a may a processing process or as to particular requirement in a archive, be data not should presence web ues: a in dataset a to reference by file or like like browser ways, repository web various a through from archive upload an of receive may independent which package, or location. archive or reference format any to archive URIs within mint to resources how specifies scheme which URI Internet-Draft (arcp) Package and Archive ro nweg ftecrepnigarchive. corresponding witho resolvable the directly of be knowledge to prior intended not are namespaces parts three archive. the of root like the archives for URL base a generate to need for them will expose to (say resources Data offline. work to application web a li allowing relative also with while together manifest application downloaded lryi a eiaporaet itnwwbbsdURIs based web new mint to inappropriate be like universally may not it introd like are ilarly may attacks they and of consistently, as risks create security purpose, to difficult this are for unique, well serve not ersne san as represented atclracieb corresponding a by archive particular a I T II. nprdb h p R ceew endthe defined we scheme URL app the by Inspired ydfiiina rpietfiri nUI[ URI an is identifier arcp an definition By h rmr s aefrac sfrcnuigapplications, consuming for is arcp for case use primary The The [ URIs file local using that clear be should It h rpItre-rf pcfistreinitial three specifies Internet-Draft arcp The uuid path HE Zenodo and i.1 tutr fac identifier arcp of Structure 1. Fig. ebun,Australia Melbourne, 2 oil Corporation Mozilla https://marcosc.com/ nd name prefix R path URI acsC´aceres Marcos or ahwihdfie o oidentify to how defines which each , FigShare P ACKAGE <../../etc/passwd> , namespace 1 . [ 12 nodrt as Linked parse to order In . e.g. ] ( / ARCP SPARQL ersn h archive the represent namespace sn percent- using — [ URI ) /path /file.txt 10 10 ,a IETF an ], o extracted for ] prefix ure) they queries), SCHEME > 12 These . with ] Sim- . nks, val- uce op do or as ut B. UUID-based identifiers With this method, anyone processing the byte-wise equal The simplest case for temporary sandbox processing of an archive (using the same hash method) will get the same archive with arcp is to generate a new random UUIDv4 [13], identifier. e.g.: Another advantage is that hash-identified archives can be retrieved from a NI resolver [14] using well known paths [15]: c6179148-3cde-4435-8e66-304453f89d59 Clients can verify the checksum of the downloaded archive, This base URI can be used when resolving relative URI so any accepting resolver endpoint can be used. references, e.g. if ref- D. Name-based identifiers erences <../data/survey.csv> we find the absolute URIs: Finally, paying homage to its origin in app URLs, arcp can use a system-based app name. This is a suggested mechanism name, where an application identifier can be directly reused The application is then able to do translation from arcp in arcp for URIs within that runtime system, e.g. to reference styles/resource1.css to local paths using URI parsing libraries to select the URI the resource within the installed com.example.myapp path, and augment that to the locally extracted path. Such package one can use the URI: arcp identifiers are temporary in nature, but the application can maintain a mapping from the UUID to the archive and As application package content do not necessarily corre- perform extraction on demand, or the archive can self-declare spond to archive file listings, it is open-ended how name-based its UUID, such as the External-Identifier header in arcp identifiers can be resolved, and indeed package content BagIt [1]. may vary per operating system, device type or application arcp also suggests how a UUID can be reliably created from version, and so name-based arcp identifiers should be treated the URL location of an archive. For instance, an application as system-local identifiers similar to file:/// URIs [11], may be processing a file from: but within a particular programming framework. http://example.com/download/archive13.zip> III. RELATED WORK The application can calculate the name-based UUIDv5 [13] A. Archive fragments by SHA1 hashing the URL string and mint: Without using arcp one could in theory still reference files within archives at an URL with # fragments: With this method anyone processing that archive URL will will still need to maintain a mapping to find the original Unlike formats like text/html or application/pdf, most archive URL. Location-based arcp identifiers may also not be archive media formats like application/zip unfortunately do not ideal for preservation purposes, as the archive might change define a fragment syntax, and some major types like tar.gz are upstream or move to a different location. not even listed in the IANA media types registry. Therefore C. Hash-based identifiers this would be an ad-hoc approach which still needs to clarify For this arcp defines a hash-based method, where the bytes details in order to be interoperable, for instance character of the archive file is used to find a checksum-based identifier escaping, if the root is # or #/, and how to reference nested based on the Naming Things With Hashes (ni) URI scheme fragment identifiers in hypermedia within archived resources. [14]. For instance if the sha-256 checksum of a Zip file is in B. File URIs hexadecimal: As argued above, file URLs [11] that represent local direc- 7f83b1657ff1fc53b92dc18148a1d65d tories are fragile and not globally unique. It is perhaps less fc2d4b1fa3d677284addd200126d9069 known that file URLs can specify a host name: After encoding the ni: uri would be: f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk> The above references a file path on the machine with the The corresponding arcp base URIs for resources within the fully qualified domain name (FQDN) host.example.com. archive is thus: The usually empty hostname is equivalent to localhost. extracted path are stable (e.g. a repository file server), but this faces the same challenges as minting http/https URLs, which resolvable. Using the same example as for OAI-ORE we can combine An ad-hoc possibility here could be to use a UUID [13] as F2R with PAV [20], as shown in figure 3. ”hostname” to represent an archive’s internal file system: The F2R approach have similar disadvantages as JAR and file://8f26cb8c-617e-46b4-bc48-e650bf70f33d OAI-ORE; in that the URIs do not support relative path /data/survey.csv/> resolution, that a web endpoint must be set up, and that the This is technically permittable as the file: URL scheme file paths are hidden through multiple steps. In addition one [11] do not define any particular connection protocols, and would need to assigned a corresponding file system name like an UUID is unlikely to be a valid hostname in DNS. mysource, although one may use a single file system as Such file: URIs could however cause confusion against exemplified above and use belongsToContainer to treat file paths on localhost, for instance Firefox 62.0 opens archive files as if they are folders. file://8cd4ce0d-4a41-4b4e-bfdd-1e2d0495f714/ to browse the local file F. EPUB canonical fragment identifiers system. EPUB is a standard for hypermedia eBooks. RO Bundle C. JAR URLs [6] is based on the EPUB Open Container Format [21]. If we restrict usage to ZIP files at a known URL, then they EPUB Canonical Fragment Identifiers [22] can link to nested are in theory also valid JAR files, and we can address files XML elements of an publication using a variation of XPath with the URL scheme: with doubled indexes: #epubcfi(/6/4[chap01ref]!/4[body01]/10[para05])> Here relative URIs may not parse well, as it is easy to The above example show an example to a paragraph with accidentally climb out of !/, and technically the JAR URI an ePub book. Here /6 refer to the 3rd element of the root scheme is missing the familiar :// to indicate for URI parser manifest’s element (which in ePub is always libraries that it is indeed an hierarchical URI scheme [12]. ), then /4[chap01ref] is the second element with :id="chap01ref". D. Object Reuse and Exchange proxies The ! character means the element’s reference is followed OAI-ORE [16] defines proxies to represent a resource as to open the corresponding XML file, where /4[body01] is aggregated in a collection; these can be used to model archives the 2nd element with id body01, traversed to find the 5th [17], but ORE proxies face two problems: How to represent element with id para05. the file path, and how to identify the proxy so it can be used While this is quite a powerful construct that can refer to as a reference in Linked Data. The resource must be identified any XML element of nested documents, even sentences or using two triples of ore:proxyFor (the archived file) and words, it seems rather contrived and inflexible. The major ore:proxyIn (the archive); but this reduces to the same limitation is that ePub archive resources are not identified problem of identifying the file. The ni URI [14] for the file by file paths, but must be addressable through rather rigid bytes can in theory be used to identify the file, but the other XML structures (order can’t change), thus this approach is missing information is the file path and name, which usually not appropriate for archives without an XML manifest. Even convey meaning for users. if using a RDF/XML manifest it would be inadvisable to The Research Object ontology’s FolderEntry specializes assume a fixed order of it’s XML elements. It seems however the ore:Proxy to add a property ro:entryName to indi- an appropriate reference scheme for ePub documents, which cate the filename, as exemplified in figure 2, but to find the full generallyhave a fixed reading order. archive file path one would have to traverse the parent folder’s ro:entryName. In either case there is no defined method IV. ARCP IMPLEMENTATIONS to predictably generate unique identifiers for the ORE proxies The arcp Python library [23] was developed to help cre- themselves, although the RO Bundle specification recommend ating, parsing and validating arcp URIs. In particular it can they should be randomly generated urn:uuid URIs, which generate arcp based on random UUIDs, URL locations, names would not be compatible with relative URIs within an archive. and hashing archive bytes. The arcp parser recognize the arcp prefix and can extract UUIDs or hashes, and can generate E. Publishing file systems as Linked Data the corresponding .well_known/ni URI for retrieving the F2R [18], using the Nepomuk File Ontology [19], defines archive. This library is meant to complement the Python 3 a way to publish file systems as Linked Data, where a server urlparse library, and so it is deemed out of scope for this endpoint exposes the files and their file system metadata. library to do resolution of arcp based on archive or network F2R URIs are localized to an endpoint and an free-text access. named file system, e.g. mysource, and files are identified The Research Object Bundle library, with UUIDs: part of Apache Taverna (incubating), is @prefix ore: . @prefix ro: . a ore:Proxy, ro:FolderEntry ; ore:proxyFor ; ore:proxyIn ; ro:entryName "survey.csv" . a ore:Aggregation, ro:Folder . a ore:Proxy, ro:FolderEntry ; ore:proxyFor ; ore:proxyIn ; ro:entryName "data/" .

Fig. 2. RDF Turtle example of how a file with the sha256 checksum 7f83b1...6d9069 could be described using RO folders and ORE proxies to belong to within the archive downloaded from

@base . @prefix nfo: . a nfo:ArchiveItem; nfo:fileName "survey.csv" ; nfo:belongsToContainer <24b34ecb-e46b-46ec-be36-a18dbba90247> . <24b34ecb-e46b-46ec-be36-a18dbba90247> a nfo:ArchiveItem; nfo:fileName "data" ; nfo:belongsToContainer <5d0a538a-ef00-48b6-bcb2-f561effe9fe5> . <5d0a538a-ef00-48b6-bcb2-f561effe9fe5> a nfo:ArchiveItem: nfo:fileName "archive13.zip" ; nfo:belongsToContainer ; pav:retrievedFrom . a nfo:Filesystem .

Fig. 3. RDF Turtle description of a file within an archive , using Nepomuk File Ontology [19], PAV [20] and F2R [18] identifiers. adding support for arcp URIs in its opening and creation from hashing the URI “location” of the activity, that would be of RO bundles, initially using the arcp UUID format as a confusing hack, as urn:uuid: references by design are not a replacement for app URIs, with planned support also resolvable, and hence technically not URLs. UUIDv5 hashing for hash-based identifiers and opening RO Bundles from a could however be appropriate for non-information resource if .well-known/ni endpoint. they have a resolvable http/https permalink. The CWLProv [24] approach for capturing prove- nance of executing Common Workflow Language is us- V. CONCLUSION ing arcp in its BagIt metadata bag-info.txt using This article propose the arcp identifier scheme for resources External-Identifier to identify its research object: within archives using formats like ZIP, tar and BagIt, and External-Identifier: suggest arcp is useful for identifying standalone Research arcp://uuid,d47d3d43-4830-44f0-aa32-4cda74849c63/ Objects and for processing Linked Data embedded in archives. draft-soilandreyes-arcp For CWLProv the use of arcp is crucial, as it assigns global The Internet-Draft [10] is identifiers for use across resources in the RO bag, including under consideration by IETF’s Applications and Real-Time the RO manifest itself and in W3C PROV file formats like Area to progress towards Informational RFC status. PROV-N and N-Triples, as neither format support relative REFERENCES URIs. In this approach the UUID component of the RO arcp iden- [1] J.A. Kunze, J. Littman, L. Madden, J. Scancella, C. Adams (2018): The BagIt File Packaging Format (V1.0) tifier d47d3d43-4830-44f0-aa32-4cda74849c63 also appears . Internet Engineering Task Force. https://datatracker.ietf.org/doc/html/draft-kunze-bagit-16 in the workflow provenance as the identifier of the top-level [2] Research Data Repository Interoperability WG (2018): Research Data workflow run (a PROV Activity): Repository Interoperability WG Final Recommendations. Research Data Alliance. https://doi.org/10.15497/RDA00025 prefix id [3] F.T. Bergmann, R. Adams, S. Moodie, J. Cooper, M. Glont, activity(id:d47d3d43-4830-44f0-aa32-4cda74849c63, M. Golebiewski, et al.,(2014): COMBINE archive and 2018-08-21T17:26:24.467636, -, OMEX format: one file to share all information to [prov:type=’wfprov:WorkflowRun’, reproduce a modeling project. BMC Bioinformatics. 15 369. prov:label="Run of workflow/packed.cwl#main"]) https://doi.org/10.1186/s12859-014-0369-z [4] Space Physics Data Facility (2016): CDF Internal Format This is showcasing how an RO that is the primary represen- Description, 3.6. NASA / Goddard Space Flight Center. tation of a non-information resource (e.g. a process) can be https://spdf.gsfc.nasa.gov/pub/software/cdf/doc/cdf364/cdf36ifd.pdf [5] The HDF Group (2016): HDF5 identified using a directly derived arcp URI. While this could Specification Version 3.0. The HDF Group. in theory also been achieved with an arcp UUIDv5 derived https://support.hdfgroup.org/HDF5/doc/H5.format.html [6] S. Soiland-Reyes, M. Gamble, R. Haines (2014): Re- ceedings of the 2007 Conference on Digital Libraries - JCDL ’07. search Object Bundle 1.0. researchobject.org Recom- https://doi.org/10.1145/1255175.1255190 mendation, Zenodo. https://w3id.org/bundle/2014-11-05/ [17] N. Ferro, G. Silvello (2013): Modeling Archives by Means of OAI- https://doi.org/10.5281/zenodo.12586 ORE, IRCDL 2012: Digital Libraries and Archives, pp 216–227. [7] System Applications Working Group (2015): The app: URL Scheme. https://doi.org/10.1007/978-3-642-35834-0 22 W3C Working Group Note 23 July 2015, World Wide Web Consortium. [18] Shaopeng He, Jianhui Li, Zhihong Shen (2013): F2R: Publish- https://www.w3.org/TR/2015/NOTE-app-uri-20150723/ ing file systems as Linked Data. 10th International Conference [8] M. C´aceres, K.R. Christiansen, M. Lamouri, A. Kostiainen, R. on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 767–772. Dolin, M. Giuca (eds.) (2018): Web App Manifest. W3C https:///doi.org/10.1109/FSKD.2013.6816297 Working Draft 04 July 2018, World Wide Web Consortium. [19] Ansgar Bernardi, Gunnar Aastrand Grimnes, Tudor Groza, Si- https://www.w3.org/TR/2018/WD-appmanifest-20180704/ mon Scerri (2011): The NEPOMUK Semantic Desktop. Con- [9] A. Russel, J. Song, J. Archibald, M. Kruisselbrink text and Semantics for Knowledge Management pp 255-273. (2017): Service Workers 1. W3C Working Draft https://doi.org/10.1007/978-3-642-19510-5 13 2 November 2017, World Wide Web Consortium. [20] P. Ciccarese, S. Soiland-Reyes, K. Belhajjame, A.J. Gray, C. https://www.w3.org/TR/2017/WD-service-workers-1-20171102/ Goble, T. Clark (2013): PAV ontology: provenance, author- [10] S. Soiland-Reyes, M. C´aceres (2018): The Archive and Package (arcp) ing and versioning. Journal of Biomedical Semantics 4:37. URI scheme. Internet-Draft draft-soilandreyes-arcp, Internet Engineer- https://doi.org/10.1186/2041-1480-4-37 ing Task Force. https://tools.ietf.org/html/draft-soilandreyes-arcp-03 [11] M. Kerwin (2017): The ”file” URI scheme. RFC Editor. RFC 8089 [21] James Pritchett, Markus Gylling (eds) (2017): EPUB https://doi.org/10.17487/RFC8089 Open Container Format (OCF) 3.1. W3C Member [12] T. Berners-Lee, R. Fielding, L. Masinter (2005): Uniform Re- Submission 25 jan 2017. World Wide Web Consortium. source Identifier (URI): Generic Syntax. RFC Editor. RFC 3986 https://www.w3.org/Submission/2017/SUBM-epub-ocf-20170125/ https://doi.org/10.17487/rfc3986 [22] Peter Sorotokin, Garth Conboy, Brady Duga, John Rivlin, Don [13] P. Leach, M. Mealling, R. Salz (2005): A universally unique Beaver, Kevin Ballard, Alastair Fettes, Daniel Weck (eds) (2017): identifier (UUID) URN namespace. RFC Editor. RFC 4122 EPUB Canonical Fragment Identifiers 1.1. Recommended Spec- https://doi.org/10.17487/rfc4122 ification 5 January 2017. International Digital Publishing Forum. [14] S. Farrell, D. Kutscher, C. Dannewitz, B. Ohlman, A. Keranen, P. http://www.idpf.org/epub/linking/cfi/epub-cfi-20170105.html Hallam-Baker (2013): Naming Things with Hashes. RFC Editor. RFC [23] S. Soiland-Reyes (2018): stain/arcp-py: arcp 0.2.0. 6920 https://doi.org/10.17487/rfc6920 Zenodo software http://arcp.readthedocs.io/en/0.2.0/ [15] M. Nottingham, E. Hammer-Lahav (2010): Defining Well-Known https://doi.org/10.5281/zenodo.1165986 Uniform Resource Identifiers (URIs), RFC Editor. RFC 5785 [24] F.Z. Khan, S. Soiland-Reyes, M.R. Crusoe, A. Lonie, R. Sin- https://doi.org/10.17487/rfc5785 nott (2018): CWLProv - Interoperable Retrospective Prove- [16] C. Lynch, S. Parastatidis, N. Jacobs, H. Van de Sompel, C. Lagoze nance capture and its challenges. In preparation. Zenodo preprint: (2007): The OAI-ORE effort: Progress, challenges, synergies. Pro- https://doi.org/10.5281/zenodo.1215611