Experiences in using the OAI-PMH through the construction of the OA-Hermes Metasearch engine

Egar Arturo García Cárdenas, Ana Patricia Gómez Mayén, Grecia García García, Alberto Castro Thompson

Main Directorate of Libraries, National Autonomous University of México
Tel: +52 (55) 56223969

Abstract:

OAI-PMH has emerged as a more efficient way of facilitating the dissemination of data content. Its success is shown by the important information sources that use this protocol, notably Scielo, ArXiv and PubMed, among others. Nevertheless, its evolution shows a certain lag behind current trends and the demands of new technological needs; that is to say, the latest technical and methodological characteristics that allow a fuller exploitation of information are continuously required.

In order to make this new need clear for this recognized form of interoperability, the experience gained during the development of the metasearch engine OA-Hermes, which groups different sources of information in a single interface, will be presented. One achieved benefit is a decrease in the time spent consulting diverse sources of information, since from a single search the results of different sources, such as institutional repositories, digital libraries and databases, among others, are semantically integrated.

The technical experience obtained in analyzing the OAI-PMH protocol led us to reflect on different aspects that have to do with work methodology, the marketing or fashion of its use, its ease of implementation, and information redundancy, just to mention a few.

Key words:

OAI-PMH - OA-Hermes - Metasearch engine - Information exchange protocols - HTTP - Z39.50.

1. Introduction

The first part of the present paper consists of a brief, step-by-step description of OA-Hermes; then, to frame the resulting experiences, some analysis will be presented, as well as some final considerations. First, it is necessary to establish that OAI-PMH is an initiative that emerges as an additional option to facilitate the dissemination of digital contents on the Internet. As a collective example of this, there are important and popular initiatives that currently offer their services in such a way. However, it is important to mention that the protocol presents certain technical delays in its evolution up to this moment, which should be known,

considered and taken into account, as well as the advantages offered.

In the course of this paper, on one hand, aspects such as work methodology, the marketing or fashion of its use, its relative ease of implementation and information redundancy will be detailed in depth.

On the other hand, it is necessary to mention that some disadvantages have been detected in the OAI-PMH protocol, which will be described throughout this work, although this is not meant to disqualify it or to suggest that it should no longer be used. On the contrary, the intention is to extend the knowledge about it by considering its supporting bases. To achieve this, all the experiences resulting from the construction of OA-Hermes will be set out.

Among the several conflicts that appeared initially was the lack of methods to suitably explore the resources of an information source. This resulted in making local copies of all the collections offered by the original source in order to process them, which demands more hardware and software from the collector.

In general terms, the experiences in using OAI-PMH while constructing the metasearch engine OA-HERMES are summarized from a critical, well-founded perspective that emphasizes extending or evolving OAI-PMH as soon as possible for those interested in implementing it. With these aspects in mind, the following sections are presented in this paper: integration proposal of Open Access resources; OA-HERMES objectives; OAI-PMH advantages; OA-HERMES conceptual development; detected problems in the construction of OA-HERMES; some OAI-PMH disadvantages; OA-HERMES characteristics, and final considerations.

2. Integration Proposal of Open Access Resources: OA-Hermes

Since the beginning of the WWW, external information systems that were already online have tended to migrate to this environment, bringing with them an important increase in communication protocols, information resources, communication standards, search engines and indexers, e-commerce, e-science and so on, coming to form, apparently, a parallel world called "e", the initial placed before almost any term.

This great increase of the e-world and of sources of information has generated a new accessory in our lives, which has forced us to think: "If I can't find something on the net, it does not exist". It has become clear that, while it provides rich contents and digital services to numerous communities, it also carries several kinds of problems with it. That is to say, Newton's third law is still valid: "For every action there is an equal and opposite reaction".

Perhaps one of the main reactions or problems observed is the heterogeneous, exponential growth of information, along with everything it involves.

Great efforts are being analyzed and developed to ease and improve the access

to information, yet it is still complicated to access articles, journals, books or other types of digital resources on a certain subject. It is also true that not all the blame can be attributed to the protocol or system. On several occasions, it is the users who are not familiar with the search interface of the resource, nor with the great number of sources that are available to them. In addition to the previous cases, there are those in which users carry out repetitive searches in different Internet sites, making the retrieval of their information difficult; and the different forms in which each information source displays the results obtained also become an obstacle to be overcome by the end user.

Taking into account the reasons and problems stated before, OA-Hermes (metasearch engine and inter-connector for open access information sources) emerges with the purpose of facilitating, and reducing the time invested in, searching and retrieving open access information with academic validity. Furthermore, OA-HERMES is a tool that favors the integration of collections and repositories of educational institutions in Mexico.

OA-Hermes was conceived by a team from the Main Directorate of Libraries, the Institute of Cellular Physiology and the Institute of Biotechnology of the National Autonomous University of Mexico. When its potential was analyzed, the proposal arose to develop it as a tool available not just for the UNAM, but open for use and consultation by the academic community of the country and the Internet as well.

By the end of 2004, OA-Hermes needed some financing, so it was presented at CUDI 2004 (University Partnership for the Development of Internet 2 in Mexico). Since the call for papers requested the inclusion of two educational institutions in the country, the University of Colima was invited.

OA-HERMES Objectives

The main objective of OA-Hermes is to group several sources of information in a single interface. In this way, when a user wishes to make a search, the user only needs the interface that OA-Hermes offers, which directs the search to each one of the sources of information chosen.

In the conception of OA-Hermes the following objectives were considered:

• The incorporation of reliable and high quality sources of information.
• The access to specialized sources of information, many of which are within the Invisible Internet.
• To take advantage of those open access resources, thus enriching the digital libraries of the academic institutions, especially those that have limited economic resources for acquiring or subscribing to digital resources.
• To construct a modular system that would allow its growth and diversification.
• To favor the visibility of electronic resources produced in Mexico.

OA-Hermes Conceptual Development

OA-Hermes organizes the obtained results from the information sources to be

shown to the user later on. Before the data are presented there is a process of semantic integration, in which the metadata of the retrieved information are extracted and later unified for their presentation to the user and for any additional processes that could be required.

For the conceptual design of OA-Hermes, the following criteria were considered:

1. Extensible
2. Configurable
3. Concurrent searches, on the user's demand.
4. Flexible information management
5. Response time
6. Capacity to be developed by a group.

On the basis of these established criteria, an architecture based on three main components was obtained:

1. The Nucleus, which stores and handles the data obtained from the searches in the different information sources. It also directs the searches to the selected sources and provides the results according to the user's demand.

2. The interface shows the results to the user. XML is handled internally to display the results, but in order to show the results in different formats, XSL style sheets can be included. OA-Hermes includes a style sheet that displays the results in HTML. This format is the one that is handled by default.

3. Search engines are programs that connect themselves to the different information sources and obtain the results from them. These unify the obtained data and send them to the nucleus for their management.

OA-Hermes Characteristics.

OA-Hermes is a proposal aimed at saving time for those looking for information on the Web. Instead of going to multiple sites and learning to use their respective interfaces, the user can simply use the single interface that OA-Hermes offers. It is worth mentioning that one of the objectives with which OA-Hermes was conceived was simplicity in its internal design and in its user interface, in addition to a low cost of the architecture on which it works.

Making a brief comparison with other search engines, it is important to mention that those used on the Internet store indexes to organize the information on the part of the Web that is covered. This requires an enormous amount of storage resources. Nevertheless, these search engines do not assume that the included sources of information have their own search mechanisms. OA-Hermes tries to avoid large information deposits by taking advantage of the storage and search mechanisms that the different sources of information offer.

The information sources that are integrated in OA-Hermes handle different communication protocols and formats to display the results. It is frequent to find sources that use Z39.50 as a protocol and MARC or SUTRS as presentation formats. It is also frequent that HTTP is used as a protocol and HTML as a format. Furthermore, OA-Hermes includes the OAI-PMH sources to share their

information. The one this paper is focused on is the Z39.50 protocol.

In relation to the operating system environment, OA-Hermes was built with the idea of being multi-platform, using the Java programming language and the Servlets technology. In order to put it on the Web, Tomcat, in communication with Apache, was used. In the beginning, it was decided to use the Perl language for its development, but after a careful evaluation Java was used due to its concurrency handling capabilities, IP at several levels, modularization and documentation.

Detected Problems in the construction of OA-HERMES.

Here is a list of the most relevant problems faced while constructing OA-HERMES:

• Incompatibilities and lack of uniformity with the Z39.50 protocol (searches, formats).
• Too heterogeneous search mechanisms.
• Open Archives protocol limitations (connection mechanism).
• Heterogeneity in language and character encoding.
• Availability, server's response and connection times.
• Incompatibility among different browsers.

3. Some OAI-PMH Detected Disadvantages

OAI-PMH as we know it is a protocol used to share information through the Web. It is built on HTTP by means of commands sent with the GET or POST methods of that protocol. These commands are sent to a server, which processes the request and sends the results back. Finally, the results are presented in XML and, generally, under the norm.

OAI-PMH offers a series of mechanisms to obtain the resources that a source of information has, and the results can be obtained within a range of dates (which can be open, that is to say, without specifying the initial date, the final date, or both). It is important to mention that OAI-PMH does not have a sophisticated search mechanism; that is, the results cannot be selected by any other type of criteria (author, title, subject, etc., or a combination of these). To summarize, up to this moment, the only way to choose the results from an OAI source is by means of the registration date.

Sometimes the results obtained from an OAI request are too many, so they are presented in pages that have a limited number of results (30, 50 or 100 are common values). Each page shows an identifier that corresponds to the following page of the results sequence; in this way, in a new request, that identifier is included to collect the subsequent data.

Without intending to focus this work entirely on how OAI-PMH works technically, the next sections present and document what are considered the protocol's disadvantages, or its lack of maturity.

I. Harvester's storage of the obtained results
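The request mechanism described in this section (selection by a date range that may be open on either side, page-by-page retrieval using the identifier returned with each page, and local storage of everything retrieved) can be sketched as follows. This is only a minimal illustration in Python, not the Java code of OA-Hermes, which the paper does not show. The function names and the `fetch` callable are hypothetical; the request arguments (`verb=ListRecords`, `metadataPrefix=oai_dc`, `from`, `until`, `resumptionToken`) and the XML namespaces are those defined by OAI-PMH 2.0 and Dublin Core.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_request(base_url, date_from=None, date_until=None, token=None):
    """Build a ListRecords URL; the date range may be open on either side."""
    if token:
        # A follow-up request carries only the verb and the page identifier.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if date_from:
            params["from"] = date_from
        if date_until:
            params["until"] = date_until
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (records, next_token) for one page of a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI_NS + "record"):
        identifier = rec.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
        titles = [e.text for e in rec.iter(DC_NS + "title")]
        records.append({"identifier": identifier, "titles": titles})
    token = root.findtext(f".//{OAI_NS}resumptionToken")
    # A missing or empty token means there are no further pages.
    return records, (token.strip() if token and token.strip() else None)


def harvest(base_url, fetch, date_from=None, date_until=None):
    """Collect every page into a local list (the harvester's local copy).

    `fetch` is a hypothetical callable that takes a URL and returns the
    response body; it is injected so the sketch can run without a network.
    """
    store, token = [], None
    while True:
        url = build_request(base_url, date_from, date_until, token)
        records, token = parse_page(fetch(url))
        store.extend(records)
        if token is None:
            return store
```

In a real harvester `fetch` could wrap an HTTP client, and `store` would be a database rather than an in-memory list. The sketch also shows the limitation discussed in this section: the only selection criterion the request can express is the date range, so any filtering by author, title or subject has to happen on the local copy.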

At this point it is necessary to remember that the systems that provide services with OAI do not provide enough elements for the exploitation of the given data. In order to exploit the information that an OAI source has, it is necessary to store it locally and then manage it as desired by means of software. For a small number of results this may not be a significant problem, but when the information source has a higher number of results, special resources are required to maintain and manipulate this information locally; that is, hardware and software are required for the storage of information and for database management respectively, which entails additional costs for those interested in operating this tool.

For example, in the case of Scielo, OA-Hermes extracted approximately 60,000 records, which required a storage space of 120 MB using a MySQL database.

II. Time allotted to information harvesting and updating

For the extraction of information from the OAI sources, it is necessary to consider the time allotted for the connection and information transmission, that is, the harvest time. There are two types of harvests: the initial harvest and the updating harvests. The initial one occurs in the first connection that is made to the information source; here, it is expected to extract most of the results that the source has. The updating harvests are used to keep the locally stored information up to date; these harvests can be scheduled at a certain interval of time so as not to greatly affect the performance of the service offered.

The time used for the harvests can be considerably long if the information sources have a high number of results. Beginning from ten thousand results, the harvest times can be measured in hours, and if the numbers are higher (millions), we could be talking about days.

For example, the harvest time required to retrieve the approximately 60,000 records from Scielo was four hours.

III. Information Redundancy

Among the initial objectives in the creation of OA-Hermes is the avoidance of information redundancy, and that is why the use of the resources provided by the sources for data exploitation is preferred. Nevertheless, there are important OAI sources that must be integrated and, since OAI-PMH does not provide the operations needed for taking advantage of the information, we are forced to harvest it, that is, to maintain a local copy.

The same situation (the creation of a local copy) will arise for all those who wish to use the information from an OAI source, which will cause problems like those mentioned previously: the increase in hardware and software requirements; and (for the OAI source users) loss of reliability in data that is not updated, in addition to decreased availability of the system while the harvests are being updated. Furthermore, the problem is repeated as many times as OAI sources are added to the application that might try to make use of them.

IV. Data Granularity

The results obtained by the OAI-PMH are presented with a series of Dublin Core

metadata that are simple and clear enough to be presented to the user. Nevertheless, the Dublin Core elements do not offer enough granularity when additional processes with the harvested information are required.

In OA-Hermes it is sometimes required to recover the full text of the results obtained from an OAI source. With that in mind, it could be necessary to consult databases that contain the reference to it, or to obtain a URL that leads directly to the full text. The information needed to resolve the text (for example the journal, volume and issue) can be in one or more Dublin Core labels, although sometimes the format is not standard, which adds an additional process to enable the extraction of the required information.

OAI-PMH Advantages.

OAI presents advantageous characteristics that without any doubt have contributed to its great success at the present time. These are:

1. The use of standard formats for data interchange.
2. The use and exploitation of XML for the treatment of the extracted information.
3. The use of URLs for the identification of resources, which allows taking advantage of HTTP, the most used and common protocol for information exchange on the Web.
4. The use of Dublin Core to provide a unified platform for the identification and use of metadata which, although not suitable enough for our aims, does simplify the semantic integration process of the information.
5. Most of the contents that use the protocol are offered under an open access modality.

Final considerations

From the OA-Hermes construction point of view, the purpose of the OAI-PMH to help facilitate the efficient dissemination of contents has not been fulfilled entirely, since it breaks the initial ideas of the OA-Hermes project.

During the construction of OA-Hermes, several problems were detected which those wishing to operate OAI resources coming from remote sources of information will probably have to face.

Another important aspect is to know when the OAI-PMH is really required or needed, because if its use is not analyzed, it is possible to duplicate, triplicate or quadruplicate the contents within the same institution, causing unwanted requirements of hardware, software and personnel that can become excessive maintenance expenses.

The series of problems related to some disadvantages in the OAI-PMH had to do with the immature techniques it uses, although it is recognized as a protocol that has come to stay.

It is also understood that several protocols regarded as standards until now have evolved over time, and some have vanished without leaving a trace.

We state that it is necessary to make extensions that incorporate a greater number of flexible and advanced search mechanisms in order to facilitate and

meet the current needs of the institutions using the OAI-PMH protocol. These considerations might improve the use and visibility of the contents, most of which are open access.

References:

Van de Sompel, Herbert; Lagoze, Carl (ed.) (2004). The Open Archives Initiative Protocol for Metadata Harvesting. http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm

Van de Sompel, Herbert; Lagoze, Carl; Nelson, Michael; Warner, Simeon (ed.) (2005). Implementation Guidelines for the Open Archives Initiative Protocol for Metadata Harvesting. http://www.openarchives.org/OAI/2.0/guidelines.htm

CUDI Reunión de primavera. Abril 2006. http://www.cudi.edu.mx/primavera_2006/programa.htm

CUDI Reunión de otoño. Octubre 2005. http://www.cudi.edu.mx/otono_2005/index.html

CUDI Reunión de primavera. Abril 2005. http://www.cudi.edu.mx/primavera_2005/index.html

Riley, Jenn (2005). OAI Best Practices. http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?DigitalTactileResource

Meta-Search Engines. 2005. http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html

Van de Sompel, Herbert (2003). The OAI and OAI-PMH: How did we get here, and where do we go from here? Presentation delivered at the 3rd Open Archives Forum Workshop, Berlin. http://eprints.rclis.org/archive/00001157/02/berl_desompel.pdf
