Federated Search as a tool in Distance Education

Dr Peter Noerr, CTO, MuseGlobal, Inc., USA

Abstract

Institutions operating distance learning programs face challenges in information and resource supply not faced by their ‘campus bound’ counterparts. Students are scattered both spatially and temporally, spread across a country and beyond and needing to study at all hours of the day. Access to basic materials online has become a staple requirement. Access to library collections and subscriptions is vitally important if students are to conduct research and gather information in an efficient environment, one which will stay with them in their later life. This paper describes federated search, discusses its current state, and surveys the information landscape in which libraries currently operate. It then describes the creation, implementation and use of federated search in three cases: one of the largest distance learning universities in the world, a consortium of universities spread across the whole of the US, and one of the country’s larger state universities. From these, commonalities and differences are highlighted. This paper is written by the supplier of the federated search system used by these organizations. It concentrates on the original design challenges and on how the system was set up and operates for the different institutions, including how access to digital material is managed for local and consortial resources. It describes how the challenges were met, and what lessons were learned.

Introduction: challenges

Libraries have attempted for hundreds of years to both conserve the learning of humanity and make it available for succeeding generations of scholars and researchers so they may stand on the shoulders of their forebears. The advent of the digital and internet ages has brought both mighty new tools and massive challenges to this undertaking.

Not least of the challenges is the very success of the information curation and dissemination profession in making information almost instantly available. This has led in large part to the increasing pace of development, nowhere more evident than in the information processing space itself. Thus not only do information specialists have to preserve and present the information of yesterday and today, but they have to do so in a manner which changes from year to year, and to an audience whose skills and techniques change from month to month.

In this frenetic environment there is no possibility of successfully disseminating information to those who need it using the methods of just a few decades ago. Even the institutions of education have changed, and will continue to change, at a pace that means libraries have to implement cutting edge technology to maintain their place in the information food chain. The easy availability of software and hardware resources, through decreasing costs and Open Source and commercial development, has meant that almost anyone who wants to can create a “personal” library the size of the great libraries of the world. This is happening at a commercial level as well as a personal one, and the major sources of academic literature – the authors, the universities, and the publishers – are all producing databases (with attached search engines and web sites) of their material. Thus we have an information explosion, but one taking place in a rather strange way.

The volume of material available via the Internet (most through the Web, but not all) is growing exponentially. But this is by no means a measure of the growth of information. Whilst there is an exponential growth in newly created information, and possibly even knowledge, much of the actual data growth stems from two things: the creation of personal information collections, and the creation of personal ephemeral information (Facebook, Myspace, LinkedIn, Twitter, etc. – interesting in their own way but, generally, not of academic importance).

The OAI movement (Open Archives Initiative – www.openarchives.org) is an example of the former at an institutional level – it allows any organization to harvest its own copies of the material held by other co-operating organizations. Much of this is metadata, but much of it is original full text. Thus the number of copies of a given article multiplies. At a certain level this is no bad thing. You will hear at this meeting a presentation about LOCKSS, which is a movement to secure the preservation of material by maintaining many copies. Local copies also allow local processing and special searching to be applied uniformly, and improve user response times and system robustness. But they do create yet another repository of commonly available material.

These various strands have led to the creation of large numbers of silos of information which need to be searched if complete coverage of a subject is to be maintained. The uniqueness of some of this information, and the ubiquity of the rest, presents its own challenge, as students and researchers do not wish, nor do they have the time, to be bombarded with the same articles from multiple sources. Removing duplicate material without losing information is a challenge in today’s world.

Because these repositories are individually created, they have unique characteristics, and this presents a significant challenge to users, who must learn each one’s intricacies before they can take full advantage of its search engine. So we have a large number of repositories of primary information, with a large variety of access methods and modes of use.

Probably the biggest success story to date from the Web is the rise of Google as a near universal source of information of first resort, particularly for younger generations. Google is the premier, but not the only, example of a web search engine, and therein lie its powers and problems. In many ways Web Search Engines (WSEs) aim to be what libraries aim to be – the place to go for all information and answers. Whilst libraries are, by the nature of their creation, multiple, WSEs attempt to be a single entity for all, and for all purposes. Unfortunately their reach is not as broad as they would have us believe. Even with recent initiatives at making scholarly material available through the WSEs, it is estimated that they cover no more than 20% of the information available on the Internet. In addition the omnivorous nature of the WSEs and their “one size fits all” approach to indexing means that both search performance (what is retrieved) and quality of information (how authoritative it is) are severely compromised. Yet they are extensively used and quoted authoritatively just “because they are there”.

Meeting the challenges, and adding more

In the academic world the first attempts at making information (in the form of journal articles) more widely and easily available by utilizing computing technology came in the 1970s, when the “online” databases became available for searching. These were accessible through individual networks, usually through dial-up connections, held information from a single publisher (or even a single journal or organization), and each had its own search engine (with its own search language and record structure). These were the forerunners of many of our current search engines in ideas, if not direct lineage. And they gave rise to the short-lived profession of “online search specialist” – those who knew the details of each database and could handle the different search languages to perform searches for end users.

With the advent of the internet and more universal high-speed access, the Aggregators came to the fore. These organizations took the material from the original databases and other publishers and aggregated it behind a second generation search engine. They reduced the number of search languages and increased the coverage of any one aggregator, especially as they tended to specialize by subject. These services have survived and grown to this day and provide access to hundreds of databases and thousands of journals through increasingly sophisticated search engines – all of which are different. These are services we all know, such as EBSCO, ProQuest, Cengage/Gale, Dialog, etc.

Open Source software – search engines like Lucene and its offspring, Content Management Systems (CMSs) such as Drupal and Joomla, full-blown library management systems such as Evergreen and Greenstone, and repository software such as DSpace and Kete – means that the barrier to entry for individuals and organizations wishing to host their own systems is now almost non-existent. The major cost is in time (and it is a major cost), and that is often cheap in the academic environment. So now we add these blossoming home-built systems to the mix. Built with a decidedly mixed level of talent, and with a very uncertain commitment to continuity in many cases, they make a great deal more, generally high quality, information available, but well outside any form of organized access.

Until there is a truly universal single point of access to all these necessary forms of information, the user is back to finding those sources with the “best” information for their needs (there being no central registry or catalogue which does not itself suffer from the same problems), and searching them sequentially. In truth, technology has left us not much better off in the hunt for information. We now have access to more of humanity’s stock (though by no means all), but finding it is still a laborious process.

Federated search

Since information sources keep proliferating as well as consolidating, we need some method to accommodate this behavior and harness the abilities of the computer to make our searching task less onerous. Enter Federated Search.

Federated Search (FS) is also known as broadcast search, global search, or MetaSearch. Recently Metasearch has come to mean the consumer oriented version which utilizes the popular Web Search Engines as its sources, while Federated Search has become the preferred name for the academic and business version, searching databases, authenticated search engines, and even business applications.

Federated Search does:

 Provide a single access point to search multiple sources of information
 Use a single search language
 Map multiple record formats to a common display
 Merge results from all sources
 Allow uniform manipulation of results (sorting, filtering, etc.)
 Link the retrieved metadata records to full text articles
 Handle authentication of users to information sources
 Provide an ever extending set of searchable sources

Federated Search does not:

 Provide a perfect solution
 Cover all information in the world
 Access information which is not available through a network
 Retrieve and manipulate complete sets of results
 Provide extensive pre-coordination of records for specialist searching
 Provide search capabilities above those of the individual source search engines

These lists reflect the common features and failings of existing Federated Search systems. Individual systems have additional capabilities beyond those mentioned.
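To make the mapping and merging items above concrete, the following sketch shows one simple way a FS system might normalize records to a common format and merge duplicates. It is an illustration only, in Python; the field names and matching rules are invented for the example, not any vendor’s actual schema:

    from dataclasses import dataclass

    @dataclass
    class Record:
        # A deliberately minimal common record format for the sketch.
        title: str
        source: str
        doi: str = ""

    def normalize_key(rec):
        """Build a match key: prefer a DOI; fall back to a normalized title."""
        if rec.doi:
            return "doi:" + rec.doi.lower().strip()
        return "title:" + "".join(c for c in rec.title.lower() if c.isalnum())

    def merge_results(batches):
        """Merge result batches from several sources, keeping one record per
        match key while remembering every source that supplied it."""
        merged, seen = {}, {}
        for batch in batches:
            for rec in batch:
                key = normalize_key(rec)
                merged.setdefault(key, rec)
                seen.setdefault(key, set()).add(rec.source)
        return [(rec, sorted(seen[k])) for k, rec in merged.items()]

Note that keeping the list of contributing sources alongside each merged record is what allows duplicates to be removed without losing information, the challenge raised in the introduction.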

Federated Search provides simple-to-use, single-point access to a selection of information sources, and returns the results in one combined list – what’s not to like? It turns out there are three main touted shortcomings of federated search. All are real to some extent, yet all are colored by the particular experience of the person listing them. They are simply: coverage, dumbing down, and response time.

Coverage

Federated Search systems all provide a common core of functionality and handle connections to individual sources via a specialized piece of software called a connector (or agent, or translator, etc.). This software and its configuration data handle everything necessary to connect to, query, and obtain results from a particular source. Like interchangeable drill bits, they are plugged in when needed by the FS system. But connectors do not exist for all possible sources.
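To illustrate the “interchangeable drill bit” idea, a connector can be sketched as a small interface which the FS core programs against, with everything source-specific hidden behind it. The method names below are invented for illustration; real products define their own contracts:

    from abc import ABC, abstractmethod

    class Connector(ABC):
        """Everything source-specific lives behind this small interface;
        the FS core is written only against it."""

        @abstractmethod
        def translate_query(self, query):
            """Map the common query language to this source's syntax."""

        @abstractmethod
        def search(self, native_query):
            """Run the query, handling authentication, sessions and paging."""

        @abstractmethod
        def parse_records(self, raw_results):
            """Map the source's record format to the common record format."""

One concrete class per source (or per standard protocol, such as Z39.50 or SRU) then plugs into the same core, which is why standards matter so much in what follows.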

Federated Search is a natural poster child for the use of standards. If new systems are built to existing standards, then a FS system which has a connector for that standard can access them merely by configuration. This ability for expansion and re-use is what makes FS a viable proposition. The commercial vendors (such as MuseGlobal, Deep Web, etc. – see the end of the paper for a list) spend a lot of time building, or providing toolkits for customers to build, connectors to new sources. In some form or other these then become available for all user organizations to incorporate in their systems if they need them. Some of these are built ‘on spec’ to cover a particular area of interest; others are built on demand. Some sources are popularly seen as “impossible to federate”. Usually this is because of a combination of authentication requirements, particular search functionality, volatility of data, lack of a standard interface, and difficulty of physical access. All of these are challenges and there are, undoubtedly, many sources which are truly impossible to federate, but the tools of the modern FS system can handle most of them – to some degree. Federated Search is not a perfect science, but it does provide a substantial advantage when used properly.

All the commercial vendors claim “libraries” of hundreds or thousands of connectors to libraries, aggregators, and publishers. (MuseGlobal has a library of over 6,000 connectors.) All of them probably cover more than 67% of academic sources by use volume.

Dumbing down

An initial analysis suggests that a FS system cannot provide search capabilities beyond those of the least capable search engine being queried. This is possibly the most common criticism leveled at federated search systems – basically, that it is limited to simple keyword searching. Leaving aside the fact that more than 90% of academic searching is just simple keyword searching, and with fewer than 3 keywords at that, this is simply not the case. At least it is not inherent in Federated Searching as an activity. Individual systems offer more or fewer capabilities.

It is perfectly possible, and is done by some systems, to map the user’s query in all its complexity to the search languages of those search engines which are equally capable. This means that searches through a FS system to just a single source can be made with the same precision as through the native interface.

However a FS system is best used to search multiple sources at once, so many such mappings are needed. This means that returned results are not actually all answers to the same question – it depends on what each source is capable of handling. Recognising this is one of the requirements of being a serious FS researcher. As always the source of the information must be considered, but now the capabilities of the source must be considered as well. Some systems allow the user to designate whether they want a strict or lenient mapping, so they can increase precision or recall as their current need dictates.
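As a hypothetical illustration of strict versus lenient mapping: a fielded Boolean query can be passed through intact to a capable engine, degraded to bare keywords for a simple one, or rejected in strict mode so that precision is never lost silently. The capability flag below is invented for the example:

    def map_query(query_terms, source_caps, strict=False):
        """query_terms: list of (field, term) pairs from the user's search.
        source_caps: the set of features the source's engine supports."""
        if "fielded_boolean" in source_caps:
            # Capable engine: pass the query through with full precision.
            return " AND ".join(f"{f}:{t}" for f, t in query_terms)
        if strict:
            # Strict mapping: refuse to degrade the query silently.
            raise ValueError("source cannot honor the full query")
        # Lenient mapping: fall back to bare keywords, trading precision
        # for recall.
        return " ".join(t for _, t in query_terms)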

Response time

Federated Search systems work by translating the user’s query and sending it to the source systems, then gathering the results, processing them, and displaying them. There is obviously more work to be done than in a search against the native source. Thus response times must be longer.

Not necessarily, and not necessarily in any meaningful way. Most modern FS systems add an overhead of about 5% to the response time compared to the native source. So even with a source which responds in 10 seconds, the FS response time is 10.5 seconds – a hardly noticeable increase, and one which is well hidden by the vagaries of the network and even of the source search engine’s own response time.

Even allowing this overhead, it is an interesting observation that, on a modern multi-threaded FS system, the response time (as measured by the time to first result display) is likely to decrease as the number of sources is increased. This is because of the inherent variation of the response times of the sources, and the asynchronous nature of results processing. Many FS systems will return results as they are retrieved, not waiting for all of them before the user sees the first. Thus the addition of a new, faster, source may well bring down the response time. Of course, conversely, the response time for the last record depends on the slowest source – but so it does when searching that source.
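This behavior falls naturally out of an asynchronous design, as in the following sketch, where results are surfaced as each source completes rather than after the slowest one finishes. The simulated sources stand in for real connector calls:

    import asyncio
    import random

    async def search_source(name):
        # Simulate a source whose latency varies, as real sources do.
        await asyncio.sleep(random.uniform(0.2, 3.0))
        return name, [f"{name} record {i}" for i in range(3)]

    async def federated_search(sources):
        tasks = [asyncio.create_task(search_source(s)) for s in sources]
        # Surface each batch as soon as its source responds: the first
        # results appear long before the slowest source finishes.
        for finished in asyncio.as_completed(tasks):
            name, records = await finished
            print(name, records)

    asyncio.run(federated_search(["EBSCO", "ProQuest", "LocalRepository"]))

Adding a fast source to the list shortens the time to the first printed batch, while the time to the last batch is still governed by the slowest source, exactly as described above.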

That is an overview of Federated Search and its characteristics; let us now look at how one system was implemented in three different environments, and see what the results and lessons are.

The players

MuseGlobal sells its systems through partners. One of the customers involved here is represented by one partner (Swets); another, at first one of the handful of direct customers, is now considering moving to a second partner’s offering (SirsiDynix).

LIRN (Learning Information Resource Network – www.lirn.net) is a consortium of 176 private colleges and universities, based in Largo, Florida. Between them its members support a diverse learning population from all parts of the US. LIRN provides group purchasing and centralized management of electronic information resources for its members, and operates a validation gateway to vendor services. The LIRN virtual library provides students with millions of resources to support their academic studies.

Indiana University (IU – www.indiana.edu, also www.iub.edu for the main campus) is the state university and is headquartered in Bloomington, Indiana. It has campuses in eight other towns and cities across the state. Academics and research at Indiana University are supported by one of the finest university library systems in the nation. The 25 campus and special libraries on the Bloomington campus, 5 libraries at IUPUI, and modern libraries on each of the regional campuses provide students with an outstanding collection of resources, enhanced by trustworthy Web-based information. IU has approximately 5,000 faculty and about 100,000 students, both undergraduate and postgraduate.

The University of Phoenix (UoP – www.phoenix.edu) is headquartered in Phoenix, Arizona. It is a for-profit organization and is the largest distance learning organization in the US and one of the largest in the world. It has over 200 campuses in hundreds of cities throughout the US and the world. It has approximately 300,000 students, both undergraduate and postgraduate, and 20,000 faculty.

The different challenges

These organizations are vastly different, yet all focus on education. One is a consortium of many small colleges and universities, one is a large state university, and the third is a massive distance education organization.

Despite their differences, all have to provide their students and faculty with easy and accurate access to the information they need for their studies, and to manage that access efficiently and economically.

Information Sources

As well as traditional bricks and mortar libraries with physical collections, all three subscribe to multiple online information sources, have their own repositories of specialist material (IU has numerous local repositories/collections at its different locations about famous local people, places or events), and want to provide access to general information through a mediated environment.

For all three organizations most sources are authenticated, commercially provided journal databases, though local sources are considered important, if few in number, and freely available web sources are included in subject-based requirements as appropriate.

UoP does not have physical libraries like those on the campuses of IU or the different institutions of LIRN. As a distance-learning-only educator it provides much of the physical material for students by post or, increasingly, by download or online interaction with learning systems. Much of its specialized material is the courses developed by its faculty. This material is either delivered or made available as required, and is not generally searchable via the library and information systems. So UoP has essentially no link to an automated library system, as IU and many of the LIRN institutions have.

Management

LIRN is a management organization created specifically to handle the day-to-day provision of information resources to the consortium members. IU is a single institution with a library management board within the university management structure. It has active faculty and student liaison committees for everything from collection development policy to UI design of systems. Phoenix is a corporation and thus is responsible to its board and shareholders. Its decisions can be more managerial, though it is still mainly driven by committees, which have the added difficulty of being geographically dispersed.

Access

LIRN subscribes as a consortium, and member institutions also have individual subscriptions. Various members share costs, so access to services which charge by volume of use must be recorded on an individual organization basis, even though the subscriptions are managed centrally. This places a special requirement on the federated search authentication systems, and on gathering usage statistics.
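In rough outline, the requirement is that every search forwarded to a volume-charged source be attributable to the authenticated user’s member institution. A minimal sketch of such accounting, with invented field names, might look like this:

    import csv
    import datetime

    def log_usage(logfile, institution_id, user_id, source, query):
        """Append one accounting row per forwarded search, keyed by the
        member institution, so volume-based charges can be apportioned."""
        with open(logfile, "a", newline="") as fh:
            csv.writer(fh).writerow([
                datetime.datetime.utcnow().isoformat(),
                institution_id,  # derived from authentication, not user input
                user_id,
                source,
                query,
            ])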

IU subscribes as a single body and also as individual libraries. Its internal databases are selectively available where the particular library or portal finds them appropriate. As a group of physical libraries it also has subscriptions to physical material, and tying that material into the online accessible catalogue needs access to the local ILS as well as journal subscription lists and link resolvers.

Phoenix subscribes as a single organization and provides widespread uniform access to its students across virtually all its resources with the majority of them being made available by default, though some campuses do have special access to material only they need.

The common solution

Across these three organizations there was no common driving factor to adopt Federated Search and no particular set of requirements they all decided they “must have”.

The driving reasons (in no particular order) were:

 Single point of access
 Ease of use
 Aggregation of results, but separation of resources
 Usage accountability
 Usage statistics
 Accurate, complete coverage of resources

Obviously some of these relate to providing an improved service to the users, and others to the housekeeping issues of ensuring cost-effective use of resources, and cost reduction and assignment.

Through different channels they all selected a system based on the Muse software as the best choice for their needs. Different criteria were used for the final choice, and the details remain with the institutions, but some are common and important:

 Complete coverage of subscription sources
 Ability to add local sources
 Customised Applications for different campuses (look & feel, functionality, sources)
 Authentication and authorization for users
 Authentication of sources
 Precise (customized for each source) searching
 Ease of deployment, locally or as a hosted solution
 Web based administration
 Source maintenance service from MuseGlobal
 Reasonable cost

These break down into six main areas which were considered important: coverage, functionality, customizability, maintainability, deployment, and cost.

Functionality and customizability are features of the particular software that MuseGlobal has developed and may be considered intrinsic to the system. Other federated search systems have different approaches and provide more or less of both, but Muse had the right mix. Taking these two together with coverage reveals an important aspect of the choices and subsequent implementations. In all cases MuseGlobal spent a lot of time with the various committees, groups, and individuals from the organizations, deciding on what the systems should look like, how they should perform, and what sources they should connect to. The nature of the software and, in particular for coverage, Muse’s Source Factory and ‘build on demand’ approach to connectors meant that virtually all the diverse needs of the institutions could be met.

Coverage and maintainability combine to emphasise a particular characteristic of federated search software. It is not a system which can be installed and then allowed to run forever. Because the sources change on an irregular, but fairly frequent, basis, the connectors constantly need to be “fixed” to work with the newly changed source. Thus maintenance becomes a much larger part of the cost equation than with other, equivalently complex, software systems. The federated search vendors are divided over the issue of connectors. Some provide a toolkit and allow administrators to build their own; others (including MuseGlobal) build and maintain connectors for all customers as part of an ongoing service. Building your own may get you a connector quicker, though that is doubtful since the staff of the service vendors are experienced at the task, but local institution staff then have to be assigned and trained to maintain the connectors. And this is a continuous job. Across a number of institutions it is also a wasteful activity, with many solutions to a single problem. Muse handles this by having the user organizations share the reporting of problems (shared also with an automated “source checker” which runs every day checking thousands of sources), and has a centralized reporting system, fixing service, and automated deployment of fixes, all run through its Source Factory. Other vendors adopt alternative, generally less automated, approaches.
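The automated source checker can be imagined as a daily job which replays a known-good query through every connector and flags any source whose response no longer parses. The sketch below assumes the illustrative Connector interface from earlier and says nothing about the Source Factory’s actual internals:

    def check_sources(connectors, probe_query="history"):
        """connectors: mapping of source name to a Connector (as sketched
        earlier). Returns the sources that failed the daily probe."""
        broken = []
        for name, conn in connectors.items():
            try:
                raw = conn.search(conn.translate_query(probe_query))
                if not conn.parse_records(raw):
                    broken.append((name, "no records parsed"))
            except Exception as exc:  # connection, auth, or parse failure
                broken.append((name, str(exc)))
        return broken  # feeds the central reporting and fixing workflow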

Maintainability, deployment and cost come together in the ability of the software to fit the computing needs of the different organizations. Two have the software deployed locally (one – IU – had a copy deployed at each campus, though they have since moved to central hosting in Bloomington for all campuses), and one (UoP) utilizes a completely hosted system running on a server cluster at one of MuseGlobal’s server locations. These aspects also led to an early adoption by virtually all federated search system vendors of a subscription model. Some more traditional IT departments prefer a ‘license and maintenance’ model, but most libraries are familiar with the increasingly popular subscription model and appreciate that the work on these systems is ongoing, that it allows them to spread the financial load over a number of years, and that they retain the ability to terminate the subscription when circumstances change.

The installed systems have all been in place for at least two years, have undergone regular functionality and performance improvements, and are now fully integrated into the workflow, processing, and management of the different organizations. Over the period sources have changed and been fixed, sources have been dropped and new ones subscribed to, and the systems have carried on. User interfaces have undergone many changes and are currently indistinguishable from the other components of the student and faculty workflow. Ajax is a technology federated search used before it had a name, and it contributes to the smoothness of information delivery and user interaction. Continued enhancement of functionality means the systems are currently as “web 2.0” as the different organizations want.

The proof of these puddings is in the eating: all three organizations have renewed their original contracts, and use keeps growing. The presentation will show screen shots of the existing systems and may include a live demonstration. As one indication of the growth of traffic, the presentation includes a chart of the network bandwidth used by one of the two clustered servers for the UoP system over the last 15 months.

Lessons Learned

Not surprisingly, the majority of the lessons learned are not technical ones. Preparation and organization on the customer’s side are paramount, as is recognition that federated search is not a “plug and forget” system – it needs constant care, and some of that care must come from the users and administrators. They have to make their needs and desires known in a coordinated fashion (just as for any evolving software system), and they have to report when things are not working.

Particularly important points to consider are:

 User demographics; who are the users and what are their skill levels?
 How will the system be accessed by users; on campus or off?
 Customisation and personalization; is it available?
 Source links: how are these organized and displayed to users?
 How are sources grouped and displayed to users?
 How does the grouping of sources relate to groups within the user population?
 Do departments get their own Application or share a common one?
 Are different services from the system made available to all users?
 How are users authenticated and how does that map to the FS system?

User and administrator feedback is vitally important, specifically in areas such as:

 User Interface design
 Sources available
 Source selection methods
 Integration with other tools
 Processing and manipulation of results

Sustainability of the federated search system needs work, and it is a task to which the institution has to commit resources and time. For consortia the cost savings can be compelling, but while source licenses may cost less, the administration and management workload increases substantially. Federated search removes much of the technical workload, but management and contractual issues still remain and must be handled in a manner satisfactory to the consortium members and the content vendors. Typical issues are:

 Are sources licensed by subscription or ‘pay-per-view’?
 Are sources bought by the consortium or by members?
 Are both models allowed?
 How do members join and leave a subscription?
 How are private subscriptions managed?

At a detailed system level it is necessary for the organization to decide how the federated search system will be used, covering both its internal operation and its interaction with other systems. Questions include:

 What major functions are required?
 How are they presented to different user groups?
 What sources are needed, and how are they segmented for users?
 How does the FS system fit into different data flows for different user groups?
 How is security managed?

At the computing level another set of considerations needs to be addressed before embarking on a FS installation, and then re-visited over time to plan for the future.

 Where will the system be installed?
 What traffic will it generate?
 How many users will there be?
 How will they grow over time?
 What sort of redundancy is needed?
 What existing infrastructure can be used?
 What will need upgrading?
 Security of access

These are important points to consider in the different areas; answers to all of them are needed to make a success of a federated search installation. Issues such as publicizing the service, user feedback, beta testing, special-circumstance modifications, changes and adaptations, new collections, and technology changes have not been mentioned in any detail, but they too await resolution.

The good thing about federated search systems is that they are eminently flexible, make life easier for users and administrators, and provide an educational experience that is improved because of their existence.