
Digital Library and Web Technology

Course Material

LIBRARY AND INFORMATION SCIENCE

By Dr. R. Balasubramani

Assistant Professor

Department of Library and Information Science

BHARATHIDASAN UNIVERSITY

TIRUCHIRAPALLI – 620024

TAMIL NADU


Course 1.4: Digital Libraries and Web Technology

Unit 1: Digital Libraries: Definitions, Fundamentals and Theoretical Aspects; Characteristics of Digital Libraries and Nature of Collections; Major Digital Library Initiatives; Open Archives Initiative (OAI) and Similar Developments.

1.1 Introduction to Digital Libraries

Traditionally, "digital" refers to the use of numbers; the term comes from digit, or finger. Today, digital is synonymous with computers. The 0s and 1s of digital data mean more than just on and off: they mean perfect copying. When information, music, voice and video are turned into binary digital form, they can be electronically manipulated, preserved and regenerated perfectly, at high speed. The millionth copy of a computer file is exactly the same as the original.

Digital libraries are complex information systems. Their design, development, management and use require the application of scientific, technological, methodological, economic, legal and other innovations. Digital library technologies are developing rapidly and are still evolving.

A digital library is a collection of information that is stored and accessed electronically. The information stored in the library should have a topic common to all the data. For example, a digital library can be designed for computer graphics, operating systems, or networks. These separate libraries can be combined under one common interface that deals with computers, but it is essential that the information contained within each library remain separate.

1.2 Definitions

Digital libraries have been defined in many ways. In essence, a digital library is a managed collection of information, with associated services, where the information is stored in digital formats and is accessible over a network.

- According to Michael Lesk, "Digital Libraries are organized collections of digital information. They combine the structuring and gathering of information, which libraries and archives have always done, with the digital representation that computers have made possible".
- According to Ian Witten and David Bainbridge, "Digital libraries are a focussed collection of digital objects including text, video and audio, along with methods for access and retrieval, and for selection, organisation, and maintenance of the collection".
- According to Larsen, "Digital Library is a global virtual library - the library of thousands of networked electronic libraries".
- According to the Digital Library Federation, "Digital Libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities".

1.3 Characteristics of Digital Libraries

- Network accessibility
- User-friendly interface
- Advanced search and retrieval
- Support for multimedia content
- Access from anywhere: home, school, library, while travelling, etc.
- Access to very large collections, including primary and secondary information
- Availability over long periods of time
- Greater opportunities for publishing
- Saving of the space required for physical documents
- Protection of rare books that are deteriorating rapidly through overuse and poor storage conditions

1.4 Digital Library Collection

Building and maintaining a digital library collection involves:

- Metadata describing the data, with the facility to provide links to other related databases;
- Open communication protocols (client-server);
- Information access tools such as browsers, display tools and search engines;
- Electronic publishing tools;
- Digital storage devices;
- Scanning and conversion technologies;
- Privacy and security;
- Interoperability across libraries;
- Advanced information retrieval tools, such as indexing, routing and filtering of information; and
- Intellectual property rights (IPR).

1.5 Software for Building and Distributing Digital Library Collections

1.5.1 Greenstone

Greenstone is a suite of software for building and distributing digital library collections. It is not a digital library in itself but a tool for building digital libraries. It provides a new way of organizing information and publishing it on the Internet in the form of a fully-searchable, metadata-driven digital library. Greenstone is produced by the New Zealand Digital Library Project at the Department of Computer Science, University of Waikato, and distributed in cooperation with UNESCO and the Human Info NGO based in Antwerp, Belgium. Greenstone is supported by UNESCO and distributed freely through the WWW and on CD-ROM as part of its Information for All programme. The aim of the software is to empower users, particularly in universities, libraries and other public service institutions, to build their own digital libraries. Developers can choose among various interfaces to build collections and customize the end-user interface.
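To give a feel for how a collection is customized, the sketch below shows roughly what a collection configuration file (collect.cfg) looks like in Greenstone 2. The directives shown are a plausible minimal subset only, and all values (names, email addresses, collection titles) are hypothetical; the exact syntax depends on the Greenstone version in use:

    # collect.cfg -- hypothetical configuration for a small Greenstone 2 collection
    creator        librarian@example.org
    maintainer     librarian@example.org
    public         true

    # Indexes: full text plus a title index built from metadata
    indexes        document:text document:Title
    defaultindex   document:text

    # Plug-ins that ingest the source documents during the build
    plugin GAPlug
    plugin HTMLPlug
    plugin PDFPlug
    plugin TEXTPlug

    # Browsing classifiers: alphabetical lists by title and creator
    classify AZList -metadata Title
    classify List -metadata Creator

    collectionmeta collectionname "Demonstration Collection"
    collectionmeta collectionextra "A small collection built for illustration."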

Greenstone has no special hardware requirements. It runs on all versions of Windows, on Unix/Linux and on Mac OS X, and it is very easy to install. For the default Windows installation absolutely no configuration is necessary, and end users routinely install Greenstone on their personal laptops or workstations. Institutional users run it on their main web server, where it interoperates with standard web server software (e.g. Apache).

Greenstone 3 is a complete redesign and reimplementation of the original Greenstone digital library software (Greenstone 2). It retains all the advantages of Greenstone 2, such as multilingual support, multiplatform support and high configurability. It incorporates all the features of the existing system and is backward compatible: it can build and run existing collections without modification. Written in Java, it is structured as a network of independent modules that communicate using XML; it therefore runs in a distributed fashion and can be spread across different servers as necessary. This modular design increases the flexibility and extensibility of Greenstone.

1.5.1.1 Features of Greenstone

Greenstone is a suite of software for building, cataloguing and distributing digital library collections. Greenstone makes collections highly accessible because searches can be carried out at various levels: that of a document, a chapter, or a section. The following are the features supported by Greenstone:

- Multiplatform availability: Greenstone is available for various operating system platforms, including Windows (any version), Linux, Sun Solaris and Mac OS X.
- Access and distribution: a Greenstone collection can be served on the Web, or it can be exported to a CD-ROM and accessed from the CD-ROM or a local hard disc without the need for Internet connectivity.
- Collection building: Greenstone supports a variety of interfaces for collection building.
- Powerful indexing: Greenstone can build indexes from full-text documents and also from the metadata associated with these documents. It supports the creation of indexes for various metadata fields, either automatically extracted or manually assigned.
- Powerful search and browse: since Greenstone does full-text and field-based indexing, users are provided with a variety of search options.
- Document formats: Greenstone supports many different file formats, which are converted into a standard XML-based internal format for indexing using "plug-ins". Plug-ins are used to ingest documents. For textual documents, there are plug-ins for PDF, PostScript, Word, RTF, HTML, plain text, LaTeX, ZIP archives, Excel, PPT, email (various formats), source code documents and FoxPro databases. For multimedia documents, there are plug-ins for images (any format, including GIF, JIF, JPEG and TIFF), MP3 audio, Ogg Vorbis audio, and a generic plug-in that can be configured for audio formats, MPEG, MIDI, etc.
- Extensibility and configurability: new plug-ins can be developed for file formats not supported by Greenstone. Greenstone allows a collection to be configured to customize the interface, indexing, browsing and presentation features according to local requirements.
- Multilingual support: one of Greenstone's unique strengths is its multilingual nature. Unicode, an encoding standard for representing a large number of scripts, is used throughout Greenstone. This facilitates building, searching and browsing documents in any Unicode-compliant language. The Librarian interface and the full Greenstone documentation (which is extensive) are available in English, French, Spanish, Russian and Kazakh. The reader's interface is available in many languages, including Arabic, Armenian, Bengali, Catalan, Croatian, Czech, Chinese (both simplified and traditional), Dutch, English, Farsi, Finnish, French, Galician, Georgian, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, and a number of Indian regional languages.
- Interoperability: Greenstone is highly interoperable using contemporary standards. It incorporates a server that can serve any collection over the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), and Greenstone can harvest documents over OAI-PMH and include them in a collection. Any collection can be exported to METS (in the Greenstone METS Profile, approved by the METS Editorial Board and published at http://www.loc.gov/standards/mets/mets-profiles.html), and Greenstone can ingest documents in METS form. Any collection can be exported to DSpace ready for DSpace's batch import program, and any DSpace collection can be imported into Greenstone.
- Interfaces: Greenstone has two separate interactive interfaces, the Reader interface and the Librarian interface. End users access the digital library through the Reader interface, which operates within a web browser. The Librarian interface is a Java-based graphical user interface (also available as an applet) that makes it easy to gather material for a collection (downloading it from the web where necessary), enrich it by adding metadata, design the searching and browsing facilities that the collection will offer the user, and build and serve the collection.
- Metadata formats: users can define metadata interactively within the Librarian interface. The following metadata sets are predefined: Dublin Core (qualified and unqualified), RFC 1807, NZGLS (New Zealand Government Locator Service) and AGLS (Australian Government Locator Service). New metadata sets can be defined using Greenstone's Metadata Set Editor. Plug-ins are used to ingest externally-prepared metadata in different forms, and plug-ins exist for XML, MARC, CDS/ISIS, ProCite, BibTeX, Refer, OAI, DSpace and METS.
- Exporting collections to CD-ROM: a key feature supported by Greenstone is exporting digital library collections to CD-ROM. The exported collection can be accessed directly from the CD-ROM without requiring any installation, or it can be installed on the hard disc. This feature, however, is available for the Windows platform only. Utilization of this functionality requires the CD-ROM export module, which is automatically installed when installing Greenstone.

1.5.2 DSpace

The first version of DSpace was released in November 2002, following a joint effort by developers from MIT and HP Labs in Cambridge, Massachusetts. DSpace is a groundbreaking digital repository system that captures, stores, indexes, preserves and redistributes an organization's research data. DSpace is an open source software package that provides the tools for the management of digital assets, and it is commonly used as the basis for an institutional repository. It supports a wide variety of data, including books, theses, 3D digital scans of objects, photographs, film, video, research data sets and other forms of content. The data are arranged as community collections of items, which bundle bitstreams together. DSpace is also intended as a platform for digital preservation activities. Since its release in 2002, as a product of the HP-MIT Alliance, it has been installed and is in production at over 240 institutions around the globe, from large universities to small higher education colleges, cultural organizations and research centers. DSpace is a software of choice for academic, non-profit and commercial organizations building open digital repositories. It is free, easy to install "out of the box", and completely customizable to fit the needs of any organization. DSpace preserves and enables easy and open access to all types of digital content, including text, images, moving images, MPEGs and data sets.

1.5.2.1 What Can DSpace Do?

Jointly developed by MIT Libraries and Hewlett-Packard Labs, the DSpace software platform serves a variety of digital archiving needs. Research institutions worldwide use DSpace for:

- Institutional Repositories (IRs)
- Learning Object Repositories (LORs)
- eTheses
- Electronic Records Management (ERM)
- Digital Preservation
- Publishing, and more

1.5.2.2 What Kinds of Content Does DSpace Accept?

DSpace accepts all forms of digital materials, including text, image, video and audio files. Possible content includes the following:

- Articles and preprints
- Technical reports
- Working papers
- Conference papers
- E-theses
- Datasets: statistical, geospatial, MATLAB, etc.
- Images: visual, scientific, etc.
- Audio files
- Video files
- Learning objects
- Reformatted digital library collections

1.5.2.3 Benefits of Using DSpace

- Getting your research results out quickly, to a worldwide audience
- Reaching a worldwide audience through exposure to web search engines
- Storing reusable teaching materials that you can use with course management systems
- Archiving and distributing material you would currently put on your personal website
- Storing examples of students' projects (with the students' permission)
- Showcasing students' theses (again with permission)
- Keeping track of your own publications/bibliography
- Having a persistent network identifier for your work that never changes or breaks
- No more page charges for images: you can point to your images' persistent identifiers in your published articles.
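As a concrete, hedged illustration of how items can be prepared for DSpace's batch import program (mentioned in the Greenstone interoperability feature above): the importer reads a "Simple Archive Format" directory in which each item holds a dublin_core.xml metadata file and a contents manifest listing its bitstreams. The Python sketch below builds one such item; all file names and metadata values are illustrative only:

    # make_saf_item.py -- build one DSpace Simple Archive Format item (a sketch)
    import os
    import xml.etree.ElementTree as ET

    item_dir = os.path.join("archive", "item_001")   # hypothetical layout
    os.makedirs(item_dir, exist_ok=True)

    # dublin_core.xml: one <dcvalue> element per Dublin Core metadata value
    root = ET.Element("dublin_core")
    for element, qualifier, value in [
        ("title", "none", "A Sample Working Paper"),
        ("contributor", "author", "Author, Example"),
        ("date", "issued", "2010-01-01"),
        ("type", "none", "Working Paper"),
    ]:
        dcvalue = ET.SubElement(root, "dcvalue",
                                element=element, qualifier=qualifier)
        dcvalue.text = value
    ET.ElementTree(root).write(os.path.join(item_dir, "dublin_core.xml"),
                               encoding="utf-8", xml_declaration=True)

    # contents: one bitstream file name per line
    with open(os.path.join(item_dir, "contents"), "w") as manifest:
        manifest.write("paper.pdf\n")

The resulting archive directory can then be loaded with DSpace's command-line import tool; the exact invocation and flags vary between DSpace versions, so the batch import documentation for the installed version should be consulted.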

1.5.2.4 Building an Institutional Repository with DSpace

Each DSpace implementation is unique. While the technology is fairly easy to install and set up, designing and building an institutional repository service with DSpace requires planning upfront, before you build the technology platform and launch the service.

1.5.2.5 Planning and Implementing an Institutional Repository

To help plan and build a DSpace implementation, planning tools and content are offered for each stage of a DSpace project:

- Defining your DSpace service offering
- Implementing and operating DSpace
- Creating a service support infrastructure
- Marketing the DSpace service
- Building communities and collections

1.5.3 EPrints

EPrints is an open source software package for building open access repositories that are compliant with the Open Archives Initiative Protocol for Metadata Harvesting. It shares many of the features commonly seen in document management systems, but it is primarily used for institutional repositories and scientific journals. EPrints has been developed at the University of Southampton School of Electronics and Computer Science and is released under a GPL licence. EPrints is a flexible platform for building high-quality, high-value repositories, recognised as one of the easiest and fastest ways to set up repositories of research literature, scientific data, student theses, project reports, multimedia artefacts, teaching materials, scholarly collections, digitised records, exhibitions and performances. The EPrints software is not to be confused with "eprints" (or "e-prints"), which are preprints (before peer review) and postprints (after peer review) of research journal articles: "eprints" = preprints + postprints.

1.5.3.1 History of EPrints

EPrints was created in 2000 as a direct outcome of the 1999 Santa Fe meeting that launched what eventually became the OAI-PMH. The software was enthusiastically received, became the first and one of the most widely used free, open access, institutional repository software packages, and has since inspired the development of other software that fulfils a similar purpose. Version 3 of the software was officially released on 24 January 2007 at the Open Repositories 2007 Conference, and was described by its developers as "a major leap forward in functionality, giving even more control and flexibility to repository managers, depositors, researchers and technical administrators."

1.5.3.2 Technology

EPrints is a web and command-line application based on the LAMP architecture (but written in Perl rather than PHP). It has been run successfully under Linux, Solaris and Mac OS X. A version for Microsoft Windows is being developed, but it will be released under a non-GPL licence. Version 3 of the software introduced a (Perl-based) plugin architecture for importing and exporting data, converting objects (for indexing) and user interface widgets. Configuring an EPrints repository involves modifying configuration files written in Perl or XML; web-based configuration tools are in development. The appearance of a repository is controlled by HTML templates, stylesheets and inline images. While EPrints ships with an English translation, it has been translated into other languages through (redistributable) language-specific XML phrase files. Existing translations include Bulgarian, French, German, Hungarian, Italian, Japanese, Russian, Spanish and Ukrainian.

EPrints 3 offers benefits to several groups of users:

- Repository managers: lower the barrier for depositors while improving metadata quality and the value of the collection.
- Depositors: time-saving deposits, imports of data from other repositories and services, and autocomplete-as-you-type for fast data entry.
- Researchers: optimised for Google Scholar; works with bibliography managers, desktop applications and new Web 2.0 services; RSS feeds and email alerts keep you up to date.
- Developers: a tightly-managed, quality-controlled code framework and a flexible plugin architecture for developing extensions.
- Webmasters: easy integration of reports, bibliographic listings, author CVs and RSS feeds into the corporate web presence.
- Institutions: a high-specification repository platform for high-visibility, high-quality institutional open access collections, conforming with research funder open access mandates.

1.6 Elements of Digital Libraries

- Audio-visual equipment: LCD TV, tape recorder with headphones, DVD player, telephone, etc.
- Computers: server, PC with multimedia, UPS, etc.
- Networks: LAN, MAN, WAN, Internet, etc.
- Printers: laser printer, dot matrix printer, barcode printer, digital graphic printer, etc.
- Scanners: HP ScanJet, flatbed, sheet-feeder, drum scanner, slide scanner, microfilming scanner, digital camera, barcode scanner, etc.

1.7 Functions of Digital Libraries

The important functions of digital libraries are:

- To enable searches that are not practical manually
- To preserve unique collections through digitisation
- To manage content from multiple locations
- To enable greater access to information
- To provide the means to enrich the teaching and learning environment, and
- To protect the owners of information

1.8 Merits of Digital Libraries

- No physical boundaries
- Saving of time
- No time limits on access
- Cost effectiveness
- Multiple simultaneous access
- Planned approach
- Conservation and preservation
- Retrieval of information
- Storage space
- Computer networking

1.9 Demerits of Digital Libraries

- Copyright issues
- Computer viruses
- Access speed
- Skills requirements
- High initial cost
- Health hazards
- Bandwidth
- Efficiency and preservation
- Manipulation of data
- Environment

1.10 Digital Library Initiatives in Developed Countries

The growth and popularity of the Internet and the WWW resulted in two major digital library initiatives being taken in the mid-1990s in the USA. The first was the joint initiative of the National Science Foundation (NSF), the Department of Defense Advanced Research Projects Agency (DARPA) and the National Aeronautics and Space Administration (NASA), in 1994, to fund six digital library development projects, for a period of four years, among six academic institutions. The second was the signing of the National Digital Library Federation Agreement in May 1995, led by the Library of Congress and 14 other research libraries. Its purpose was to "bring together - from across the nation and beyond - digitized materials that will be made available to students, scholars and citizens everywhere". The NSF/DARPA/NASA digital library initiative targeted the following research areas:

- Capturing, categorizing and organizing information
- Page, speech, video and graphics understanding
- Indexing, hypermedia linking and knowledge representation
- Searching, browsing, filtering, summarizing and visualization
- Theories, models, intelligent processing and learning
- Simulation, navigation, metaphors and optimization
- Networking protocols and standards; using networked information
- Security, knowbots, compression, modelling and IPR

1.10.1 Technical Developments

Several technical developments have contributed to the interest in developing digital libraries:

- The declining cost of digital storage (decreasing by about 30% per year)
- In 1987, CD-ROM storage was already cheaper than storing books in libraries
- Storing information on computers is cheaper than storing the physical equivalents
- PC display screens are more pleasant to use now, and are improving further
- More and more people are reading directly from computers (e.g. e-books)
- High-speed networks are becoming widespread (the Internet and intranets; DSL, DTH, ISDN, etc.)
- Computers have become portable: laptops can be connected to the Internet from almost anywhere, so digital libraries can be accessed from almost anywhere, and laptops have become more powerful and cheaper
- Sophisticated digitization technologies (capture devices such as scanners, and conversion software)
- Increasing availability of digital library software (commercial, and open source or free software)

The six major projects undertaken in the Digital Libraries Initiative (1994-1998) were the following:

- University of California at Berkeley: Environmental Planning Library and Geographic Information Systems
- University of California at Santa Barbara: Alexandria Digital Library project; spatially referenced map information
- Carnegie Mellon University: Informedia Digital Video Library
- University of Illinois at Urbana-Champaign: Federating Repositories of Scientific Literature
- University of Michigan Digital Library Project (UMDL): Intelligent Agents for Information Location
- Stanford University: Interoperation Mechanisms Among Heterogeneous Services

The National Science Foundation (NSF) announced the Digital Libraries Initiative Phase II in February 1998. In addition to the NSF, the Library of Congress, the Defense Advanced Research Projects Agency (DARPA), the National Library of Medicine (NLM), the National Aeronautics and Space Administration (NASA) and the National Endowment for the Humanities (NEH) are sponsoring the second phase of the Digital Libraries Initiative. Where the intent in the first phase was to concentrate on the investigation and development of underlying technologies, the second phase (1999-2004) is intended to look more at applying those technologies, and others, in real-life library situations. The second phase aims at an intensive study of the architecture and usability issues of digital libraries, including vigorous research on:

- Human-centred digital library architecture
- Content- and collections-based digital library architecture, and
- Systems-centred digital library architecture

Further, test beds were developed at the following universities:

- University of Arizona (High-Performance Digital Library Classification Systems: From Information Retrieval to Knowledge Management)
- University of California, Berkeley (Re-inventing Scholarly Information Dissemination and Use)
- University of California, Santa Barbara (Alexandria Digital Earth Prototype)
- Carnegie Mellon University (Million Book Project)
- Columbia University (A Patient Care Digital Library: Personalized Search and Summarization over Multimedia Information)
- Harvard University (Operational Social Science Digital Data Library)
- University of South Carolina (A Software and Data Library for Experiments, Simulations, and Archiving)
- Stanford University (Stanford Digital Library Technologies Project)
- Tufts University (The Perseus Digital Library Project)
- Open Archives Initiative (supported by the Digital Library Federation, the Coalition for Networked Information, and the National Science Foundation)
- Gutenberg Project
- NCSTRL (Virginia Tech and Old Dominion University)

This phase will try to link the application of digital libraries especially to facilitating teaching and learning processes. The domains of concentration will be science, mathematics, engineering and technology.

The types of proposals of interest would be: practical digital library applications for SMET (Science, Mathematics, Engineering and Technology) education, technical studies of digital library capabilities, and general policy studies. Among the major outcomes of the second phase are developments such as the Open Archives and distributed architecture (4) and the theorizing of digital library models (5).

1.11 Digital Library Initiatives in India

The situation in India regarding digital libraries is peculiar. Generally, the use of Information Technology (IT) and Information and Communication Technology (ICT) in libraries in India is concentrated in universities, the Indian Institutes of Technology (IITs), the Indian Institutes of Management (IIMs), the Indian Institute of Science (IISc), research institutes and some special libraries. Some government agencies, as well as public-sector institutions, are also engaged in the digitization of libraries. But the initiatives taken by the Government of India in this direction indicate that the potential of ICTs for developing digital libraries has not been fully realized. While one government body provides support for one particular aspect, another focuses elsewhere, without any coordinated effort by a nodal agency.

The concept of digital libraries in the developed countries started during the 1970s, but in India it began in the mid-1990s, with the advent of IT on a large scale and the support extended by the central government. The advent of the Internet acted as a catalyst for digital library initiatives. The concept was recognized in India during the Fifteenth Annual Convention and Conference on Digital Libraries, organized by the Society for Information Science at Bangalore from 18 to 20 January 1996. A few libraries had made attempts in this direction earlier, but only sporadic and partial attempts have been made towards digital library initiatives. Simplistic approaches have been taken in libraries, such as acquiring a few databases on CD-ROM, subscribing to a few e-journals, or scanning a few documents or creating Adobe Acrobat files and installing these on an intranet. This scenario is changing, albeit at a snail's pace, and it has to gain momentum to survive in a competitive world. During the past five years, India has seen several digital library initiatives at the institutional, organizational and national levels. Some of them are quite successful, while others are making significant progress. Some of the major digital library initiatives in India are listed below:

- Indian Institute of Science, NCSI (http://vidya-mapak.ncsi.iisc.ernet.in/cgi-bin/library)
- Indian Institute of Management Kozhikode (http://intranet.iimk.ac.in/cgi-bin/library)
- Search Digital Library (SDL) at DRTC, Bangalore (https://drtc.isibang.ac.in/index.jsp)
- Nalanda Digital Library, National Institute of Technology (NIT) Calicut (http://www.nalanda.nitc.ac.in)
- Vidyanidhi Project (http://www.vidyanidhi.org.in)
- Million Book Universal Digital Library Project, Carnegie Mellon-IISc-ERNET (http://www.dli.ernet.in)
- Indira Gandhi National Centre for the Arts (IGNCA Digital Library) (http://ignca.nic.in)
- INDEST, Ministry of HRD, GOI (http://paniit.iitd.ac.in/indest)
- National Tuberculosis Institute (NTI), Bangalore (http://ntiindia.kar.nic.in/)
- Rajiv Gandhi University of Health Sciences, Karnataka (RGUHS) (http://www.rguhs.ac.in/dl/index.html)
- Traditional Knowledge Digital Library (TKDL) (http://203.200.90.6/tkdl/langdefault/common/home.asp)
- Indian School of Business (http://www.isb.edu/lrc/index.html)
- Indian Institute of Technology, Kharagpur (http://www.library.iitkgp.ernet.in/usr/elib/digital.htm)
- Indian Institute of Technology, Mumbai (http://www.library.iitb.ac.in/~mnj/gsdl/cgi-bin/library)
- IIITM-K, Trivandrum (http://www.iiitmk.ac.in/iiitmk/digitallibrary.htm)
- National Chemical Laboratory (NCL, CSIR): Digital Repository (http://dspace.ncl.res.in)
- University of Hyderabad (http://202.41.85.234:8000/cgi-bin/gw_42_6/chameleon)

The digital library initiatives in India may be categorized as follows:

1.11.1 Initiatives at the Government Level

Both the Union Government and the state governments of India have taken considerable initiatives towards the development of digital libraries. "Support of Government of India towards Digital Libraries initiatives - policy issues: the Long-Term National IT Policy" (National Task Force on IT and Software Development, 2003) shows the commitment of the Government of India to providing information to users in digital form. The responsibility of envisioning, developing and sustaining functional hybrid and virtual library and information systems and services rests with the library and information profession.

1.11.2 Initiatives at National-Level Institutions

1.11.2.1 Parliament Library

A digital library has been set up in the computer centre to cater to the needs of Members of Parliament and the officers and staff of the Lok Sabha Secretariat. A large number of index-based databases of information generated within Parliament, which cater to the instant reference needs of members, officers and research and reference personnel, were initially developed by the computer centre. The data stored and now available in the PARLIS databases for online retrieval relate to:

- Parliamentary questions (full texts of questions and answers since February 2000; indexes from 1985 to 2000 are also available)
- Parliamentary proceedings other than questions (full text of floor versions since the winter session of 1993; indexes from 1985 to 1993 are also available)
- Government and private members' bills from 1985 onwards (indexes only)
- Directions, decisions and observations from the Chair, from 1952 onwards
- President's rule in the states and union territories, from 1951
- Members of the Council of Ministers, from 1947 onwards
- Obituary references made in the Houses since the provisional Parliament
- Library management functions such as acquisition, processing, and the issue and return of books have also been computerized, using the software package "LIBSYS"; a web-based library catalogue can also be accessed through the Internet
- Documentation service (from 1989 onwards): important articles published in books, reports, periodicals and newspapers are indexed and annotated, and can be accessed through the Internet

1.11.2.2 Initiatives at Academic Institutions of National Importance

1.11.2.2.1 IIT, New Delhi

The commitment to digital initiatives and the emphasis upon web-based digitised collections at the Central Library, IIT Delhi, commenced in 1998 with the installation of a fibre-optics-based campus LAN connected to a 2 Mbps VSNL radio link, enabling faster Internet access for the academic community of the Institute. The availability of the high-speed Internet connection has led to the launching of a number of sponsored and unsponsored projects for developing network-based digitized collections at the Central Library. The IITs are fortunate to receive generous project grants from government bodies such as the AICTE (All India Council for Technical Education) and the Ministry of Human Resource Development (MHRD) to develop their digital libraries. A number of online coursewares have been developed; the digitization of old volumes of journals at IIT Delhi is just one example of the projects supported by the government.

1.11.2.2.2 Indian Institute of Science

A project proposal for NSF support under the Indo-US Science and Technology Collaboration initiative has been made by IISc. The IISc, Bangalore, would act as a nodal agency to coordinate amongst various academic institutions and governmental agencies on the Indian side; Carnegie Mellon University would play the same role on the US side. The aim of the project is to digitize around a million books within three years. This joint initiative is planned to capitalize synergistically on the availability of state-of-the-art hardware and software in the US for digitizing, storing and accessing information, and on the quality personnel available in India. It would act as a forerunner for many such initiatives with other countries, particularly China and Korea, and would culminate in the grandiose vision of digitizing all formal knowledge and making it available in a location- and time-independent way for the benefit of mankind. In order to take a million books to the Web, it is estimated that around 1,000 person-years would be needed. If the project were carried out in a developed country such as the United States of America, it would cost at least around US$40 million, besides the cost of the hardware, space and energy.

1.11.2.2.3 National Institute of Technology, Calicut

Nalanda, the digital library initiated in 1999 at the National Institute of Technology, Calicut, is one of the largest digital libraries in the country. Nalanda serves members of the campus in meeting their academic and research needs by providing timely and up-to-date information, with value-added services, in all areas of science, engineering and technology. Apart from the digital library reading room, members can access Nalanda from anywhere on the campus.

1.11.2.2.4 NISCAIR (formerly INSDOC)

NISCAIR is slowly shifting to electronic libraries, which will eventually lead to the establishment of digital libraries. With decreasing shelf space and ever-growing collections in libraries, NISCAIR has been advocating the conversion of automated libraries into electronic libraries. NISCAIR has access to international databases; information is obtained through online searching of over 1,500 international databases. Skilled personnel at NISCAIR perform searches for research scientists and the corporate sector, who use these databases for the latest R&D, commercial and market information. The National Science Library of NISCAIR has an Electronic Library Division with a rich collection of more than 5,000 foreign journals, conference proceedings, etc., and a large number of databases on CD-ROM. NISCAIR is the nodal agency for developing a consortium of CSIR laboratories for accessing e-journals; this activity ranges from the creation to the monitoring of access facilities for scientific periodicals published by leading international institutions.

1.11.2.2.5 National Tuberculosis Institute, Bangalore

On 28 October 2003, the National Tuberculosis Institute, Bangalore, under the initiative and with the support of the Health InterNetwork Project, India-TB, launched a digital library. This digital library comprised CDs on tuberculosis (TB), made available as ready-reference tools for programme workers at the district and primary health centre levels. The CDs on TB contain relevant Revised National Tuberculosis Control Programme (RNTCP) documents and scientific literature on the programme, treatment, drug resistance and control aspects of TB.

1.11.3 Digitization of Art and Culture

The Centre for Development of Advanced Computing (C-DAC) has undertaken a digital library of art masterpieces. This is the first initiative of its kind in Asia, and it will digitize 200 rare paintings of Rabindranath Tagore and Amrita Shergill from the National Gallery of Modern Art (NGMA). A digital library (DLAS), developed by the Digital Library Group, will be created to make the art accessible to a global audience via the World Wide Web. The infrastructure to host this digital library is located at C-DAC, Bangalore. C-DAC and Hewlett-Packard launched the joint initiative "When Art Meets Technology" for the digital preservation, restoration and dissemination of art from the NGMA at Bangalore on 4 February 2003.

1.11.3.1 Indira Gandhi National Centre for the Arts (IGNCA) Kalasampada

IGNCA has taken up the Kalasampada Digital Library - Resource for Indian Cultural Heritage (DL-RICH) project, which is sponsored by MCIT. This project aims to use multimedia computer technology to develop a software package that integrates a variety of cultural information and helps users interact with and explore subjects available as image, audio, text, graphics, animation and video on a computer, in a non-linear mode, at the click of a mouse. Kalasampada, a unique project of its kind, will enable students, scholars, artists and the research and scientific community to access and view the materials. These materials include several hundred thousand manuscripts, over a hundred thousand slides, thousands of rare books, photographs, audio and video, along with highly researched publications of IGNCA, all accessible from a single window. The system aims to be a digital repository of content and information with a user-friendly interface. The knowledge base created will help scholars explore and visualize the information stored in multiple layers. This will provide a new dimension to the study of Indian art and culture in an integrated way, while giving due importance to each medium.

1.11.4 Initiatives within Society-Level Organizations

1.11.4.1 Mobile Digital Library (Dware Gyan Sampada)

This is a product from C-DAC Noida. The mission of the project is an Internet-enabled mobile digital library brought to common citizens with the purpose of spreading literacy. C-DAC Noida (Department of IT, MCIT) contributes to bringing digitized books to the doorsteps of common citizens. It makes use of a mobile van with a satellite connection for connectivity to the Internet. The van is fitted with a printer, scorer, cutter and binding machine for providing bound books to the end user. Different places, such as schools in villages and other remote areas, will be covered under this programme to promote literacy and demonstrate the use of technology for the masses. The schedule of visits of the mobile digital library is made available on its website. Books formatted for book printing may be selected from the website by language, author and title. There are about 350 books in Hindi and English available for download through this website. The site is bilingual (English and Hindi).

1.11.5 Indian Digital Library Policy

The National Task Force on IT and Software Development (2003) has given some valuable recommendations for the development of digital libraries in the country. These recommendations are covered in the report under the IT Action Plan (Part III) for content creation and the content industry. The salient features of the recommendations are listed below:

- A pilot project on digital library development, based on indigenous software, will be initiated. The project will be time-bound and implemented at a suitable existing library to serve as a model. The software so developed can be distributed to other organizations to accelerate the development of digital libraries in the country.
- India is known for its rich and diverse cultural heritage, and it also possesses a vast wealth of traditional knowledge. These are mostly in Indian languages and should be promoted and preserved for posterity. The government will therefore take initiatives, through appropriate projects, to create electronic images of information on Indian arts and culture for wider dissemination and research.
- It will be mandatory for all universities and deemed universities in the country to host every dissertation or thesis submitted for a research degree on a designated website.
- National, regional and other public libraries will be required to develop databases of their holdings, to be hosted on a designated website for free access by users.
- An effective copyright protection system is a prerequisite for the development of creative works in electronic media; the Indian copyright law should therefore be strengthened in this direction. Further, there is a need for global harmonization of copyright laws. The conclusions of the Trade-Related Aspects of Intellectual Property Rights (TRIPS) agreement and the two World Intellectual Property Organization (WIPO) treaties will be adopted for such harmonization.

1.12 Open Archives Initiative (OAI)

Open access scientific outputs are scattered across many disciplinary archives, institutional e-print archives, institutional repositories and open access journals, which makes it difficult for scholars to locate all the works they need on a particular subject. The Open Archives Initiative is an international movement that addresses this problem. It aims to develop and promote the use of a standard protocol, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), designed for better sharing and retrieval of e-prints residing in distributed archives, and it promotes interoperability standards that facilitate the efficient dissemination of content. The Open Archives Initiative is an attempt to build a "low-barrier interoperability framework" for archives containing digital content: it allows service providers to harvest metadata from data providers.

1.12.1 Examples of OAI Services

1.12.1.1 OAIster

OAIster currently provides access to 20,928,590 records from 1,112 contributors. OAIster is a union catalog of digital resources; it provides access to these digital resources by "harvesting" their descriptive metadata (records) using OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting).


1.12.1.2 Directory of Open Access Journals

This service covers free, full-text, quality-controlled scientific and scholarly journals in all subjects and languages. There are now 4,108 journals in the directory. Currently 1,483 journals are searchable at the article level, and 273,952 articles are included in the DOAJ service.

1.12.1.3 CASSIR@Indian Institute of Science

This service is part of the ongoing project "Development of OAI-Based Institutional Research Repository Services in India", sponsored by the Department of Scientific & Industrial Research, Ministry of Science & Technology, Government of India. The project is being carried out at the National Centre for Science Information (NCSI), Indian Institute of Science (IISc), Bangalore. The service harvests metadata, as per the OAI-PMH protocol, from registered open access repositories in India, and it provides web-based search and browse functionality over the harvested metadata.

1.12.1.4 Indian Academy of Sciences

The Academy, founded in 1934, aims at promoting the progress and upholding the cause of science, in both its pure and applied branches. Major activities include the publication of scientific journals and special volumes, organizing meetings of the Fellowship and discussions on important topics, recognizing scientific talent, improving science education, and taking up other issues of concern to the scientific community. The Academy's journals are open access, and their full text is available as PDF files on each journal's website:

- Current Science
- Journal of Chemical Sciences
- Proceedings - Mathematical Sciences
- Journal of Earth System Science
- Sadhana (proceedings in engineering sciences)
- Pramana - Journal of Physics
- Journal of Biosciences
- Bulletin of Materials Science
- Journal of Astrophysics and Astronomy
- Journal of Genetics
- Resonance - Journal of Science Education


The Protocol for Metadata Harvesting, a tool developed through the Open Archives Initiative, facilitates interoperability between disparate and diverse collections of metadata through a relatively simple protocol based on common standards such as XML, HTTP and Dublin Core. The Open Archives Initiative world is divided into data providers (repositories), which traditionally make their metadata available through the protocol, and service providers (harvesters), which completely or selectively harvest metadata from data providers, again through the use of the protocol.

Figure 1: OAI Harvesting Tools

[Figure 1 shows a service provider (harvester) issuing OAI-PMH requests across the network to a data provider (repository); the repository responds with records, each carrying a datestamp, an identifier and set membership.]
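To make the harvesting flow in Figure 1 concrete, the following is a minimal, hedged Python sketch of a harvester issuing a ListRecords request. The repository URL is hypothetical, and a production harvester would also have to handle resumption tokens, error conditions and retries:

    # harvest.py -- a minimal OAI-PMH ListRecords request (illustrative sketch)
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    BASE_URL = "http://repository.example.org/oai"  # hypothetical data provider
    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    with urllib.request.urlopen(BASE_URL + "?" + urllib.parse.urlencode(params)) as response:
        tree = ET.parse(response)

    # Each <record> has a <header> (identifier, datestamp, sets) and,
    # here, unqualified Dublin Core metadata.
    for record in tree.iter(OAI + "record"):
        header = record.find(OAI + "header")
        print(header.findtext(OAI + "identifier"),
              header.findtext(OAI + "datestamp"),
              record.findtext(".//" + DC + "title"))

    # A complete harvester would keep issuing requests while the
    # response contains a non-empty <resumptionToken>.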

1.13 Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

OAI-PMH is a protocol developed by the Open Archives Initiative. It is used to harvest (or collect) the metadata descriptions of the records in an archive, so that services can be built using metadata from many archives. The protocol is usually just referred to as the OAI Protocol. OAI-PMH uses XML over HTTP; the current version is 2.0, updated in 2002.

The OAI-PMH has been widely adopted since its initial release in 2001. Initially developed as a means to federate access to diverse e-print archives through metadata harvesting, the protocol has demonstrated its potential usefulness to a broad range of communities. According to the experimental OAI Registry at the University of Illinois Library at Urbana-Champaign, there are currently over 300 active data providers using the production version (2.0) of the protocol, from a wide variety of domains and institution types.

1.13.1 Key Players in OAI-PMH

There are two classes of participants in the OAI-PMH framework:

1.13.1.1 Data providers: A data provider maintains one or more repositories (web servers) that support the OAI-PMH as a means of exposing metadata. These repositories process requests and respond to service providers with appropriate OAI-PMH responses. Data providers are the creators and keepers of the metadata for the objects in their repositories and archives of resources.

1.13.1.2 Service providers: A service provider issues OAI-PMH requests to data providers and uses the harvested metadata as a basis for building value-added services, such as a search interface or a peer-review system.

1.14 Digital Library Development Issues in India

There are a number of problems which digital library development teams in India face when they embark on digital library development, as well as during the progress phase. Some of the most prominent are the following:

1.14.1 Lack of Proper Information and Communication Technology (ICT) Infrastructure

Digital libraries demand cutting-edge IT and communication infrastructure, such as:

- High-end, powerful servers;
- Structured LANs with broadband intranet facilities, ideally optical-fibre-based gigabit networks;
- A sufficient number of workstations capable of providing online information services, computing and multimedia applications;
- Internet connectivity with sufficient bandwidth, capable of meeting the informational and computational requirements of the user community.

There are many more related facilities and services which are essential in an ideal digital library environment. It is observed that the ICT infrastructure in most institutions and organizations, barring exceptions, is not up to the level required to run advanced digital library services optimally.

1.14.2 Lack of Proper Planning and Integration of Information Resources

Presently, library acquisitions in India are either paper-based or electronic. In most libraries, paper-based documents outnumber electronic subscriptions and acquisitions, and some libraries also need retro-conversion and digitization of their holdings. The literature on related studies shows a severe lapse in libraries with regard to the proper planning of their information resources in ways conducive to developing digital libraries. Moreover, electronic resources reach libraries in a multiplicity of complex formats and with different access terms and conditions. These information resources are scattered and distributed across a wide variety of publication types and a vast number of publishers. There is a dire need for proper planning and a meticulously framed content integration model, implemented through standard digital library technologies.

1.14.3 Rigidity in Publishers' Policies and Data Formats

Having successfully installed and configured a digital library does not automatically entitle a library to populate it with all of its digital collection; the publishers' consent and copyright permissions must be obtained first. Digital library software usually accepts and processes all popular and standard digital formats such as HTML, Word, RTF, PPT and PDF.
Most publishers, however, release their materials in their own proprietary e-book reader formats, from which text extraction is almost impossible. A vast majority of scholarly content rests in journal literature, and due to copyright restrictions it can rarely, if ever, find its way into the local repositories of a digital library.

1.14.4 Lack of ICT Strategies and Policies

A vast majority of the libraries in India do not have laid-down policies on ICT planning and strategies to meet the challenges posed by the technology push, the information overload, and the demand pull from users.

1.14.5 Lack of Technical Skills

The human resources available in libraries need regular professional enrichment and rigorous training on the latest technologies at play in the new information environment. The training programmes currently imparted in India are not able to meet the demand, in terms of either quantity or quality.

1.14.6 Management Support

For the provision of world-class information systems, resources and services, libraries need wholehearted support from their respective managements. Institutional support in terms of proper funding, human resources and IT skills enrichment is a prerequisite for the development and maintenance of state-of-the-art digital library systems and services.

1.14.7 Copyright / IPR Issues

Issues of copyright, intellectual property and fair use are posing an unprecedented array of problems to libraries, and librarians are struggling to cope with all these related issues in the new digital information environment.


Unit II: Design and Organization of Digital Libraries: Architecture, Interoperability, Protocols and Standards; User Interface.

2.1 Design of a Digital Library

Designing a digital library is not a sudden, single act; it involves a sequence of processes, shown schematically below.

[Figure: Digital Library Design. The workflow runs from user requirements, identification/selection of documents and copyright clearance, through scanning/formatting and data conversion, to the identification of digital documents and the creation of metadata and links; the resulting database of the digital library, backed by a digital server and a print server, feeds the digital archive and is exposed to users through a purpose-built user interface.]

2.2 Information Architecture (IA)

Information Architecture was originally a term with a meaning closer to what is today called Information Design. The term "Information Architecture" was coined around 1975 by Richard Saul Wurman, an architect and AIGA member. The term was later appropriated by web design experts and applied to complex web sites, since Information Architecture is an important aspect of web user experience design. This appropriation has changed the original meaning into what is today considered to be Information Architecture.

Information Architecture is the art of expressing a model or concept of information used in activities that require explicit details of complex systems. Among these activities are library systems, content management systems, web development, user interactions, database development, programming, technical writing, enterprise architecture and critical system software design. Information architecture has somewhat different meanings in these different branches of IS or IT architecture. Most definitions share common qualities: a structural design of shared environments, methods of organizing and labelling websites, intranets and online communities, and ways of bringing the principles of design and architecture to the digital landscape.

2.2.1 Definitions

According to the R.I.P.O.S.E. technique (1989), information architecture is defined as "the conceptual structure and logical organization of the intelligence of a person or group of people (organizations)". In this case the term intelligence is used in the sense of "knowledge used to inform".

According to the Information Architecture Institute, Information Architecture is defined as "the structural design of shared information environments; an emerging community of practice focused on bringing principles of design and architecture to the digital landscape; the art and science of organizing and labeling web sites, intranets, online communities, and software to support findability and usability".

The term information architecture also describes a specialized skill set relating to the interpretation of information and the expression of distinctions between signs and systems of signs. In the context of information system design, information architecture refers to the analysis and design of the data stored by information systems, concentrating on entities, their attributes and their interrelationships. It refers to the modeling of data for an individual database and to the corporate data models an enterprise uses to coordinate the definition of data in several (perhaps scores or hundreds) of distinct databases. The "canonical data model" is applied in integration technologies as a definition for the specific data passed between the systems of an enterprise. At a higher level of abstraction it may also refer to the definition of data stores.


Four-Layered Digital Library Architecture:

- Services: user interfaces for users' research tasks
- Tools: search software that mediates access to digital information
- Digital Information Base: the digital repository (databases and files)
- Decoupled layer: interoperability protocols that give digital libraries flexibility, sustainability and interoperability

The topmost layer is the Services layer, which is responsible for providing information services to end users. It provides a user-friendly interface so that a user can select the required information at the click of a button. The second layer contains the software tools required for providing the user interface, querying the digital repository, making links to related information, and delivering the output in the desired formats. The third layer is the backbone of the digital library: it contains the digital information, in the form of metadata held in databases and digital objects such as PDF files. The bottommost layer is an important one which provides the tools and protocols for interoperability.

2.3 Interoperability

What is interoperability? It is the ability to store and retrieve material across diverse content collections administered independently. It creates interrelationships between information-service-related disciplines such as library management, archives management, document management and resource management. More precisely, interoperability refers to the ability of two or more systems to interact with one another and exchange data according to a prescribed method in order to achieve predictable results.
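To illustrate the decoupling that the four-layer model and this definition of interoperability call for, here is a small, hedged Python sketch; all class and method names are illustrative, not part of any real digital library package:

    # layers.py -- a toy four-layer digital library, for illustration only
    class Repository:
        """Digital information base: metadata records plus stored objects."""
        def __init__(self):
            self.records = {}  # identifier -> metadata dict

        def add(self, identifier, metadata):
            self.records[identifier] = metadata

    class SearchTool:
        """Tools layer: mediates access; knows nothing about the interface."""
        def __init__(self, repository):
            self.repository = repository

        def search(self, term):
            return [rid for rid, md in self.repository.records.items()
                    if term.lower() in md.get("title", "").lower()]

    class ServiceLayer:
        """Services layer: what the end user actually interacts with."""
        def __init__(self, tool):
            self.tool = tool

        def query(self, term):
            return self.tool.search(term) or ["No documents found"]

    # Because each layer depends only on the one below it, the repository
    # could be swapped for a remote, protocol-based source without
    # changing the service layer at all.
    repo = Repository()
    repo.add("doc1", {"title": "Digital Library Architecture"})
    print(ServiceLayer(SearchTool(repo)).query("architecture"))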

From Individual Archives to Interoperability

Interoperability among archives offers:

- Benefits to the scholars who use them
- A common entry point for a variety of resources
- A discovery tool
- Wider access
- Better citation
- Cross-archive value-added services
- Cross-archive search engines
- Current awareness services
- Linking systems

Digital Library Interoperability Protocol: How Does It Work?

[Figure: Digital Library Interoperability Protocol. On the client side, a client application invokes operations on a client transport module through the interoperability protocol interface; the module carries requests across the network boundary over a transport binding such as CORBA or HTTP to a server transport module, which passes them to a library service proxy wrapping the information source on the server side.]
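The client-side arrangement in the figure can be sketched in Python as follows; the class and method names are illustrative, and the CORBA binding is stubbed out, but the sketch shows how a client application can remain unaware of the transport used:

    # transport.py -- illustrative sketch of pluggable transport bindings
    from abc import ABC, abstractmethod
    import json
    import urllib.request

    class TransportModule(ABC):
        """The interoperability protocol interface seen by the client."""
        @abstractmethod
        def invoke(self, operation, **arguments):
            """Pack an operation, send it to the server, return the reply."""

    class HttpTransport(TransportModule):
        def __init__(self, base_url):
            self.base_url = base_url  # hypothetical library service proxy URL

        def invoke(self, operation, **arguments):
            payload = json.dumps({"op": operation, "args": arguments}).encode()
            request = urllib.request.Request(
                self.base_url, data=payload,
                headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(request) as reply:
                return json.load(reply)

    class CorbaTransport(TransportModule):
        def invoke(self, operation, **arguments):
            raise NotImplementedError("would delegate to a CORBA ORB here")

    # The client merely instantiates a transport module; switching bindings
    # means instantiating a different class -- nothing else changes.
    transport = HttpTransport("http://lsp.example.org/interop")
    # result = transport.invoke("search", query="digital libraries")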

The key goals of the interoperability protocol are to make it very easy to build clients, and to construct Library Service Proxies (LSPs) that wrap arbitrary sources. Implementers of this protocol need to produce only the client application and/or the library service proxy; everything else is taken care of by standard libraries. It is important to note that client applications and services need not be aware of the methods used for transporting operation requests and replies. The transmission of requests and replies may be accomplished through different 'transport bindings': CORBA, HTTP or, perhaps in the future, some other means. Client applications are unaffected by the transport binding. A client application merely creates a client transport module object in its local address space; this module implements the interoperability protocol interface. The client then invokes the interoperability protocol operations on this local module, and the module packs the operations for transport via one of the supported interoperability protocol transport bindings. Any given client transport module instance uses one particular transport binding; if a different binding is to be used, the client application simply instantiates a different class of transport module.

2.3.1 Different Types of Interoperability

2.3.1.1 Technical Interoperability

Technical interoperability is the most obvious and arguably the most straightforward aspect of interoperability. It is necessary to ensure that all the hardware and software components of networks and information systems can physically communicate and transfer information successfully. Even as the most straightforward aspect of maintaining interoperability, consideration of technical issues includes ensuring involvement in the continued development of communication, transport, storage and representation standards such as Z39.50, ISO-ILL, XML, etc. Work is required both to ensure that these individual standards move forward to the benefit of the community, and to facilitate, where possible, their convergence, such that systems may effectively make use of more than one standards-based approach.

2.3.1.2 Semantic Interoperability

2.3.1 Different types of Interoperability

2.3.1.1 Technical Interoperability

Technical interoperability is the most obvious and arguably the most straightforward aspect of interoperability. It is necessary to ensure that all the hardware and software components of networks and information systems can physically communicate and transfer information successfully. In many ways the most straightforward aspect of maintaining interoperability, consideration of technical issues includes ensuring involvement in the continued development of communication, transport, storage and representation standards such as Z39.50, ISO-ILL, XML, etc. Work is required both to ensure that these individual standards move forward to the benefit of the community and to facilitate, where possible, their convergence, such that systems may effectively make use of more than one standards-based approach.

2.3.1.2 Semantic Interoperability

Semantic interoperability refers to the meaning of information to its human users, as opposed to the simple physical transfer of data. Semantic interoperability presents a host of issues, all of which become more pronounced as individual resources, each internally constructed in its own semantically consistent fashion, are made available through 'gateways' or 'union catalogues'. Almost inevitably, these discrete resources use different terms to describe similar concepts (e.g. Author, Creator and Composer) or even use identical terms to mean very different things, introducing confusion and error into their use. Important aspects when creating semantic interoperability are:
• Careful consideration of who the potential users of information systems are and the language that is necessary to communicate with them.
• Agreement on the standard thesauri and lists of terms to be used in metadata systems.
• Consistent use of existing coding systems endorsed nationally and internationally.

2.3.1.3 Political/Human Interoperability

Apart from issues related to the manner in which information is described and disseminated, the decision to make resources more widely available has implications for the organisations concerned (who may see this as a loss of control or ownership), their staff (who may not possess the skills required to support more complex systems and a newly distributed user community), and the end users. Process change and extensive staff and user training are rarely considered when deciding whether or not to release a given resource, but they are crucial to ensuring the effective long-term use of any service. When interoperability crosses organisational boundaries, it has to be discussed and formal agreements signed. The participating organisations have to establish a discussion forum to facilitate the sharing of information on interoperability and related issues.

2.3.1.4 Inter-community Interoperability

As traditional boundaries between institutions and disciplines begin to blur, researchers increasingly require access to information from a wide range of sources, both within and outside their own subject areas. Complementing work in the library sector, important initiatives are also under way in related information-providing communities such as museums and archives. In many cases both goals and problems are similar, and there is much to be gained through adopting common solutions wherever feasible. Many factors are contributing to the blurring of boundaries between communities, and digital libraries that span different communities have to specify clearly each community's access rights and limitations. As a result, it is increasingly important that information systems can interoperate across these boundaries. In the area of resource discovery, one of the main mechanisms for facilitating this interoperability is the combination of metadata standards and harvesting systems, which provides for consolidated resource discovery across communities.

2.3.1.5 Legal Interoperability

While the Internet makes it easy to physically publish and access information, there are many important legal aspects which constrain and influence how information can and should be made available and used. These include laws related to copyright, content regulation, privacy, freedom of information, telecommunications regulation, e-commerce and trade practices.
2.3.1.6 International Interoperability

Each of the key issues identified above is magnified when considered on an international scale, where differences in technical approach, working practice and organisation exist around the world. The Internet makes it possible to offer resources to an international audience, but this brings with it a need to ensure that interoperability issues are addressed at an international as well as a national level. It introduces increased complexity into many of the above aspects, e.g. semantic interoperability across languages and across different legal jurisdictions and frameworks. It also highlights new aspects such as language differences and cross-cultural issues.

2.4 Protocols and Standards

2.4.1 Standards

Standards are a set of rules or specifications for the design or operation of computing devices. There are proprietary standards, which are those developed and promulgated by companies in the hope of assuring or increasing their market, and open standards, which are published and freely available on the Internet, where anyone can download them. Either type may become a de facto standard, a set of rules or specifications that comes into such widespread use in the marketplace that it becomes normative, or a de jure standard, a standard given the endorsement of an official standards body such as the International Organization for Standardization (ISO).

2.4.2 Protocol

A protocol is a set of rules for the exchange of information, such as those used for successful data transmission. When computers communicate with each other, there needs to be a common set of rules and instructions that each computer follows. A specific set of communication rules is called a protocol. Because of the many ways computers can communicate with each other, there are many different protocols -- too many for the average person to remember. Some examples of these different protocols include PPP, TCP/IP, SLIP, HTTP and FTP.

2.4.2.1 Typical Properties of Protocols

While protocols can vary greatly in purpose and sophistication, most specify one or more of the following properties:
• Detection of the underlying physical connection (wired or wireless), or of the existence of the other endpoint or node
• Handshaking
• Negotiation of various connection characteristics
• How to start and end a message
• How to format a message
• What to do with corrupted or improperly formatted messages
• How to detect unexpected loss of the connection, and what to do next
• Termination of the session and/or connection

2.4.2.2 Importance

The widespread use and expansion of communications protocols is both a prerequisite for the Internet and a major contributor to its power and success. The pair of Internet Protocol (IP) and Transmission Control Protocol (TCP) are the most important of these, and the term TCP/IP refers to a collection (or protocol suite) of the most used protocols. Most of the Internet's communication protocols are described in the RFC documents of the Internet Engineering Task Force (IETF). The protocols in human communication are separate rules about appearance, speaking, listening and understanding. All these rules, also called protocols of conversation, represent different layers of communication. They work together to help people communicate successfully. The need for protocols also applies to network devices. Computers have no innate way of learning protocols, so network engineers have written rules for communication that must be strictly followed for successful host-to-host communication.
These rules apply to different layers of sophistication, such as which physical connections to use, how hosts listen, how to interrupt, how to communicate, what language to use, and many others. These rules, or protocols, that work together to ensure successful communication are grouped into what is known as a protocol suite.
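As a toy illustration of such layering, the Python sketch below shows each layer of a miniature stack wrapping the data handed down from the layer above with its own header. The header strings are invented stand-ins, not real TCP/IP wire formats.

# Each layer adds its own header in front of the payload it receives.
def application_layer(message: str) -> bytes:
    return message.encode("utf-8")

def transport_layer(payload: bytes, src_port: int, dst_port: int) -> bytes:
    header = f"TCP {src_port}->{dst_port}|".encode()
    return header + payload

def internet_layer(segment: bytes, src_ip: str, dst_ip: str) -> bytes:
    header = f"IP {src_ip}->{dst_ip}|".encode()
    return header + segment

packet = internet_layer(
    transport_layer(application_layer("GET /index.html"), 51000, 80),
    "192.0.2.10", "198.51.100.7")
print(packet)  # shows each layer's header prepended in turn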

Object-oriented programming has extended the use of the term to include the programming protocols available for connections and communication between objects. Generally, only the simplest protocols are used alone. Most protocols, especially in the context of communications or networking, are layered together into protocol stacks, where the various tasks listed above are divided among different protocols in the stack. Whereas a protocol stack denotes a specific combination of protocols that work together, a reference model is a software architecture that lists each layer and the services each should offer. The classic seven-layer reference model is the OSI model, which is used for conceptualizing protocol stacks and peer entities. This reference model also provides an opportunity to teach more general software engineering concepts like information hiding, modularity and delegation of tasks. The model has endured in spite of the demise of many of its protocols (and protocol stacks) originally sanctioned by the ISO.

2.4.2.3 Common Protocols

• IP (Internet Protocol)
• UDP (User Datagram Protocol)
• TCP (Transmission Control Protocol)
• DHCP (Dynamic Host Configuration Protocol)
• HTTP (Hypertext Transfer Protocol)
• FTP (File Transfer Protocol)
• Telnet (Telnet Remote Protocol)
• SSH (Secure Shell Remote Protocol)
• POP3 (Post Office Protocol 3)
• SMTP (Simple Mail Transfer Protocol)
• IMAP (Internet Message Access Protocol)
• SOAP (Simple Object Access Protocol)
• PPP (Point-to-Point Protocol)

In information technology, a protocol (from the Greek protokollon, a leaf of paper glued to a manuscript volume, describing its contents) is the special set of rules that end points in a telecommunication connection use when they communicate. Protocols exist at several levels in a telecommunication connection. For example, there are protocols for data interchange at the hardware device level and protocols for data interchange at the application program level. In the standard model known as Open Systems Interconnection (OSI), there are one or more protocols at each layer of the telecommunication exchange that both ends of the exchange must recognize and observe. Protocols are often described in an industry or international standard.

2.4.2.3.1 TCP (Transmission Control Protocol)

TCP (Transmission Control Protocol) is a set of rules (a protocol) used along with the Internet Protocol (IP) to send data in the form of message units between computers over the Internet. TCP is known as a connection-oriented protocol, which means a connection is established and maintained until the message or messages to be exchanged by the application programs have been exchanged. TCP is responsible for ensuring that a message is divided into the packets that IP manages and for reassembling the packets back into the complete message at the other end. In the Open Systems Interconnection (OSI) communication model, TCP is in layer 4, the Transport Layer. For example, when an HTML file is sent to you from a Web server, the Transmission Control Protocol (TCP) program layer in that server divides the file into one or more packets, numbers the packets, and then forwards them individually to the IP program layer. Although each packet has the same destination IP address, it may be routed differently through the network. At the other end (the client program in your computer), TCP reassembles the individual packets and waits until they have all arrived before forwarding them to you as a single file.
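The reliable byte stream that TCP provides is visible even from a very small client. The sketch below uses Python's standard socket module; example.org and the hand-written HTTP request are placeholders for illustration only.

import socket

# Open a TCP connection; the connection handshake happens here.
with socket.create_connection(("example.org", 80)) as sock:
    request = b"HEAD / HTTP/1.1\r\nHost: example.org\r\nConnection: close\r\n\r\n"
    sock.sendall(request)            # TCP splits this into packets as needed
    reply = b""
    while chunk := sock.recv(4096):  # data arrives reassembled and in order
        reply += chunk
print(reply.decode("latin-1"))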
2.4.2.3.2 IP (Internet Protocol)

The Internet Protocol (IP) is the method or protocol by which data is sent from one computer to another on the Internet. Each computer (known as a host) on the Internet has at least one IP address that uniquely identifies it among all other computers on the Internet. When you send or receive data (for example, an e-mail note or a Web page), the message gets divided into little chunks called packets. Each of these packets contains both the sender's Internet address and the receiver's address. A packet is sent first to a gateway computer that understands a small part of the Internet. The gateway computer reads the destination address and forwards the packet to an adjacent gateway, which in turn reads the destination address, and so forth across the Internet, until one gateway recognizes the packet as belonging to a computer within its immediate neighbourhood or domain. That gateway then forwards the packet directly to the computer whose address is specified.

2.4.2.3.3 HTTP (Hypertext Transfer Protocol)

HTTP (Hypertext Transfer Protocol) is the set of rules for transferring files (text, graphic images, sound, video and other multimedia files) on the World Wide Web. As soon as a Web user opens a Web browser, the user is indirectly making use of HTTP. HTTP is an application protocol that runs on top of the TCP/IP suite of protocols (the foundation protocols for the Internet). HTTP concepts include (as the Hypertext part of the name implies) the idea that files can contain references to other files whose selection will elicit additional transfer requests. Any Web server machine contains, in addition to the Web page files it can serve, an HTTP daemon, a program that is designed to wait for HTTP requests and handle them when they arrive. Your Web browser is an HTTP client, sending requests to server machines. When the browser user enters file requests by either "opening" a Web file (typing in a Uniform Resource Locator or URL) or clicking on a hypertext link, the browser builds an HTTP request and sends it to the Internet Protocol address (IP address) indicated by the URL. The HTTP daemon in the destination server machine receives the request and sends back the requested file or files associated with the request. (A Web page often consists of more than one file.)

2.4.2.3.4 SMTP (Simple Mail Transfer Protocol)

SMTP is a TCP/IP protocol used in sending and receiving e-mail. However, since it is limited in its ability to queue messages at the receiving end, it is usually used with one of two other protocols, POP3 or IMAP, which let the user save messages in a server mailbox and download them periodically from the server. In other words, users typically use a program that uses SMTP for sending e-mail and either POP3 or IMAP for receiving e-mail. On Unix-based systems, sendmail is the most widely used SMTP server for e-mail; a commercial package, Sendmail, includes a POP3 server. Microsoft Exchange includes an SMTP server and can also be set up to include POP3 support. SMTP is usually implemented to operate over Internet port 25. An alternative to SMTP that is widely used in Europe is X.400. Many mail servers now support Extended Simple Mail Transfer Protocol (ESMTP), which allows multimedia files to be delivered as e-mail.
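A hedged sketch of the sending side, using Python's standard smtplib: the server name and addresses are placeholders, and running it would require a real SMTP server listening on port 25.

import smtplib
from email.message import EmailMessage

message = EmailMessage()
message["From"] = "sender@example.org"        # placeholder addresses
message["To"] = "librarian@example.org"
message["Subject"] = "Test"
message.set_content("SMTP delivers this; POP3 or IMAP retrieves it later.")

# Port 25 is SMTP's traditional port, as noted above.
with smtplib.SMTP("mail.example.org", 25) as smtp:
    smtp.send_message(message)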
In computing, the Post Office Protocol version 3 (POP3) is an application-layer Internet standard protocol used by local e-mail clients to retrieve e-mail from a remote server over a TCP/IP connection. POP3 and IMAP4 (Internet Message Access Protocol) are the two most prevalent Internet standard protocols for e-mail retrieval, and virtually all modern e-mail clients and servers support both.

2.4.2.3.5 Z39.50 Gateway

Z39.50 is a national and international (ISO 23950) standard defining a protocol for computer-to-computer information retrieval. Z39.50 makes it possible for a user in one system to search and retrieve information from other computer systems (that have also implemented Z39.50) without knowing the search syntax that is used by those other systems. Z39.50 was originally approved by the National Information Standards Organization (NISO) in 1988. The Z39.50 Maintenance Agency Page includes documentation and information related to the development and ongoing maintenance of the Z39.50 standard. Z39.50 Resources, maintained by Dan Brickley, Institute for Learning and Research Technology, and the Z39.50 Resource Page, maintained by NISO, provide hyperlinks to many Z39.50-related resources. Using a Z39.50 client, it is currently possible to search the Library of Congress bibliographic file. The information required to configure your Z39.50 client to search the LC server directly is provided in the LC Z39.50 Server Configuration Guidelines. It is also possible to access the LC server by using the appropriate search forms listed under "Search Library of Congress Catalog" above. This gateway makes use of an earlier version of the ISearch-CGI public domain software that was created by the Clearinghouse for Networked Information Discovery and Retrieval (CNIDR). It should be noted that many search and retrieval capabilities that are available in the Z39.50 protocol are not implemented in this gateway; the Initialization, Search and Retrieval facilities have been implemented.

Wireless Application Protocol (commonly referred to as WAP) is an open international standard for application-layer network communications in a wireless communication environment. Its main use is to enable access to the Internet (HTTP) from a mobile phone or PDA. A WAP browser provides all of the basic services of a computer-based web browser, but simplified to operate within the restrictions of a mobile phone, such as its smaller view screen. WAP sites are websites written in, or dynamically converted to, WML (Wireless Markup Language) and accessed via the WAP browser. Before the introduction of WAP, service providers had extremely limited opportunities to offer interactive data services. Interactive data applications are required to support now commonplace activities such as:
• Email by mobile phone
• Tracking of stock market prices
• Sports results

• News headlines
• Music downloads

2.5 User Interface

A user interface is a system by which people (users) interact with a machine. The user interface includes hardware (physical) and software (logical) components. User interfaces exist for various systems and provide a means of:
• Input, allowing the user to manipulate a system, and
• Output, allowing the system to indicate the effects of the user's manipulation.

2.5.1 Tips and Techniques
• Consistency
• Set standards and stick to them
• Be prepared to hold the line
• Explain the rules
• Navigation between major user interface items is important
• Navigation within a screen is important
• Word your messages and labels effectively
• Understand the UI widgets
• Look at other applications with a grain of salt
• Use colour appropriately
• Follow the contrast rule
• Align fields effectively
• Expect your users to make mistakes
• Justify data appropriately
• Your design should be intuitive
• Don't create busy user interfaces
• Group things effectively
• Take an evolutionary approach

2.5.2 Types of User Interface

• Graphical user interfaces (GUI)
Graphical user interfaces accept input via devices such as the computer keyboard and mouse and provide articulated graphical output on the computer monitor. There are at least two different principles widely used in GUI design: object-oriented user interfaces (OOUIs) and application-oriented interfaces.
• Web-based User Interfaces or Web User Interfaces (WUI)

Web user interfaces (WUI) accept input and provide output by generating web pages which are transmitted via the Internet and viewed by the user using a web browser program. Newer implementations utilize Java, AJAX, Adobe Flex, Microsoft .NET or similar technologies to provide real-time control in a separate program, eliminating the need to refresh a traditional HTML-based web browser. Administrative web interfaces for web servers, servers and networked computers are often called control panels.
• Command-line Interfaces
Command-line interfaces, where the user provides the input by typing a command string with the computer keyboard and the system provides output by printing text on the computer monitor. Used for system administration tasks, etc.
• Tactile Interfaces
Tactile interfaces supplement or replace other forms of output with haptic feedback methods. Used in computerized simulators, etc.
• Touch Interfaces
Touch interfaces are graphical user interfaces using a touchscreen display as a combined input and output device. Used in many types of point of sale, industrial processes and machines, and self-service machines, etc.
• Attentive User Interfaces
Attentive user interfaces manage the user's attention, deciding when to interrupt the user, the kind of warnings, and the level of detail of the messages presented to the user.
• Batch Interfaces
Batch interfaces are non-interactive user interfaces, where the user specifies all the details of the batch job in advance of batch processing, and receives the output when all the processing is done. The computer does not prompt for further input after the processing has started.
• Conversational Interface Agents
Conversational interface agents attempt to personify the computer interface in the form of an animated person, robot, or other character (such as Microsoft's Clippy the paperclip), and present interactions in a conversational form.
• Crossing-based Interfaces
Crossing-based interfaces are graphical user interfaces in which the primary task consists of crossing boundaries instead of pointing.
• Gesture Interfaces

Gesture interfaces are graphical user interfaces which accept input in the form of hand gestures, or mouse gestures sketched with a computer mouse or a stylus.
• Intelligent User Interfaces
Intelligent user interfaces are human-machine interfaces that aim to improve the efficiency, effectiveness and naturalness of human-machine interaction by representing, reasoning and acting on models of the user, domain, task, discourse and media (e.g. graphics, natural language, gesture).
• Multi-screen Interfaces
Multi-screen interfaces employ multiple displays to provide a more flexible interaction. This is often employed in computer game interaction, in both the commercial arcades and, more recently, the handheld markets.
• Non-command User Interfaces
Non-command user interfaces observe the user to infer his or her needs and intentions, without requiring that he or she formulate explicit commands.
• Reflexive User Interfaces
Reflexive user interfaces allow the users to control and redefine the entire system via the user interface alone, for instance to change its command verbs. Typically this is only possible with very rich graphical user interfaces.
• Tangible User Interfaces
Tangible user interfaces place a greater emphasis on touch and the physical environment or its elements.
• Text User Interfaces
Text user interfaces are user interfaces which output text, but accept other forms of input in addition to, or in place of, typed command strings.
• Voice User Interfaces
Voice user interfaces accept input and provide output by generating voice prompts. The user input is made by pressing keys or buttons, or by responding verbally to the interface.
• Natural Language Interfaces
Natural language interfaces are used for search engines and on web pages. The user types in a question and waits for a response.
• Zero-input Interfaces
Zero-input interfaces get input from a set of sensors instead of querying the user with input dialogs.

• Zooming User Interfaces
Zooming user interfaces are graphical user interfaces in which information objects are represented at different levels of scale and detail, and where the user can change the scale of the viewed area in order to show more detail.

2.5.3 Need for User Interface

Some points are:
• Human factors matter in interface design: users have limited short-term memory.
• When users make mistakes and systems go wrong, alarms and messages can increase stress.
• People have a wide range of capabilities, so designers should design for the user group and according to their capabilities.
• Users will have different interaction preferences.

2.5.4 Design Principles for User Interface Design

The user interface should be designed to match the skills, experience and expectations of its anticipated users. Interfaces should be designed in such a way as:
• To explain different interaction styles and their use
• To explain when to use graphical and textual information presentation
• To explain the principal activities and approaches to system execution
• System users often judge a system by its interface rather than its functionality.
• A poorly designed interface can cause a user to make catastrophic errors.


Unit III: Digital Content Creation: Organization and Management, File Formats

3.1 Tips for Digital Content Creation

Digital content in a digital library may include a combination of structured/unstructured text, numeric data, scanned images and other multimedia objects. These digital objects need to be organized and made accessible to the user community. As digital libraries are built around WWW and Internet technology, they use the object and addressing protocols of the Internet.
• Digital content means more than just words. While words and images are key, a website is now really a collection of many powerful tools designed to interact in different ways with visitors. Think about all the digital possibilities available today, including audio, video, graphics and animation.
• Stay true to your message. All the blogs or video in the world cannot help a confused message. When creating digital content, focus on that one key message, theme or storyline, and then tell it! White papers, press releases, blogs and podcasts are just some of the ways companies can offer up compelling content.
• Words matter, particularly the right words. Search engine optimization (SEO) is directly tied to the words you choose. By selecting appropriate long-tail keywords, you increase your chances of connecting the right people and information -- and rising above the traffic noise.
• Stretch your creative side. Mixing digital mediums can be incredibly fun, so let it go when working with images or interviewing that CEO for a podcast.
• Make more connections. Facebook, Twitter, LinkedIn and Google+ all offer tremendous opportunities to connect.
• Keep learning. The world of digital content creation is still young. With new software and tutorials springing up daily, you will never be short of ideas or resources that can help your project along. Digital content creation can be an exciting journey, particularly when you discover a new way to tell your story through sound or images.

3.2 Digital Content Creation Services

Digital Content Creation offers the following services in support of the digitization of Library holdings:
• Project planning and consultation
• Full-book digitization
• Digitization of maps, images, letters, archival material, slides, microfiche, microfilm, filmstrips and three-dimensional objects
• Optical character recognition (OCR)
• OCR correction
• Color-managed workflow
• Image processing

3.3 Equipment
• Digital cameras and copystand
• Large-format Graphtec CS610 Pro scanner (for originals up to 42" wide)
• Epson GT 15000 flat-bed scanner, 11 x 17, with sheet feeder
• Plustek Opti Book A300 flat-bed scanner, 11 x 17 (book friendly)
• Wilkes & Wilson Scan Station FS300 microfiche scanner
• Nikon Super Cool Scan 4000 microfilm and slide scanner

3.4 Advantages of Digital Content
• Students gain access to information traditionally found in textbooks at a lower price and in a more convenient format with an e-reader
• Mobility matters: students can access digital content on the go with a smartphone
• Digital content spurs classroom collaboration through an interactive whiteboard
• No more bulky backpacks! Digital content can be accessed through the traditional desktop, which every campus has

Digital content creation includes metadata, Dublin Core, CCF and MARC 21.

3.5 Metadata

Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called data about data, or information about information.
• Data about data: digital-speak for what librarians have been doing since long before the Internet (surrogates, catalogues)
• A metadata record consists of a set of attributes, or elements, necessary to describe the resource in question

• Structured information
• Describes, explains and locates an information resource
• Makes it easier to retrieve, use or manage an information resource

[Figure: metadata records, each with elements such as Author, Title, Subject, Source type and Format, describing different kinds of resources: documents, audio, video, databases and other digital material.]

3.5.1 Metadata Schemes and Element Sets

Some of the most common schemes are discussed below:
• Dublin Core
• Metadata Encoding and Transmission Standard (METS)
• Metadata Object Description Schema (MODS)
• Learning Object Metadata
• The Encoded Archival Description (EAD)
• MPEG Multimedia Metadata

3.5.1.1 Dublin Core

The Dublin Core is a standard for cross-domain information resource description. It provides a simple and standardised set of conventions for describing things online in ways that make them easier to find. Dublin Core is widely used to describe digital materials such as video, sound, image, text and composite media like web pages.

Implementations of Dublin Core typically make use of XML and are based on the Resource Description Framework. Dublin Core is defined by ISO Standard 15836 and NISO Standard Z39.85-2007. The Dublin Core consists of 15 elements:

• Element Name: Title
  Label: Title
  Definition: A name given to the resource.
  Comment: Typically, Title will be a name by which the resource is formally known.

• Element Name: Creator
  Label: Creator
  Definition: An entity primarily responsible for making the content of the resource.
  Comment: Examples of Creator include a person, an organization, or a service. Typically, the name of a Creator should be used to indicate the entity.

• Element Name: Subject
  Label: Subject and Keywords
  Definition: A topic of the content of the resource.
  Comment: Typically, Subject will be expressed as keywords, key phrases, or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme.

• Element Name: Description
  Label: Description
  Definition: An account of the content of the resource.
  Comment: Examples of Description include, but are not limited to, an abstract, table of contents, reference to a graphical representation of content, or a free-text account of the content.

• Element Name: Publisher
  Label: Publisher
  Definition: An entity responsible for making the resource available.
  Comment: Examples of Publisher include a person, an organization, or a service. Typically, the name of a Publisher should be used to indicate the entity.

• Element Name: Contributor
  Label: Contributor
  Definition: An entity responsible for making contributions to the content of the resource.
  Comment: Examples of Contributor include a person, an organization, or a service. Typically, the name of a Contributor should be used to indicate the entity.

• Element Name: Date
  Label: Date
  Definition: A date of an event in the lifecycle of the resource.
  Comment: Typically, Date will be associated with the creation or availability of the resource. Recommended best practice for encoding the date value is defined in a profile of ISO 8601 [W3CDTF] and includes (among others) dates of the form YYYY-MM-DD.

• Element Name: Type
  Label: Resource Type
  Definition: The nature or genre of the content of the resource.
  Comment: Type includes terms describing general categories, functions, genres, or aggregation levels for content. Recommended best practice is to select a value from a controlled vocabulary (for example, the DCMI Type vocabulary [DCT]). To describe the physical or digital manifestation of the resource, use the Format element.

• Element Name: Format
  Label: Format
  Definition: The physical or digital manifestation of the resource.
  Comment: Typically, Format will include the media-type or dimensions of the resource. Format may be used to identify the software, hardware, or other equipment needed to display or operate the resource. Examples of dimensions include size and duration. Recommended best practice is to select a value from a controlled vocabulary (for example, the list of Internet Media Types [MIME] defining computer media formats).

• Element Name: Identifier
  Label: Resource Identifier

  Definition: An unambiguous reference to the resource within a given context.
  Comment: Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Formal identification systems include but are not limited to the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL)), the Digital Object Identifier (DOI), and the International Standard Book Number (ISBN).

• Element Name: Source
  Label: Source
  Definition: A reference to a resource from which the present resource is derived.
  Comment: The present resource may be derived from the Source resource in whole or in part. Recommended best practice is to identify the referenced resource by means of a string or number conforming to a formal identification system.

• Element Name: Language
  Label: Language
  Definition: A language of the intellectual content of the resource.
  Comment: Recommended best practice is to use RFC 3066 [RFC3066], which, in conjunction with ISO 639 [ISO639], defines two- and three-letter primary language tags with optional subtags. Examples include "en" or "eng" for English, "akk" for Akkadian, and "en-GB" for English used in the United Kingdom.

• Element Name: Relation
  Label: Relation
  Definition: A reference to a related resource.
  Comment: Recommended best practice is to identify the referenced resource by means of a string or number conforming to a formal identification system.

• Element Name: Coverage
  Label: Coverage
  Definition: The extent or scope of the content of the resource.
  Comment: Typically, Coverage will include spatial location (a place name or geographic coordinates), temporal period (a period label, date, or date range), or jurisdiction (such as a named administrative entity).

  Recommended best practice is to select a value from a controlled vocabulary (for example, the Thesaurus of Geographic Names [TGN]) and to use, where appropriate, named places or time periods in preference to numeric identifiers such as sets of coordinates or date ranges.

• Element Name: Rights
  Label: Rights Management
  Definition: Information about rights held in and over the resource.
  Comment: Typically, Rights will contain a rights management statement for the resource, or reference a service providing such information. Rights information often encompasses Intellectual Property Rights (IPR), Copyright, and various Property Rights. If the Rights element is absent, no assumptions may be made about any rights held in or over the resource.
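As noted above, implementations of Dublin Core typically use XML. The sketch below builds a small record with Python's standard library; the element values are invented for illustration and only a subset of the 15 elements is filled in. The namespace URI is the standard Dublin Core element set namespace.

import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
ET.register_namespace("dc", DC)

record = ET.Element("record")
for element, value in [
    ("title", "Introduction to Digital Libraries"),
    ("creator", "Example, A. N."),
    ("subject", "Digital libraries"),
    ("date", "2008-01-15"),                # YYYY-MM-DD form, as recommended
    ("type", "Text"),
    ("format", "application/pdf"),
    ("identifier", "http://example.org/docs/123"),
    ("language", "en"),
]:
    ET.SubElement(record, f"{{{DC}}}{element}").text = value

print(ET.tostring(record, encoding="unicode"))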

3.5.1.2 MARC 21 (Machine Readable Cataloguing)

MARC 21 defines a bibliographic data format that was developed by Henriette Avram at the Library of Congress beginning in the 1960s. It provides the protocol by which computers exchange, use and interpret bibliographic information. Its data elements make up the foundation of most library catalogues used today. There are five MARC 21 communication formats:
• Bibliographic Data
• Authority Data
• Holdings Data
• Classification Data
• Community Information
Each MARC format provides detailed field descriptions and guidelines for applying the defined content designation, and identifies conventions to be used to ensure input consistency.

3.5.1.2.1 MARC Format

• Authority Records
An authority record provides information about individual names, subjects and uniform titles. It also establishes an authorized form of each heading, with references as appropriate from other forms of the heading.
• Bibliographic Records
A bibliographic record describes the intellectual and physical characteristics of bibliographic resources such as books, sound recordings, video recordings, and so on.
• Classification Records
Classification records contain classification data. For example, the Library of Congress Classification has been encoded using the MARC 21 Classification format.
• Community Information Records
MARC records describing a service-providing agency, for example the local homeless shelter or tax assistance provider.
• Holdings Records
Holdings records provide copy-specific information on a library resource, such as call number, shelf location, volumes held, and so on.

3.5.1.2.2 Elements of a MARC Record

A MARC record is composed of three elements:
• Record structure refers to the way the various elements in a record are identified. For example, different types of information are recorded in fields which are identified by three numeric characters called tags. Record structure is an implementation of the international standard Format for Information Exchange (ISO 2709) and its American counterpart, Bibliographic Information Interchange (ANSI/NISO Z39.2), and is described by the various MARC formats. Record structure is fully described in MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media.
• Content designation refers to the codes and conventions established explicitly to identify and further characterize the data elements within a record and to support the manipulation of that data. Content designation is defined by each of the MARC formats.
• Content of the data elements that comprise a MARC record is usually defined by standards outside the formats, such as the International Standard Bibliographic Description (ISBD), Anglo-American Cataloguing Rules, Library of Congress Subject Headings (LCSH), or other cataloguing rules, subject thesauri and classification schedules used by the organization that creates a record. The content of certain coded data elements is defined in each of the MARC formats, e.g. the Leader, field 008.

3.5.1.2.3 Components of a MARC Record

A MARC record consists of three main components:
• Leader:

Data elements that provide information for the processing of the record. The data elements contain numbers or coded values and are identified by relative character position. The Leader is fixed in length at 24 character positions and is the first field of a MARC record.
• Directory:
A series of entries that contain the tag, length and starting location of each variable field within a record. Each entry is 12 character positions in length. Directory entries for variable control fields appear first, sequenced by tag in increasing numerical order. Entries for variable data fields follow, arranged in ascending order according to the first character of the tag. The stored sequence of the variable data fields in a record does not necessarily correspond to the order of the corresponding Directory entries. Duplicate tags are distinguished only by the location of the respective fields within the record. The Directory ends with a field terminator character.
• Variable Fields:
The data in a MARC record is organized into variable fields, each identified by a three-character numeric tag that is stored in the Directory entry for the field. Each field ends with a field terminator character. The last variable field in a record ends with both a field terminator and a record terminator. There are two types of variable fields:
• Variable control fields: the 00X fields. These fields are identified by a field tag in the Directory but contain neither indicator positions nor subfield codes. The variable control fields are structurally different from the variable data fields. They may contain either a single data element or a series of fixed-length data elements identified by relative character position.
• Variable data fields: the remaining variable fields defined in the format (0XX-9XX). In addition to being identified by a field tag in the Directory, variable data fields contain two indicator positions stored at the beginning of each field and a two-character subfield code preceding each data element within the field. (See Content Designators for additional information on indicator positions and subfield codes.)
The variable data fields are grouped into blocks according to the first character of the tag, which, with some exceptions, identifies the function of the data within the record. The type of information in the field is identified by the remainder of the tag. For example, in all five formats the 0XX block contains Number and Code fields for control information, identification and classification numbers, etc. Within blocks in some of the formats, certain parallels of content designation are preserved. Meanings, with some exceptions, are given to the final two characters of the tag of fields, as shown below; a small parsing sketch follows the table.

X00  Personal names       X40  Bibliographic titles
X10  Corporate names      X50  Topical terms
X11  Meeting names        X51  Geographic names
X30  Uniform titles       X55  Genre/form terms
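Putting the Leader and Directory rules above into code makes the fixed-width layout clear. The Python sketch below parses only the directory of a record; the blank 24-character leader and the two directory entries are fabricated for illustration.

FIELD_TERMINATOR = "\x1e"   # ends the directory (and each variable field)
LEADER_LEN = 24             # the Leader is fixed at 24 character positions
ENTRY_LEN = 12              # each directory entry is 12 character positions

def parse_directory(record: str):
    directory = record[LEADER_LEN:record.index(FIELD_TERMINATOR)]
    for i in range(0, len(directory), ENTRY_LEN):
        entry = directory[i:i + ENTRY_LEN]
        tag = entry[0:3]             # three-character numeric tag
        length = int(entry[3:7])     # four-character field length
        start = int(entry[7:12])     # five-character starting position
        yield tag, length, start

fake_record = " " * 24 + "245002000000650001500020" + "\x1e"
print(list(parse_directory(fake_record)))  # [('245', 20, 0), ('650', 15, 20)]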

MARC 21 is the result of the combination of the United States and Canadian MARC formats (USMARC and CAN/MARC). MARC 21 is based on the ANSI standard Z39.2, which allows users of different software products to communicate with each other and to exchange data. MARC 21 was designed to redefine the original MARC record format for the 21st century and to make it more accessible to the international community. MARC 21 has formats for the following five types of data: Bibliographic Format, Authority Format, Holdings Format, Community Format and Classification Data Format. Currently MARC 21 has been implemented successfully by the British Library, the European institutions and the major library institutions in the United States and Canada.

MARC 21 allows the use of two character sets, either MARC-8 or Unicode encoded as UTF-8. MARC-8 is based on ISO 2022 and allows the use of Hebrew, Cyrillic, Arabic, Greek and East Asian scripts. MARC 21 in UTF-8 format allows all the languages supported by Unicode.

3.6 File Formats

File formats can be grouped into the following types.

3.6.1 Formats for Text or Alphanumeric Data

3.6.1.1 ASCII Text (*.txt)

This is the most basic file format used to transfer data on the Internet. The ASCII (American Standard Code for Information Interchange) text format tends to be the most portable format because it is supported by almost all applications on most platforms. ASCII, or plain, text files contain data made up of ASCII characters: each byte in the file contains one character that conforms to the standard ASCII character set.

Text editors such as Windows Notepad and the DOS Editor generate ASCII text as their native file format. Examples of ASCII text files include program source code, batch macros and scripts.

Strengths: ASCII text is the lowest common denominator in file formats and is almost universally supported across applications and platforms.

Weaknesses: Very limited in terms of formatting and multimedia support. Problems can occur in rendering the text when transferring files between computers which use different coded character sets. For example, the US-ASCII (7-bit) character set is limited to 128 representations of characters, whereas more than 250 characters are needed to correctly represent European languages based on the Roman alphabet. Modern character sets define extensions to US-ASCII, specifying values above 127 representing special Latin characters and also characters from non-Latin writing systems such as Cyrillic and Hebrew.

Extension  Type  Details                                 Opens With
.TXT       TXT   Standard MS-DOS text format             MS Notepad, MS WordPad, all text-based programs
.DOC       TXT   Text document for Windows               MS WordPad, MS Works, MS Word, Word Perfect
.RTF       TXT   Microsoft Rich Text Format              MS WordPad, MS Works, MS Office
.LOG       TXT   Log files (e.g. error logs, chat logs)  Logview, MS Notepad, MS WordPad
.XLS       TXT   Spreadsheet                             MS Works, MS Excel
.PDF       TXT   Adobe Acrobat files                     Adobe Acrobat Reader, Foxit Reader, or PDF Reader
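The character-set weakness described above is easy to demonstrate in Python: the 7-bit US-ASCII repertoire cannot carry accented European characters, which survive only in an extended encoding such as UTF-8.

# The accented character is lost under ASCII but preserved under UTF-8.
text = "café"
print(text.encode("ascii", errors="replace"))  # b'caf?'
print(text.encode("utf-8"))                    # b'caf\xc3\xa9'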

3.6.2 Structured Markup Formats

3.6.2.1 HTML (*.html, *.htm, *.shtml)

HTML (Hypertext Markup Language) is the main document format used for publishing data on the World Wide Web. It is an evolving standard derived from SGML, where pages are constructed with HTML tags embedded in the text to define the page layout, tables and graphic elements, as well as hypertext links to other documents on the web. Although HTML was originally developed by W3C to be a common means of suggesting how a document could be displayed (i.e. its structure rather than its appearance), many HTML pages now contain extensive, detailed formatting information which is specific to one type of browser.

Strengths: Wide availability of (free) applications which can create and render HTML documents. The ease of transfer of HTML documents over the Web using the Hypertext Transfer Protocol (HTTP), combined with a standard means of locating files by the Uniform Resource Locator (URL), means that users have access to a powerful, interactive global information resource.

Weaknesses: The continuing development of vendor-specific tags and the proliferation of the use of HTML as presentational, rather than semantic, markup are factors which potentially undermine the original concept of HTML. Both of the above factors mean that HTML is slowly being turned into a presentational markup language for vendor-specific browsers such as Netscape Navigator and Microsoft Internet Explorer. Other weaknesses include a lack of internal structure in HTML and the inability to define data input fields as required.

3.6.2.2 SGML (*.sgml)

SGML (Standard Generalised Markup Language) is a standard (ISO 8879) generic markup language which describes the interrelationship between a document's content and structure. The interrelationships are defined in a DTD (Document Type Definition), which may be embedded within the document. SGML does not specify the format of a document; instead it specifies the tagging elements, which are interpreted to format elements in different ways by different viewers. SGML therefore provides a means to allow the sharing and re-use of document-based information across applications and platforms in an open, vendor-neutral format.

Strengths: SGML is a rich and flexible language of great value in managing large documents which need to be frequently updated and capable of being printed in different formats. SGML provides a very stable means of information exchange and is a reliable, industry-approved format for archiving.

Weaknesses:

The fact that SGML is a very comprehensive language means that it is relatively complex, and this has curtailed its adoption in many environments. Browser support is a potentially limiting factor, especially when compared to the availability of HTML browsers.

3.6.2.3 XML (*.xml)

XML (Extensible Markup Language) is a pared-down version of SGML designed specifically for Web deployment. XML was developed by W3C to allow page designers to create customised tags based around the nature of the content of the document. This flexibility means that XML pages, when used with an appropriate XML processing application, can function in a way similar to database records.

Strengths: XML provides a simple means of creating new document types, thus allowing developers to escape the restrictions of HTML while also avoiding the complexities of SGML syntax. XML incorporates a greater degree of functionality within documents than HTML, but retains the ease with which documents can be served and processed.

Weaknesses: There is currently limited native browser support for XML, although the current generation of browsers can be used to view XML documents via plug-ins.

3.6.3 Page Description Formats

3.6.3.1 PostScript (*.ps)

PostScript is a device-independent page description language (PDL) developed by Adobe Systems in 1984 to describe the appearance of text and graphics on printed or displayed pages. It is widely used on all computer platforms as a language for printing documents: a word processing application, for example, will translate its proprietary format into a PostScript program which is then interpreted by the laser printer and output as a graphical image. Encapsulated PostScript (EPS) is a scaled-down version of the full PostScript programming language used to exchange graphic images between applications in the PostScript format.

Strengths: Good for printed output. PostScript is the de facto standard in graphical output and is very widely used in the publishing industry because of its ability to maintain image quality when used with high-resolution imagesetters. Files saved in PostScript format can be easily printed by anyone with access to a laser printer, or output through any other device with a PostScript interpreter. It is very commonly used in the science and engineering communities.

Weaknesses: Viewing PostScript files on screen requires a viewer such as Ghostview. To print correctly, the client computer must have access to the fonts specified in the PostScript file. Unfortunately these fonts tend to be copyrighted, which means that developers wishing to distribute files to a wide audience are constrained to using standard PostScript fonts in order to avoid possible copyright infringements. PostScript's main weakness, however, is its tendency to generate large files unsuitable for document interchange over the web.

3.6.3.2 PDF (*.pdf)

PDF (Portable Document Format) is a physical markup language developed by Adobe Systems to overcome some of the weaknesses of PostScript, both in terms of the efficient transfer of documents over networks and in terms of the need to include copyrighted fonts to ensure accurate reproduction. PDF makes it possible to send documents containing complex formatting, multiple images and non-standard fonts to users, who can then view or print the file exactly as intended by the document creator. The owner or creator of a PDF document can build in a number of security features when saving the file. These features include assigning levels of password protection to the document to restrict access, together with a number of customisable options aimed at limiting a user's ability to print and edit the saved document. Adobe developed several software applications to handle PDF files, including Acrobat Reader for viewing and Acrobat Exchange for editing.

Strengths: PDF is a device- and resolution-independent means of efficiently transferring richly formatted documents between computers and across platforms. It is a well-established standard which has been widely adopted by publishers for on-line distribution of journals, manuals and books. Creators of PDF documents can incorporate various security features into the document to restrict access.

Weaknesses: Viewing PDF files with common Web browsers requires an external helper application or plug-in. PDF files are not easily reformatted or edited without the appropriate Adobe-produced tools. As a proprietary format, PDF is vulnerable to the whims of the producing company.

3.6.4 Graphic Formats

3.6.4.1 BMP (*.bmp)

Windows bitmaps store a single raster image in any colour depth, from black and white to 24-bit colour. The Windows bitmap file format is compatible with other Microsoft Windows programs. It does not support file compression and is not suitable for Web pages. Overall, the disadvantages of this file format outweigh the advantages: for photographic-quality images, a PNG, JPG or TIFF file is often more suitable. BMP files are suitable for wallpaper in Windows.

Advantages:
• 1-bit through 24-bit colour depth
• Widely compatible with existing Windows programs, especially older programs

Disadvantages:
• No compression, which results in very large files
• Not supported by Web browsers

3.6.4.2 GIF (*.gif)

Graphics Interchange Format is one of the most widely used graphics formats on the Web. The popularity of the bitmapped GIF is due mainly to two factors. First, it is a truly hardware-independent, cross-platform format; second, GIF uses the powerful, lossless LZW (Lempel-Ziv-Welch) compression algorithm to optimise file sizes. LZW works by identifying and storing patterns of data within the image. These repeating patterns are then referred to via index numbers in the compressed file (a toy sketch of the idea appears below). GIF supports colour depths from 1 to 8 bits (256 colours) per pixel and resolutions up to 65,536 by 65,536 pixels.
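The following is a toy Python sketch of the LZW idea just described: repeating patterns are learned into a table and emitted as index numbers. Real GIF encoders apply this to pixel data with variable-width codes; this simplified version compresses a byte string.

def lzw_compress(data: bytes) -> list:
    table = {bytes([i]): i for i in range(256)}  # all single-byte patterns
    pattern, codes = b"", []
    for value in data:
        candidate = pattern + bytes([value])
        if candidate in table:
            pattern = candidate             # keep extending a known pattern
        else:
            codes.append(table[pattern])    # emit index of the known pattern
            table[candidate] = len(table)   # learn the new, longer pattern
            pattern = bytes([value])
    if pattern:
        codes.append(table[pattern])
    return codes

print(lzw_compress(b"ABABABAB"))  # the repeated 'AB' collapses to indices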
Strengths: An established, proven colour graphics standard for generating highly compressed raster images capable of being shared across different platforms. GIF is widely supported by a large number of viewing and editing programs. GIF is an ideal graphics choice for drawings, cartoons, line art, icons and any image with large (horizontal) blocks of single colours.

Weaknesses: GIF's 8-bit colour limitation produces a noticeable degradation in the quality of some images, especially photo-realistic images and images containing smooth colour gradients. Possible future complications regarding the use of the patented LZW compression algorithm have hastened the search for an alternative format.

3.6.4.3 JPEG (*.jpg, *.jpeg, *.jif)

JPEG is named after the Joint Photographic Experts Group, who developed the format for compressing true-colour digital images. The name JPEG specifically refers to a method of lossy compression rather than a particular file format; indeed, many images known as 'JPEG' are actually 'JFIF' images, after the JPEG File Interchange Format. The Experts Group took into consideration a large amount of research from fields such as human vision and computer graphics before producing a detailed recommendation for a compression technique to reduce the size of images. The compression technique works by exploiting weaknesses in human visual perception (greater sensitivity to changes in brightness than in colour) and discards graphic information which is effectively redundant in natural images. The result is the ability to reduce the file sizes of digital images to as little as 1/25th of their original value, with little or no loss of perceived image quality. JPEG has been updated to allow sequential transmission of image data, similar to interlaced GIFs, where images are overlaid to give a graphic of increasing quality. Such images are known as progressive JPEGs.

Strengths: JPEG is a widely supported format in terms of viewing and editing applications. JPEG compression gives very high compression rates with high perceived image quality and handles high-resolution images with colour depths of 24 bits and above. The newer progressive JPEG format allows for quicker initial display of images, similar to GIF interlacing.

Weaknesses: JPEG is a lossy compression method, i.e. it discards information each time it is used to compress an image. This means that the image will usually have to be stored in an intermediate format if it is likely to be further manipulated. There is no agreed standard for measuring JPEG compression, and each editing application has developed its own compression or quality scale.

3.6.4.4 PNG (*.png)

In order to overcome the two main weaknesses of GIF (a maximum colour depth of 256 colours and dependence on a patented compression method), software developers designed the PNG (Portable Network Graphics) format. The PNG format has improved on GIF in a number of key areas, including: increased colour depth up to 48-bit, inclusion of an alpha channel to give layer transparency, incorporation of gamma and colour correction for cross-platform consistency, two-dimensional interlacing, slightly improved compression, and the ability to display and print at different resolutions. In order to limit the complexity of the compression algorithm, PNG's developers decided not to include multiple-image support (animation). In addition, PNG does not support lossy compression, because its developers believed that JPEG was a satisfactory standard in that area.

Strengths: Support for true-colour images and variable transparency. PNG provides an improved means of progressive display through two-dimensional interlacing, and its lossless compression means that it is a good format choice for storing intermediate-stage images. PNG is patent- and licence-free, and the royalty-free source code required to read and write PNG files is freely available.

Weaknesses: Although PNG is a very powerful format which has been in existence for over three years, it is still not widely supported, especially by Web browsers. PNG has limited native support in Microsoft Internet Explorer 4+ and Netscape Navigator 4.04+. A number of third-party plug-ins are available to facilitate viewing through earlier versions of Internet Explorer and Navigator.

3.6.4.5 TIFF (*.tif, *.tiff)

Tagged Image File Format is a file format mainly for storing images, including photographs and line art. Originally created by the company Aldus, jointly with Microsoft, for use with PostScript printing, TIFF is a popular format for high-colour-depth images, along with JPEG and PNG. The TIFF format is widely supported by image manipulation applications such as Photoshop by Adobe, GIMP, Ulead PhotoImpact and Paint Shop Pro by Jasc; by desktop publishing and page layout applications such as QuarkXPress and Adobe InDesign; and by scanning, faxing, word processing, optical character recognition and other applications. Adobe Systems, which acquired the PageMaker publishing program from Aldus, now controls the TIFF specification.

3.6.5 Movie File Formats

3.6.5.1 AVI (*.avi)

AVI (Audio Video Interleave) is a format developed by Microsoft for its Windows platform. AVI interleaves audio and video data to provide 8-bit colour-depth animation at 160x120 pixel resolution and audio at 11,025 Hz in 8-bit samples. An AVI file plays on a PC via an application capable of reading the AVI file header and consecutively pulling in the video frames and accompanying audio. The video is then decompressed and displayed in sequence with the audio samples, which have been sent to the soundcard for output. AVI is a specialisation of the RIFF (Resource Interchange File Format).

Strengths: AVI is the most popular format for audio/video data on the PC and is widely supported on the Windows platform.

Weaknesses: Although AVI can be played on other platforms, it is generally perceived as a Windows-only format, and a proprietary Microsoft one at that. AVI is of relatively low quality, with a limiting frame rate and mono sound.

3.6.5.2. MPEG (*.mpg)

MPEG is named after the Motion Picture Experts Group, a committee organized in 1988 by the ISO to develop international standards and file formats for video compression. MPEG has produced a number of related standards in digital moving picture formats, which have since been incorporated into a wide range of video compression hardware and software products. MPEG-1, the original video format used mainly on CD-ROMs and Video CDs, provides an optimal resolution of 352x240 pixels at 30 fps (frames per second) with 24-bit colour and CD-quality sound. MPEG-2 is capable of supporting broadcast-quality video over high-speed connections, with resolution superior to VHS. This standard is used in DVD (Digital Video Disc) movies and supports resolutions of 720x480 and 1280x720 pixels at 60fps. An MPEG-3 specification (for HDTV) was previously under development, but its requirements were found to be covered by MPEG-2. MPEG-4 is a developing standard based on the QuickTime file format, intended to support lower-quality video over modem-speed (up to 56Kbps) data connections, e.g. video telephony. MPEG-2 is backwards-compatible with MPEG-1, i.e. MPEG-2 players are capable of playing MPEG-1 streams. MPEG incorporates the same intra-frame coding as JPEG for each frame. This technique, in combination with inter-frame coding, allows for optimal compression of video data. The inter-frame compression uses a technique called DCT (Discrete Cosine Transform) to encode only the changes between periodic key frames, known as I-frames, instead of storing the entire contents of each frame. MPEG files can be decoded by special motherboard hardware or by software.

Strength: MPEG provides an open, standard format for video (and audio) data compression capable of producing high-quality video at smaller file sizes than comparable video formats. The CPUs in most modern desktop PCs are easily capable of decompressing an MPEG-2 coded data stream.

Weakness: Lossy intra-frame compression in MPEG means that the data removed cannot be recovered. Native MPEG support is present only in the latest versions of Web browsers. For non-supported browsers, MPEGs have to be played through an external helper application or plug-in.

3.6.5.3. Real Video (*.rv)

Real Video is a proprietary video standard developed by the Real Networks software company. The format is a streaming technology, i.e. the compressed files can be played during download, rather than having to wait for the complete file to download before playing.

Strength: Real Video is a well-established format with a large user base. The latest playing software allows users to fine-tune the video in real time and has support for third-party and standard data types such as AVI, WAV, MIDI, MPEG and JPEG.

Weaknesses: Real Video files require Real Server software (a basic version is free) to handle the outgoing streams of packets and RealPlayer software to receive and play the data on the client machine. Real Server is available for Unix and Windows, but not for the Macintosh platform. The latest release of the playing software, RealPlayer Plus G2, is only available in Windows 95 and NT versions, although earlier versions are available for Windows and Macintosh. Real Video also requires an encoder: Real Encoder.
The RealPlayer tends to be memory-intensive and works best on high-spec machines.

3.6.6. Sound Format

3.6.6.1 WAV (or Wave)

Short for Waveform Audio Format, WAV is a Microsoft and IBM audio file format standard for storing audio on PCs. It is a variant of the RIFF bit-stream format method for storing data in "chunks" and is thus also close to the IFF and AIFF formats used on Macintosh computers. Both WAVs and AIFFs are compatible with Windows and Macintosh operating systems. It takes into account some characteristics of the Intel CPU, such as little-endian byte order. The RIFF format acts as a "wrapper" for various audio compression codecs. It is the main format used on Windows systems for raw audio. Though a WAV file can hold compressed audio, the most common WAV format contains uncompressed audio in the pulse-code modulation (PCM) format. PCM audio is the standard audio file format for CDs, at 44,100 samples per second. Since PCM uses an uncompressed, lossless storage method which keeps all the samples of an audio track, professional users or audio experts may use the WAV format for maximum audio quality. WAV audio can also be edited and manipulated with relative ease using software.
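To see why uncompressed PCM files are large, the data rate can be worked out directly from the sampling parameters. The following sketch uses the 44,100 samples per second mentioned above, together with 16 bits per sample and two channels, which are standard CD-audio figures assumed here for illustration:

```python
# Data rate of uncompressed PCM audio (CD-quality assumptions).
sample_rate = 44_100      # samples per second, as stated in the text
bits_per_sample = 16      # assumed resolution of each sample
channels = 2              # assumed stereo

bytes_per_second = sample_rate * (bits_per_sample // 8) * channels
print(bytes_per_second)               # 176400 bytes/s (~172 KB/s)
print(bytes_per_second * 60 / 1e6)    # ~10.6 MB for one minute of audio
```

At roughly ten megabytes per minute, the appeal of lossy compression formats and of instruction-based formats such as MIDI (below) becomes obvious.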

3.6.7. MIDI Format

MIDI (Musical Instrument Digital Interface) is a format for sending music information between electronic music devices such as synthesizers and PC sound cards. The MIDI format was developed in 1982 by the music industry. The MIDI format is very flexible and can be used for everything from very simple to truly professional music making. MIDI files do not contain sampled sound but a set of digital musical instructions (musical notes) that can be interpreted by your PC's sound card. The downside of MIDI is that it cannot record sounds (only notes), or to put it another way: it cannot store songs, only tunes.
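As an illustration of "instructions rather than sound": a MIDI note is triggered by a three-byte Note On message, consisting of a status byte followed by a note number and a velocity. The sketch below builds such a message; the particular note and velocity values are arbitrary examples:

```python
# A raw MIDI "Note On" message: status byte, note number, velocity.
NOTE_ON_CH1 = 0x90   # Note On, channel 1
MIDDLE_C = 60        # MIDI note number for middle C
VELOCITY = 100       # how hard the key is struck (0-127)

note_on = bytes([NOTE_ON_CH1, MIDDLE_C, VELOCITY])
print(note_on.hex())  # '903c64': three bytes, not a single sample of audio
```

Three bytes describe an entire note, which is why MIDI files are tiny compared with sampled audio of the same music.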


Unit IV: Digital Resources Management; Access to and Use of Digital Libraries; Storage, Archiving and Preserving Digital Collections.

4.1. Digital Resources

In simple terms, e-resources are nothing but traditional resource materials in electronic or digital form. These resources are stored on magnetic or optical media such as floppies, CDs or DVDs. A significant proportion of these e-resources is available in machine-readable format (as opposed to print), accessible only by computers.

 E-Book: An e-book is a book made available electronically for reading either on a normal computer or laptop or on a special hand-held book reader.

 E-Journal: A journal available in electronic format. A remote-access electronic serial is a continuing resource that is accessed via computer networks; it provides easy access, keyword searchability, accessibility at publication time independent of space and time, interactivity, customization, etc. An example is the EBSCO database.

 E-Database: Database engines provide the facility for research with e-databases. Different types of libraries prepare offline databases for providing services about information and books, such as the OPAC.

4.2 Storage

A digital library repository (DLR) stores the digital objects that constitute the library. The two key requirements that distinguish DLRs from other information stores are archival storage and intellectual property management. The archival nature of a DLR means that the digital objects (e.g., documents, technical reports, movies) must be preserved indefinitely, as technologies and organizations evolve. Intellectual property management is required because digital objects will be served beyond the organization that runs the repository or that owns the information. There are two interrelated factors in the archiving of digital objects: data preservation and meaning preservation.

4.2.1 Layered Architecture: Since each DLR site may be implemented differently, it is important to have well-defined and as simple as possible site interfaces. Furthermore, it is also important to have clean interfaces for services within a site, so that different software systems can be used to implement individual components. The layers include:

 Object Store Layer: The Object Store layer uses a Data Store (e.g., a file system or database management system) to persistently save objects. This layer may use its own scheme to identify objects (e.g., file names, tuple-ids).

 Identity Layer: This layer has two main functions: it provides access to objects via their handles (signatures), and it provides basic facilities for reporting changes to its objects to other interested parties.

 Complex Objects Layer: Manages collections of related objects. Its services could be used to maintain the different versions (or representations) of a document.

 Reliability Layer: Coordinates replication of objects to multiple stores, for long-term archiving. The assumption is that the Object Store layer makes a reasonable effort at reliable storage, but it cannot be counted on to keep objects forever.

 Upper Layers: Provide mechanisms for protecting intellectual property, enforcing security, and charging customers under various revenue models. They can also provide associative search for objects, based on metadata or the contents of objects, as well as user access.

4.2.2 Object Store Layer

The Object Store Layer is the lowest DLR layer. This layer treats objects as sequences of bits and uses local disk-ids to identify objects. The disk-ids are meaningful only to a specific Data Store, and their format varies from data store to data store. For example, if the Data Store is a standard file system and each object is saved in a different file, the disk-id could be the file name. On the other hand, if all objects are saved in a single sequential file, then the disk-id could be the name of that file, the offset into that file, and the length of the object.

4.2.3 Object Store Interface

The interface of the Object Store Layer has the following functions:

 OS_Get(disk_id): Read an object given its disk-id.

 OS_Put(bag_of_bits): disk_id: Insert a new object in the repository and return the disk-id associated with it.

 OS_Awareness(): list_of_disk_ids: List all disk-ids. OS_Awareness() lets a client perform a "scan" of the entire collection. This is the most primitive type of awareness service one can envision. Its simplicity makes it easier to implement an Object Store that is very robust. This awareness service is used by higher layers when they have lost their state, or when they wish to verify their state.
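A minimal sketch of this three-function interface, using a directory on a file system as the Data Store and counter-based file names as disk-ids. All concrete details here (class name, naming scheme, directory) are illustrative assumptions, not the repository's actual implementation:

```python
import os

class ObjectStore:
    """Toy Object Store layer: objects are opaque bags of bits keyed by disk-ids."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def os_get(self, disk_id: str) -> bytes:
        # Read an object given its disk-id (here, simply a file name).
        with open(os.path.join(self.root, disk_id), "rb") as f:
            return f.read()

    def os_put(self, bag_of_bits: bytes) -> str:
        # Insert a new object and return the disk-id assigned to it.
        disk_id = f"obj-{len(os.listdir(self.root))}"
        with open(os.path.join(self.root, disk_id), "wb") as f:
            f.write(bag_of_bits)
        return disk_id

    def os_awareness(self) -> list[str]:
        # List all disk-ids, letting higher layers scan the whole collection.
        return sorted(os.listdir(self.root))

store = ObjectStore("repository")
did = store.os_put(b"a technical report")
print(store.os_get(did), store.os_awareness())
```

Note how the disk-ids are meaningful only to this particular Data Store, exactly as the text describes; a different implementation could return offsets into one large file instead.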

4.2.5 Storing the processed document(s) onto DVD

Transferred digital information is handled by portable storage devices such as recordable tapes, floppies, CDs, etc., and more recently by DVDs. The capacity of the DVD is greater than that of the other devices. Processed documents were stored onto single-layer, single-sided DVD-Rs, 12cm in diameter, with a capacity of 4.7GB (mostly Samsung-Pleomax and Sony). Each DVD contains more than one document, and the details of the stored documents are provided on the DVD by CDAC.

4.3. Preserving Digital Collections

Digital preservation includes the preservation of print and non-print material in digitized form for effective, efficient and purposeful use. The purpose of preservation is to ensure protection of information of enduring value for access by present and future generations. Digital preservation is indeed a very challenging task for libraries and information centres. The future of library and information services is closely associated with preservation, and the new technologies will create, collect, store, process and retrieve information and deliver it across the globe. Several issues of digital preservation, including digital storage during the digitization process, migration of digital material and storage media, have to be faced in order to preserve the rare documents of the library.

4.3.1 Concept of Digital Preservation

A process by which data is preserved in digital form in order to ensure the usability, durability and intellectual integrity of the information contained therein is called digital preservation. Digital preservation comprises the planning, resource allocation and application of preservation methods and technologies necessary to ensure that digital information of continuing value remains accessible and usable. The main purpose is to ensure protection of information of enduring value for access by present and future generations. The term "digital preservation" refers both to the preservation of materials that are created originally in digital form and never exist in print or analog form (also called "born digital" and "electronic records") and to the use of imaging and recording technologies to create digital surrogates of analog materials for access and preservation purposes. This means taking steps to ensure the longevity of electronic documents. It applies to documents that are either "born digital" or stored online (or on CD-ROM, diskettes, DVD or other physical carriers), or to the products of analog-to-digital conversion, if long-term access is intended.

4.3.2 Definition

 Short Definition: Digital preservation combines policies, strategies and actions that ensure access to digital content over time.

 Medium Definition: Digital preservation combines policies, strategies and actions to ensure access to reformatted and born-digital content regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time.

 Long Definition: Digital preservation combines policies, strategies and actions to ensure the accurate rendering of authenticated content over time, regardless of the challenges of media failure and technological change. Digital preservation applies to both born-digital and reformatted content.

The Encyclopedia of Information Technology defines digital preservation as "the process of maintaining, in a condition suitable for use, materials produced in digital formats. Problems of physical preservation are compounded by the obsolescence of computer equipment, software, and storage media. Also refers to the practice of digitizing materials originally produced in non-digital formats (print, film, etc.) to prevent permanent loss due to deterioration of the physical medium."

According to Hedstrom, digital preservation is "the planning, resource allocation, and application of preservation methods and technologies necessary to ensure that digital information of continuing value remains accessible and useable".

"Digital preservation" or "digital archiving" means taking steps to ensure the longevity of electronic documents. It applies to documents that are either "born digital" or "stored online" (or on CD-ROM, diskettes or other physical carriers), or to the products of analog-to-digital conversion, if long-term access is intended.

4.3.3 Steps for preserving electronic resources

 Set up a team for preservation tasks.

 Provide a suitable environment for storage and handling of physical formats.

 Identify electronic materials in the collection that require preservation.

 Formulate a policy on priorities for preservation.

 Create metadata for long-term access.

 Examine technological changes in hardware and software.

4.3.4. Types of Digital Preservation:

When considering digital materials, there are three types of "preservation" one can refer to:

 The preservation of the storage medium. Tapes, hard drives and floppy discs have a very short life span when considered in terms of obsolescence. The data on them can be refreshed, keeping the bits valid, but refreshing is only effective as long as the media are still current. The media used to store digital materials become obsolete in anywhere from two to five years, before they are replaced by better technology. Over the long term, materials stored on older media could be lost because there will no longer be the hardware or software to read them. Thus, libraries will have to keep moving digital information from storage medium to storage medium.

 The preservation of access to content. This form of preservation involves preserving access to the content of documents, regardless of their format. While files can be moved from one physical storage medium to another, what happens when the formats (e.g., Adobe Acrobat PDF) containing the information become obsolete? This is a problem perhaps bigger than that of obsolete storage technologies. One solution is data migration, that is, translating data from one format to another while preserving the ability of users to retrieve and display the information content. However, there are difficulties here too: data migration is costly, there are as yet no standards for data migration, and distortion or information loss is inevitably introduced every time data is migrated from format to format. The bottom line is that no one really knows yet how best to migrate digital information, because the preservation community is only beginning to address migration of complex digital objects and such migration remains "largely experimental." Even if there were adequate technology available today, information would have to be migrated from format to format over many generations, passing a huge and costly responsibility to those who come after.

 The preservation of fixed-media materials through digital technology. This slant on the issue involves the use of digital technology as a replacement for current preservation media, such as microforms. Again, there are as yet no common standards for the use of digital media as a preservation medium, and it is unclear whether digital media are yet up to the task of long-term preservation. Digital preservation standards will be required to consistently store and share materials preserved digitally.

What can libraries jointly do in a coordinated scheme? They can:

 create policies for long-term preservation

 ensure that redundant permanent copies are stored at designated institutions

 help establish preservation standards to consistently store and share materials preserved digitally.

The following standards can be used to describe and preserve documents in digital format:

 MARC 21 Format

 MARCXML

 MODS (Metadata Object Description Schema)

 MADS (Metadata Authority Description Schema)

 EAD (Encoded Archival Description)

4.3.5 Principles of Preservation applied to Digital Preservation

The basic principles of preservation that are practised for the preservation of analogue media are also applicable to preservation in the digital world:

 Longevity: Information stored in digital format does not live forever, because of the fragility of digital works. Replication, adoption and redundancy of hardware, software and data formats help to ensure that what is readable and interpretable today will remain usable long into the future.

 Selection: Selection here is a multistage process, and each stage offers different options: the selection of materials for digital preservation, the selection of tools and technology, and the selection of media and formats. Each selection plays a very important role in the success of the preservation plan.

 Quality: The quality of digital content is required at three stages: first, during the preparation of the specification for the workflow; second, when selecting and handling digital capture; and third, at delivery or access time, to evaluate download time and user-friendly formats. Consistency is the key to ensuring the quality of digital files, so it is necessary to develop a consistent series of processes to ensure that there are no variations in quality, regardless of the different devices used at different stages and times.

 Integrity: Integrity is required to protect access to digital content even when we discard the original storage medium, software and hardware on which the digital content was created, maintained and accessed. Preserving the integrity of digital content also involves developing techniques for verifying whether it has been altered from its original form (a checksum-based sketch follows this list).

 Access: Access to digital content is again a major factor of consideration when we are putting valuable resources online. It is a policy matter for any library to decide how to give access to its digital contents.
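One common technique for the integrity principle above is a fixity check: record a cryptographic checksum when a digital object enters the collection, and recompute it periodically to verify that the object has not been altered. A minimal sketch, in which the file name is an illustrative assumption:

```python
import hashlib

def fixity(path: str) -> str:
    """Compute a SHA-256 checksum of a digital object, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# On ingest: store the checksum alongside the object's metadata.
recorded = fixity("manuscript_scan.tiff")

# On a later audit: any difference signals corruption or alteration.
if fixity("manuscript_scan.tiff") != recorded:
    print("Integrity check failed: the object has changed since ingest.")
```

The same checksum can be re-verified every time the object is refreshed to new media, tying the integrity principle to the refreshing and replication methods discussed later in this unit.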

Fig: Digital Preservation Planning (flowchart): Digital Preservation Plan → Selection of Content → Digitization (outsourced or in-house) → Standardized Digital Contents → Selection of File Format → Back-up Storage Media / Technological Up-gradation → Integration of Resources → Digital Library Server → Intranet/Internet Access → User Interface for Access.


4.3.6 Digitization: Digitization is the conversion of any fixed or analogue media, such as books, journal articles, photos, paintings and microforms, into electronic form through scanning, sampling or even re-keying. An obvious obstacle to digitization is that it is very expensive. How does one decide what parts of a collection to digitize? There are several approaches available, at least theoretically:

 Retrospective conversion of collections: essentially, starting at A and ending at Z. However ideal such complete conversion would be, it is impractical or impossible technically, legally and economically. This approach can arguably be dismissed as a pipe dream.

 Digitization of a particular special collection, or a portion of one. A small collection of manageable size, and which is highly valued, is a prime candidate.

 Highlighting a diverse collection by digitizing particularly good examples of some collection strength.

 Digitizing high-use materials, making those materials that are in most demand more accessible.

 An ad hoc approach, where one digitizes and stores materials as they are requested. This is, however, a haphazard method of digital collection building.

These approaches can be used alone or in combination, depending upon a particular institution's goals for digitization. Nested within these approaches are several criteria for selecting individual items. These include:

 their potential for long-term use;

 their intellectual or cultural value;

 whether they provide greater access than is possible with the original materials (e.g., fragile, rare materials); and

 whether copyright restrictions or licensing will permit conversion.

4.3.7 Using the IR for Digital Collections

 Standardized way of searching and presenting information

 Interface issues become less important if the IR is capable of exposing metadata and disseminating information in different ways

 Cross-collection searching

 Advanced searching

 Centralized storage and delivery

 Consistent and reliable preservation policies and procedures

 Streamline (and advance) staff skills

4.3.8 Common Elements of IR Preservation Policies

 Contents must support teaching, learning and research.

 Data must be in standards-based, non-proprietary formats and, if not, be suitable for conversion to a lesser version ("desiccated data").

 The repository maintains responsibility for checking objects and versions in perpetuity.

 Submission policies, as well as who can be an authorized depositor, vary; in general, content is not removed unless there is a copyright infringement; removed data may be marked by a "tombstone".

 Accepted content is generally completed scholarship.

4.3.9. Methods of preservation of digital materials:

 Refreshing: In this method data is copied periodically from one medium to another to ensure its longevity. An example of refreshing is copying a group of files from floppy disks to CD-ROM.

 Digital Archaeology: Digital archaeology rescues content from damaged media or damaged hardware or software. This emergency recovery strategy involves specialized techniques to recover data from unreadable media, whether the cause is physical damage or hardware failure. It is carried out by data recovery companies. Given enough resources, readable bit-streams can often be recovered even from heavily damaged media (especially magnetic media).

 Migration: Migration means to copy data, or convert data, from one technology to another, whether hardware or software, preserving the essential characteristics of the data. The purpose of migration is to preserve the integrity of digital objects and to retain the ability for clients to retrieve, display and otherwise use them in the face of constantly changing technology.

 Emulation: Emulation uses a special type of software, called an emulator, to translate instructions from original software so that they execute on a new platform.

 Emulation requires the creation of emulators: programs that translate code and instructions from one computing environment so that they can be properly executed in another.

 Replication: Replication is used to describe multiple digital preservation strategies; bit-stream copying is one form of it. The objective is to enhance the longevity of digital documents, while maintaining their authenticity and integrity, through copying and the use of multiple storage locations.

 Encapsulation: Encapsulation is the technique of grouping together a digital object and the metadata necessary to provide access to that object. The grouping process lessens the possibility that any critical component necessary to decode and render a digital object will be lost. Appropriate types of metadata to encapsulate with a digital object include reference, representation, provenance, fixity and context information. Encapsulation is considered a key element of emulation.
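A minimal sketch of encapsulation: the digital object and a small metadata record are packaged in a single archive file, so that neither can be separated from the other. The file names and metadata field values here are illustrative assumptions:

```python
import json
import zipfile

# Metadata to travel with the object: reference, provenance, fixity, context.
metadata = {
    "reference": "report-042",
    "provenance": "scanned from the print original, 2009",
    "fixity": "sha256:...",           # checksum recorded at ingest (placeholder)
    "context": "Unit IV course material",
}

# Encapsulate object and metadata in one package.
with zipfile.ZipFile("report-042.zip", "w") as package:
    package.write("report-042.pdf")                          # the digital object
    package.writestr("metadata.json", json.dumps(metadata))  # its metadata
```

Whatever happens to the surrounding systems, anyone who later opens the package finds both the bits and the information needed to interpret them, which is exactly the risk encapsulation is meant to address.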

4.4. Archiving

4.4.1. Issues in archiving and provision of access

Some issues involved with the archiving of online publishing are unique to the format. Online electronic materials have no physical format to preserve. They can be removed from the Internet, or overwritten, almost instantaneously at any time. Electronic publications are not the same as printed materials, even if the actual information contained in them is identical. They almost always have some important added-value features, such as online links to similar documents, keyword searching and browsing facilities; therefore it is important to preserve the information as well as these value-added features. RELF has roughly divided these issues into three categories:

 Technology issues

 This is the area where the biggest challenges for the future of preserving and managing archives of electronic publications lie.

 If libraries are to make serious efforts at archiving, they need systems which support and automate many of the processes. Only sophisticated, robust, large-scale systems will be able to automate the collection, storage, management, preservation and provision of long-term access to online publications.

 Electronic publications need permanent naming, which allows unique and persistent identification of electronic publications for national bibliographic and resource discovery purposes.

 Issues of access

 Digital archives have an obligation to maintain the information in a form such that users over the network can find it with appropriate retrieval engines and view, print, listen to or otherwise use it with appropriate output devices. With respect to access, digital archives also have the responsibility to manage intellectual property rights, by facilitating transactions between rights-holders in the information and users, and by taking every reasonable precaution to prevent unauthorised use of the material (GARRETT & WATERS). Copyright is often complicated with electronic publications, as there are many creators contributing to one publication. In many electronic journals, the creators of individual articles often retain copyright.

 A rights management system is needed to monitor access to digital publications, manage their use and access as required by the creator, and deal with payment.

 The integrity of an electronic resource must be protected by methods such as encryption or watermarking. A related area is the validation of resources, commonly done by applying copyright and/or other metadata. This confirms which institution or person is providing the resources and who owns the intellectual property.

 Metadata, or structured data, allows resource discovery and access, but is also essential for preservation and administrative purposes.

 Authentication is necessary to ensure that the person accessing electronic resources is who they say they are. Most commonly a login and password or a credit card number verifies this. A related issue is authorization, which allows an authenticated user to access certain resources or services but not others.

4.4.2. Problems of archiving

 Longevity: Documents published on the Internet are volatile in nature, being subject to rapid change and unpredictable removal. In order to ensure long-term access to Internet publications, there is a need to identify items to be archived in a timely manner, so that they can be captured before they disappear.

 Preservation of links: Hypertext links are a feature of online publishing that gives online publications their special appeal. It is relatively easy to maintain links within an individual site, such as links to footnotes or references. It is relatively difficult to maintain links to documents on other sites, mostly because of copyright and the problem of identifying ownership.

 Volume: One of the most daunting aspects of online publications is the huge quantity of them and the rapidity with which many of them grow and change. The challenge is to establish selection guidelines that will lead to the development of a collection that contains all of the materials that researchers of the future are likely to want to see. To manage the large volume of Internet publications it is necessary to collaborate with other collecting institutions, the state libraries for example, to share the workload and avoid duplication. Co-operative collection development policies should be established.

 Identification: Identification means hours and hours of searching the Internet each month, which is very labour-intensive. Widespread use of standard metadata, coupled with improved search engines that would search on this metadata, would assist the process of identification of titles.

 Responsibility: Long-term responsibility must be taken by those collecting institutions which have the preservation of documentary heritage as their primary role. Co-operation with all stakeholders is essential.

 Authentication: Unlike print publications, or even physical-format electronic publications, where content is fixed until a new or revised edition is published, Internet publications can be changed or edited online by their owner without any warning. This may make it difficult to determine which is the correct version. It is also possible that malicious or accidental changes could be made to online materials. Some form of authentication and checking will be necessary. Encryption, hashing, time stamping, watermarking and digital signatures are potential solutions. A lot more work is required before the need of preservation libraries for widely accepted and supported techniques for authenticating Internet documents is met.

 Metadata: It is not sufficient just to create an archive of publications; users must be able to discover what is in it and to gain access to titles. Archive managers must be able to maintain the titles in the archive, to add new issues or versions of a title, to store and access the software needed to operate them, to compress them, to manage access rights and conditions, and to migrate them from proprietary or superseded formats and from obsolescent technologies.

4.4.3. Benefits of digital archiving:

 Shelf-space conservation, a valued commodity in constant demand in the library

 Improved access to reference data

 Easy to cross-reference or cross-link data

 Easy-to-use reference resource that includes online training tutorials

 Easy-to-upgrade resource

 Standard format for spectral data

 Greater searching capability via a quick, "searcher friendly" system

 Intuitive user interface

 Ability to incorporate laboratory data generated by students and faculty

 Simplification of teaching and research.


Unit V: Web Technologies: An Overview; Web Browsers and Services; Mark-up Languages; Web Site Tools and Techniques; Search Engines.

5:1 Introduction

Web technology is the development of the mechanisms that allow two or more computer devices to communicate over a network. For instance, in a typical office setting, a number of computers plus additional devices such as printers may be interconnected via a network, allowing for quick and convenient transmission of information. The processes involved in web technology are complex and diverse, which is why major businesses employ whole departments to deal with them. Web technology has revolutionized communication methods and has made operations far more efficient.

Web technologies involve the concept of a tier. A tier is nothing but a layer in an application. In its simplest form, the Internet is a two-tier application: the web browser and the web server. The technologies that exist in these tiers are as follows:

Client Tier: HTML, JavaScript, CSS

Server Tier: Common Gateway Interface (CGI), Java Servlets, Java Server Pages (JSP), Apache Struts, Microsoft's ASP.NET, PHP, etc.

Classification of server-side Web technologies:

 Microsoft's server-side Web technologies: ASP.NET

 Sun's server-side Web technologies: Java Servlets, Java Server Pages (JSP), Struts

 Other server-side Web technologies: Common Gateway Interface (CGI), PHP

5:1.1 Advantages of web technologies

 They offer convenience and a high speed of communication in the computer world. Whether in the office or the home, processes using a computer are swifter and more straightforward with the use of a network.

 They allow messages to be sent around a system, whereas before it may have been necessary to employ a runner or leave your workspace to communicate a message.

 Web technology reduces costs and makes a company more efficient, raising business potential.

5:1.2 Disadvantages of web technologies

 Matters involving web technology can be very complicated, and it would be difficult for someone without relevant experience to sort out a network problem. This means it is necessary to employ someone with the specific skills to solve network issues, which costs money.

 Network security is another issue that must be considered when using web technology. Because weaknesses in a network can be exploited, important information could be stolen or destroyed and malware could infect the various network systems.

 The existence of a network provides the opportunity for an attack on the computer system.

5:2 Web Browser

5.2.1 Meaning of Web Browser

A Web browser, also called a browser, is the program people use to access the World Wide Web. It interprets HTML code, including text, images, hypertext links, JavaScript and Java applets. After rendering the HTML code, the browser displays a nicely formatted page. The most popular Web browsers are Internet Explorer, Mozilla Firefox, Google Chrome, Opera and Apple Safari. All of these browsers are free and, except for IE, which is Windows-only, they run on both Windows and Mac. Some browsers also run under Linux.

5.2.2 Basic Functions of Web Browser

The basic function of a web browser is to retrieve a remote file from a web server and render it on the user's computer.

 To locate the file on the web server, the browser needs a Uniform Resource Locator (URL). The URL can be typed in by the user, it can be a link within an HTML page, or it can be stored as a bookmark.

 From the URL, the browser extracts the protocol. If it is HTTP, the browser then extracts from the URL the domain name of the computer on which the file is stored. The browser sends a single HTTP message, waits for the response, and closes the connection.

 If all goes well, the response consists of a file and a MIME type. To render the file on the user's computer, the browser examines the MIME type and invokes the appropriate routines. These routines may be built into the browser or may be an external program invoked by the browser.

All browsers offer similar features, no matter which computer they run on. The way users interact with a Web page has more to do with the page than with the browser. Web pages contain embedded programs that turn them into applications not much different from the software users install on their own computers.
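The retrieve-then-examine-MIME-type cycle described above can be sketched in a few lines. This is an illustrative simplification of what a browser does, not browser source code, and the URL is an arbitrary example:

```python
from urllib.request import urlopen

# Fetch a remote file the way a browser would: one HTTP request, one response.
with urlopen("http://example.com/") as response:
    mime_type = response.headers.get_content_type()  # e.g. "text/html"
    body = response.read()

# The browser would now pick a rendering routine based on the MIME type.
if mime_type == "text/html":
    print("render with the HTML interpreter")
else:
    print(f"hand off to a helper application for {mime_type}")
```

The final branch is precisely where the extensibility discussed in the next section comes in: unknown MIME types are routed to helper applications or plug-ins rather than rendered by the browser itself.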

5.2.3 Extending Browsers beyond the Web

Browsers were developed for the web, and every browser supports the web's basic protocols and a few standard formats. However, browsers can be extended to provide other services while retaining the browser interface. Much of the success of browsers, of the web, and indeed of the whole Internet is due to this extensibility. Mosaic and its successors have had the same three types of extensibility: one for data types, one for protocols and one for the execution of programs.

 Data types

 With each data type, browsers associate routines to render files of that type. A few types have been built into all recent browsers, including plain text, HTML pages and images in GIF format, but users can add additional types through mechanisms such as helper applications and plug-ins.

 A helper application is a separate program that is invoked for selected types of data. The source file is passed to the helper as data. For example, browsers do not have built-in support for files in the PostScript format, but many users have a PostScript viewer on their computer which is used as a helper application. When a browser receives a file of type PostScript, it starts this viewing program and passes to it the file to be displayed.

 A plug-in is similar to a helper application, except that it is not a separate program. It is used to render source files of non-standard formats, within an HTML file, in a single display.

 Protocols

 HTTP is the central protocol of the web, but browsers also support other protocols. Some, such as Gopher and WAIS, were important historically because they allowed browsers to access older information services. Others, such as NetNews, electronic mail and FTP, remain important.

 A weakness of most browsers is that the list of protocols supported is fixed and does not allow for expansion. Thus, there is no natural way to add Z39.50 or other protocols to browsers.

 Execution of programs

 An HTTP message sent from a browser can do more than retrieve a static file of information from a server. It can run a program on a server and return the results to the browser.

 The earliest method of achieving this was the Common Gateway Interface (CGI), which provides a simple way for a browser to execute a program on a remote computer. CGI programs are often called CGI scripts. CGI is the mechanism that most web search programs use to send queries from a browser to the search system. Publishers store their collections in databases and use CGI scripts to provide user access.

 An informal interpretation of the URL http://www.dlib.org/cgi-bin/seek?author=Arms is: "On the computer with domain name www.dlib.org, execute the program in the file cgi-bin/seek, pass it the parameter string author=Arms, and return the output". The program might search a database for records having the word Arms in the author field.

 The earliest uses of CGI were to connect browsers to older databases and other information services. By a strange twist, now that the web has become a mature system, the roles have been reversed. People who develop advanced digital libraries often use CGI as a method to link the old system (the web) to their newer systems.
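A minimal sketch of the server side of the seek example above. Under CGI, the web server places the query string in the QUERY_STRING environment variable and returns whatever the script prints (after a Content-Type header) to the browser; the script name and the database lookup are illustrative assumptions:

```python
#!/usr/bin/env python3
# Hypothetical cgi-bin/seek script, invoked by the web server for
# a request like /cgi-bin/seek?author=Arms
import os
from urllib.parse import parse_qs

params = parse_qs(os.environ.get("QUERY_STRING", ""))
author = params.get("author", [""])[0]

# A real script would query the publisher's database here.
print("Content-Type: text/html\n")
print(f"<html><body>Records with author: {author}</body></html>")
```

The simplicity of this contract, environment variables in and standard output back, is what made CGI such an easy bridge between the web and older database systems.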


5.2.4. Parts of Web Browser

(Figure: an annotated browser window showing the Menu, Function Icons, the Minimize/Maximize/Close buttons, the Web Address Area, the Search Engine, the Keyword Search Area, the Start Menu and the Task Bar.)

Menu:

 These menus allow you to perform various tasks, such as printing, saving a favourite web page and setting your Internet options.

Function Icons:

 Buttons that allow you to quickly perform tasks, such as print, refresh your web page, go back to the last page viewed or go forward to the next page viewed.

Minimize, Maximize, Close Buttons:

 Allow you to make your browser smaller or larger, or to close your browser completely.

Web Address Area:

 Type the web site address (or name) in this box to go to that web site page, then click Go or hit the Enter key on your keyboard, e.g. www.act.org/discover/login.

Search Engine:

 A large database containing information on millions of web sites, which allows you to enter keywords to locate a site that offers the product or service that you seek.

Keyword Search Area:

 Type keywords describing the type of information that you want to find.

Start Menu:

 This menu helps you open the various computer software programs that you want to use, such as Microsoft Word or Internet Explorer.

Task Bar:

 This lists all the programs that you currently have open. Click on one of the icons to maximize a particular program on your screen and continue working in that program.

5:2.1 Internal architecture of a web browser

(Figure: internal architecture of a web browser: (1) controller; (2) HTTP client; (3) HTML interpreter; (4) other interpreters; (5) Java interpreter; (6) optional interpreters; (7) network interface card (NIC); (8) driver program for screen I/O.)

A browser contains pieces of software that are mandatory and some that are optional, depending upon the usage. The HTTP client program, shown in the figure above as (2), and the HTML interpreter program (3) are mandatory. The other interpreter programs (4), the Java interpreter program (5) and the other optional interpreter programs (6) are optional. The browser also has a controller, shown as (1), which manages all of them. The controller is like the control unit in a computer's CPU. It interprets both mouse clicks/selections and keyboard inputs. Based on these inputs, it calls the rest of the browser's components to perform specific tasks. For example, when a user types a URL, the controller calls the HTTP client program to fetch the requested Web page from the remote Web server whose address is given by the URL. When the web page is received, the controller calls the HTML interpreter to interpret the tags and display the Web page on the screen.

5:3 Mark-up language

The term "mark-up" is derived from the publishing practice of "marking up" a manuscript, which involves adding handwritten annotations, in the form of conventional symbolic printer's instructions, in the margins and text of a paper manuscript or printed proof. For centuries, this task was done primarily by skilled typographers known as "mark-up men" or "copy markers", who marked up text to indicate what typeface, style and size should be applied to each part, and then passed the manuscript to others for typesetting by hand. Mark-up was also commonly applied by editors, proofreaders, publishers and graphic designers, and indeed by document authors.

A mark-up language is a modern system for annotating a text in a way that is syntactically distinguishable from that text. Mark-up is typically omitted from the version of the text which is displayed for end-user consumption. Some mark-up languages, like HTML, have presentation semantics, meaning their specification prescribes how the structured data is to be presented, but other mark-up languages, like XML, have no predefined semantics. A well-known example of a mark-up language in widespread use today is HyperText Mark-up Language (HTML), one of the document formats of the World Wide Web. HTML is mostly an instance of SGML and follows many of the mark-up conventions used in the publishing industry in the communication of printed work between authors, editors and printers.

5:3.1 Types of Mark-up Language

There are three general categories of electronic mark-up: presentational, procedural and descriptive.

Presentational mark-up is that used by traditional word-processing systems: binary codes embedded in document text that produce the WYSIWYG ("what you see is what you get") effect. Such mark-up is usually designed to be hidden from human users, even those who are authors or editors.

Procedural mark-up is embedded in text and provides instructions for programs that are to process the text. Well-known examples include troff, LaTeX and PostScript. Popular procedural mark-up systems include programming constructs, so that macros or subroutines can be defined and invoked by name.

Descriptive mark-up is used to label parts of the document, rather than to provide specific instructions as to how they should be processed. An example of descriptive mark-up would be HTML's <cite> tag, which is used to label a citation.

5:3.2 Four Languages

 XML:

 XML stands for Extensible Mark-up Language and is much like HTML structurally.

 XML is based on both SGML and HTML.

 XML was designed to carry data, not to display data.

 XML tags are not predefined.

 XML is designed to be self-descriptive (see the sketch after this list).

 HTML:

 HTML stands for Hypertext Mark-up Language, and an HTML document is a plain text file which needs only a simple text editor to create the tags.

 It is platform-independent, which means HTML documents are portable from one computer system to another.

 HTML is the most widely used mark-up language for web-based documents.

 HTML was designed to display data, with a focus on how the data looks.

 Elements in HTML consist of alphanumeric tokens within angle brackets, such as <html>, <head>, <body>, etc. Most elements consist of paired tags: a start tag and an end tag. For example, <p> is a start tag and </p> is an end tag.

 SGML:

 SGML stands for Standard Generalized Mark-up Language.

 SGML is a meta-language which gives rise to other mark-up languages.

 Programs based on SGML are very complex and expensive.

 XHTML:

 XHTML stands for Extensible Hypertext Mark-up Language.
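To illustrate the "carry data, not display data" point, the short sketch below defines a tiny XML document whose tags are invented for the data at hand (none of them is predefined anywhere) and reads it back with Python's standard-library parser:

```python
import xml.etree.ElementTree as ET

# Tags like <catalogue> and <title> are our own invention: XML lets the
# author define tags that describe the data rather than its appearance.
document = """
<catalogue>
  <book>
    <title>Digital Libraries</title>
    <author>Arms</author>
  </book>
</catalogue>
"""

root = ET.fromstring(document)
for book in root.findall("book"):
    print(book.findtext("title"), "-", book.findtext("author"))
```

Nothing in the document says how a title should look on screen; any presentation is supplied separately, which is the essential contrast with HTML's display-oriented tags.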

5.4 Search Engines

A search engine is a web-based information retrieval system that helps users retrieve information from the huge Internet database. It is a kind of tool that crawls the web according to user direction and records everywhere and everything users look for. The search engine software is a kind of information retrieval program; it has two major tasks: searching through the billions of terms recorded in the index to find matches to a search, and ranking the retrieved records in order to decide which are most relevant.

5.4.1 Components of Search Engine

 Crawler (or Spider): A program that traverses the web from link to link, identifying and reading pages. It works as a network surfer and downloads a searched web site to local disk.

 A web crawler is a kind of computer program that browses the Web in a methodical, automated way. This process is called web crawling or spidering. Search engines use spiders to provide up-to-date information.

 The most important aim of a web crawler is copying all visited web pages for later searches, to make subsequent searches faster.

 Web crawlers can also be used for automating maintenance tasks on a web site, such as checking links or validating code.

 Web crawlers are also used to collect specific information from Web pages; they find millions of documents and help IR systems to retrieve the correct information in an easy way.

 Sometimes a crawler can also find information which has been hidden by a website owner or webmaster. Because of this, web crawlers have to work according to the robots exclusion protocol (see the sketch after this list).

 Web crawlers may also work as link checkers, page-change monitors, validators, file transfer protocol clients or web browsers.
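A minimal sketch of a crawler honouring the robots exclusion protocol: before fetching a page, it consults the site's robots.txt via Python's standard-library parser. The URL and the user-agent name are arbitrary examples:

```python
from urllib import robotparser
from urllib.request import urlopen

AGENT = "course-demo-bot"  # hypothetical user-agent name

# The robots exclusion protocol: a site's robots.txt states what may be crawled.
robots = robotparser.RobotFileParser("http://example.com/robots.txt")
robots.read()

url = "http://example.com/"
if robots.can_fetch(AGENT, url):
    page = urlopen(url).read()   # download the page for later indexing
    print(f"fetched {len(page)} bytes from {url}")
else:
    print(f"robots.txt forbids {AGENT} from fetching {url}")
```

A full crawler would repeat this check for every link it extracts from each downloaded page, feeding the results to the index described next.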

5.4.1.1 Index

A web database containing a copy of each web page gathered by the spider.

5.4.1.2 Search Engine Mechanism

Software that enables users to query the index, and that usually returns results in relevancy-ranked order.

5.4.2 Types of search engines

A search engine downloads all the information that a page contains and then examines that information to index keywords and phrases that can be used to categorise the site. The exact method it uses to do this, and which information it looks at to create the index, varies according to the search engine. Those words and phrases are added to the database alongside the URL and a description of the site. There are three types of search engine.

5.4.2.1 Active Search Engine

An active search engine collects web page information by itself. It uses a program called a "spider" or "web robot" to index and categorise Web pages as well as Web sites. The spider travels around the WWW in search of new sites and adds entries to the catalogue.

5.4.2.2 Passive Search Engine or Subject Directories

Search engines of this type are possibly more accurately referred to as directories. They do not seek out web pages by themselves; they rely on WWW users to submit details of their site or their favourite sites in order to build up a database. For example, the Yahoo Directory (www.yahoo.com) has 14 main subject categories, each of these categories has many sub-categories, and those sub-categories contain their own sub-categories. Hierarchically organized directories tend to be smaller than the databases of the search engines, which means that result lists tend to be smaller as well. Because subject directories are arranged by category, and because they usually return links to the top level of a Website rather than to individual pages, they lend themselves best to searching for information about a subject rather than for a specific piece of information. Due to the size of the web and its constant transformation, keeping up with important sites in all subject areas is humanly impossible. Therefore, a guide by a subject specialist to important resources in his area of expertise is more likely than a general subject directory to produce relevant information, and is usually more comprehensive than a general guide. These guides are known as Specialized Subject Directories. Such guides exist for virtually every topic. For example:

 Voice of the Shuttle (http://vos.ucsb.edu) provides an excellent starting point for humanities research.

 Film buffs should consider starting their search with the Internet Movie Database (http://us.imdb.com).

5.4.2.3 Meta Search Engine

The increasing number of search engines has led to the creation of "meta" search engine tools, often referred to as multi-threaded search engines. A meta search engine does not catalogue any web pages by itself; it simultaneously searches multiple search engines. When a query is put to this type of search engine, it forwards the query to other search engines. There are two types of meta search engine:

 One type searches a number of engines and does not collate the results. This means one must look through a separate list of results from each engine that was searched, and the same result may be presented more than once. Some engines require the searcher to visit each site to view the results, while others fetch the results back to their own sites. When the results are brought back to the site, a certain limitation is placed on what is allowed to be retrieved. With this type of meta search engine one can retrieve comprehensive, and sometimes overwhelming, results. An example of this type of engine is Dogpile.

 The other type is more common and returns a single list of results, often with the duplicate hits removed. This type of meta engine always brings the results back to its own site for viewing. In these cases, the engine retrieves a certain number of documents from the individual engines it has searched and cuts off after a certain point as the search is processed. Other meta search engines stop processing a query after a certain amount of time. Still others give the user a certain degree of control over the number of documents returned in a search. All these factors have two implications:

 These meta search engines return only a portion of the documents available to be retrieved from the individual engines they have searched.

 Results retrieved by these engines can be highly relevant, since they usually grab the first items from the relevancy-ranked lists of hits returned by the individual search engines.

Some examples of meta search engines are:

 Metacrawler (www.metacrawler.com)

 Surfwax (www.surfwax.com)

 Zapmeta (www.zapmeta.com)
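The second, collating type of meta search engine can be sketched as follows. The two engine functions are stand-ins for real search services (purely hypothetical), but the forwarding, merging and de-duplication logic is the behaviour the text describes:

```python
def search_engine_a(query: str) -> list[str]:
    # Stand-in for one engine's relevancy-ranked result URLs.
    return ["http://a.example/1", "http://shared.example/x"]

def search_engine_b(query: str) -> list[str]:
    return ["http://shared.example/x", "http://b.example/2"]

def meta_search(query: str, top_n: int = 5) -> list[str]:
    """Forward the query to several engines, merge, and drop duplicates."""
    merged, seen = [], set()
    for engine in (search_engine_a, search_engine_b):
        for url in engine(query)[:top_n]:   # cut off after the top hits
            if url not in seen:             # remove duplicate hits
                seen.add(url)
                merged.append(url)
    return merged

print(meta_search("digital libraries"))
```

Because only the top few hits from each engine are taken, the merged list covers just a portion of what each engine could return, which is exactly the trade-off noted in the two implications above.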

5.4.2.4 Semantic Search Engines

Some semantic search engines are:

 Semantic Web Search Engine (SWSE): The Semantic Web Search Engine is a search engine for the RDF Web, and provides the equivalent of the services a search engine currently provides for the HTML Web. The system explores and indexes the Semantic Web and provides an easy-to-use interface through which users can find the information they are looking for. Because of the inherent semantics of RDF and other Semantic Web languages, the search and information retrieval capabilities of SWSE are potentially much more powerful than those of current search engines.

 Sindice: Sindice is a lookup index for Semantic Web documents built on data-intensive cluster computing techniques. Sindice indexes the Semantic Web and can tell which sources mention a resource URI, IFP or keyword, but it does not answer triple queries. Sindice currently indexes over 20 million RDF documents.

 Watson: Watson allows you to search through ontologies and semantic documents using keywords. At the moment, you can enter a set of keywords (e.g. "cat dog old lady") and obtain a list of URIs of semantic documents in which the keywords appear as identifiers or in literals of classes, properties and individuals.

 Yahoo Microsearch: Microsearch is Yahoo!'s stab at Semantic Web search and provides a richer search experience by combining traditional search results with metadata extracted from Web pages. It indexes RDF and microformats crawled from the Web. Microsearch will soon be adding support for GRDDL.

 Falcons: Falcons is a keyword-based search engine for the Semantic Web, equipped with browsing capability. Falcons provides keyword-based search for URIs identifying objects and concepts (classes and properties) on the Semantic Web. Falcons also provides a summarization of each entity (object, class, property) for rapid understanding. Falcons currently indexes 7 million RDF documents and allows you to search through 34,566,728 objects. Developed by IWS China.

 Semantic Web Search

Powered by RDF Gateway, Intellidimension's proprietary platform for Semantic Web applications and agents. Developed by Intellidimension Inc.

 Zitgist Search: The Zitgist Query Service simplifies the semantic query construction process with an end-user-friendly interface. The user need not conceive of all relevant characteristics: appropriate options are presented based on the current shape of the query. Search results are displayed through an interface that enables further discovery of additional related data, information and knowledge. Users describe characteristics of their search target, instead of relying entirely on content keywords.

 Swoogle: Swoogle searches through over 10,000 ontologies, with 2.3 million RDF documents indexed, currently including those written in RDF/XML, N-Triples and N3 (RDF), and some documents that embed RDF/XML fragments. Currently, it allows you to search through ontologies, instance data and terms (i.e., URIs that have been defined as classes and properties). Not only that, it provides metadata for Semantic Web documents and supports browsing the Semantic Web. Swoogle also archives different versions of Semantic Web documents. Developed by the Ebiquity Group of UMBC.

http://swoogle.umbc.edu/

 Hakia

Hakia is a "meaning-based" search engine startup getting a bit of buzz. It is a venture-backed, multi-national company headquartered in New York, and curiously has former US senator Bill Bradley as a board member. It launched its beta in early November this year, but already ranks around 33K on Alexa, which is impressive.

http://www.readwriteweb.com/archives/hakia_meaning-based_search.php

5.4.2.5 RSS Search Engines

RSS search concerns the specialized search tools that help you locate content in blogs, feeds and other sources of information. Many people mistakenly refer to RSS search as "blog search". While it is true that many blogs offer RSS feeds (automatic feed creation is a feature of most blogging software), not all blogs have feeds. Furthermore, RSS can be used with just about any kind of web-based content. RSS fundamentally is a relatively simple specification that uses XML to organize and format web-based content in a standard way.

 Blog Search Engines: Blog search is search technology focused on blogs, reflecting a strong belief in the self-publishing phenomenon represented by blogging. Blog search helps users to explore the blogging universe more effectively, and perhaps inspires many to join the revolution themselves.

 Google Blog Search: Its search function is an excellent way to find blogs: keywords are used just as they would be for a standard search, and the results can be sorted by date.

 Technorati (http://www.technorati.com): Technorati tracks over 100 million blogs and over 250 million pieces of tagged social media, which means Technorati provides extremely comprehensive blog search results.

 Sphere: Sphere is a great blog search engine that provides users the opportunity to sort results by time and relevance, and also provides links to content related to your search. One of its best features allows users to view their search history.

 Ice Rocket (http://www.icerocket.com/?tab=blog):

Ice Rocket offers some very unique and helpful features. First, you can enter your keywords and then search within blogs, the web, MySpace, news or images. Second, you can view the popularity of your keyword search using the Ice Rocket Blog Trends tool.

 Bloglines (http://www.bloglines.com): Bloglines is a blog search engine and a feed reader. It provides features that allow users to search for and subscribe to news feeds and blogs. Users can search for posts, feeds or citations.

 Blogpulse (http://www.blogpulse.com): Blogpulse offers a wide variety of tools to help users find blogs and information, including Buzz-tracker, trends search, blogger profiles, a conversation tracker and more.

 Blog Catalog: Blog Catalog is a social blog directory where anyone can search for information from blogs that have been submitted to the catalog.

5.4.2.6 E-Books Search Engines

 Ebooks Engines - http://www.ebook-engine.com/

 EBdb Search Engine - http://www.ebdb.net/

 Elibrary - http://e-library.net/

 PDF eBook search engine - http://www.pdf-search-engine.com/

 Ebooksbay - http://www.ebooksbay.org/

 Esnips Ebooks search engine - http://www.esnips.com/web/ebooksearchengine

 Fizziebooks - http://www.fizziebooks.com/

5.4.2.7 E-Journals Search Engines

 EEVL E-Journal Search Engine (EESE) - http://www.intute.ac.uk/sciences/ejournals.html

 Scirus - http://www.scirus.com/

 OJOSE - http://www.ojose.com/

5.4.2.7.1 EEVL E-Journal Search Engine (EESE)

EESE searches the content of over 350 freely available full-text science, engineering and technology e-journals, selected for relevance and quality. Academic journals, trade publications, newsletters and society journals are covered. All sites are also listed in the Intute: Science, Engineering and Technology catalogue of Internet resources.

http://www.intute.ac.uk/sciences/ejournals.html

5.4.2.7.2 Scirus E-Journals Search Engine
Scirus is the most comprehensive scientific research tool on the web. With over 450 million scientific items indexed at last count, it allows researchers to search not only journal content but also scientists' homepages, courseware, pre-print server material, patents, and institutional repository and website information.

http://www.scirus.com/

5.4.2.7.3 OJOSE
OJOSE (Online Journal Search Engine) is a free, powerful scientific search engine that enables you to run queries against different databases from a single search field. With OJOSE you can find, download or buy scientific publications (journals, articles, research reports, books, etc.) in up to 60 different databases.
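Conceptually, this kind of federated search fans one query out to several databases and merges the answers. The skeleton below is a minimal sketch of the idea only; the endpoint URLs and the JSON response format are assumptions for illustration, not OJOSE's actual interfaces:

import json
import urllib.parse
import urllib.request

# Hypothetical database endpoints, each assumed to return a JSON list of records.
SOURCES = {
    "database_a": "http://example.org/db-a/search?q=",
    "database_b": "http://example.org/db-b/search?q=",
}

def federated_search(query):
    """Send one query to every source and merge the results."""
    merged = []
    encoded = urllib.parse.quote(query)
    for name, base_url in SOURCES.items():
        try:
            with urllib.request.urlopen(base_url + encoded, timeout=10) as resp:
                records = json.load(resp)
        except OSError:
            continue  # skip a source that is unreachable instead of failing
        for record in records:
            record["source"] = name  # remember which database the hit came from
            merged.append(record)
    return merged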


http://www.ojose.com/

5.4.2.8 ETD Search Engine
Elsevier, a world-leading publisher of scientific, technical and medical information products and services, together with Scirus, conducted the first NDLTD-ETD Awards ("NDLTD-ETD Awards Powered by Scirus") in partnership with the NDLTD (Networked Digital Library of Theses and Dissertations), the international organization dedicated to promoting the dissemination and preservation of electronic theses and dissertations. The awards seek to recognize outstanding contributions to the body of electronically available ETD research.


http://www.ndltd.org/serviceproviders/scirus-etd-search

The vast amount of information available on the Internet can make searching a long, complicated process. Specialized search engines provide a more productive search by:
 Focusing only on sites with subject-specific data.
 Searching the "deep" web.
 Filtering out irrelevant data.
Library and information science professionals should be aware of such invisible web tools, which save time and give quick access to quality information.
5.4.3 Search engines can further be categorized by scope:
 General Search Engine: Covers a wide range of services and facilities, including Boolean search. Examples: Google, AltaVista.
 Regional Search Engine: Refers to country-specific search engines for locating varied resources region-wise. Examples: Euro Ferret (Europe), Excite UK (UK), etc.
 Subject-Specific Search Engine: Does not attempt to index the entire Web. Instead it focuses on searching for websites or pages within a defined subject area, geographical area or type of resource. Examples: Geo Index (Geography/Environmental Science), Biochemistry Easy Search Tool (Biochemistry). Because these engines aim for depth of coverage within a single area rather than breadth of coverage across subjects, they are often able to index documents that are not included even in the largest search engines' databases. Some examples of subject-specific and regional search engines are:

 www.123india.com - Regional
 www.in.altavista.com - Regional
 www.yahoo.co.uk - Regional
 www.naukri.com - Employment
 www.ndtv.com - News
 www.zipcode.com - Weather
 www.khoj.com - India-specific
5.4.4 Features of Search Engines
Search engines offer numerous features:

 When using a web search engine and entering more than one word, the space between the words has a logical meaning that directly affects the result of the search. This is known as the default syntax. For example, in AltaVista, Infoseek and Excite, a search on the words birds migration means that the searcher will get back documents that contain either the word "birds" or the word "migration", or both. The space between the words defaults to the Boolean OR. This is probably not what the searcher wanted for this search.
 Search engines return results in ranked order. Most search engines use various criteria to construct a term relevancy rating of each hit and present the search results in this order. Criteria can include: search terms in the title, URL, first heading or HTML tag; the number of times search terms appear in the document; search terms appearing early in the document; and search terms appearing close together. Google's page ranking algorithm displays the most cited/hyperlinked websites and web pages at the top of the screen. (A small scoring sketch follows this list.)
 One of the most interesting developments is the organisation of search results by concept, site, domain or popularity rather than by relevancy. Search engines that employ this alternative may be thought of as second-generation search services. For example:
 Direct Hit ranks according to the sites other searchers have chosen from their results to similar queries.
 Google ranks by the number of links from pages ranked high by the service.
 Inference Find ranks by concept and top-level domain.
 Northern Light sorts results into Custom Search Folders representing concepts and/or types of sites.
 Often multiple pages are retrieved from a single site because they all contain the given search term. AltaVista, Infoseek, HotBot, Northern Light and Lycos avoid this by a technique called results grouping, whereby all the pages from one site are clustered together into one result, with the opportunity to view all the retrieved pages from that chosen site. With these engines, one may get a smaller number of results from a search, but each result comes from a different site.

 Search engines do not index all the documents available on the web. For example, most search engines cannot index files on password-protected sites, behind firewalls, or configured by the host server to be left alone (see the robots.txt sketch after this list). Other web pages may not be picked up if they are not linked to other pages, and are therefore missed by a search engine spider as it crawls from one page to the next. Search engines also rarely contain the most recent documents posted to the Internet, so do not look for yesterday's news on a search engine.
 Contents of databases will generally not show up in a search engine result. A growing amount of valuable information on the web is generated from databases. This aspect of the web is sometimes referred to as "the invisible Web", because database content is "invisible" to search engine spiders.
 Some search engines allow users to view a display of the retrieved websites and web pages clustered under different topics related to the search term(s). Examples include Kartoo (http://www.kartoo.com), Vivisimo (http://www.vivisimo.com), etc.
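To make the relevancy criteria listed above concrete, here is a minimal scoring sketch in Python. The weights and the scoring function are arbitrary illustrative choices, not the actual values used by any search engine:

def score(document, title, terms):
    """Rate one document against the search terms (OR semantics by default)."""
    doc_words = document.lower().split()
    title_words = title.lower().split()
    total = 0.0
    for term in (t.lower() for t in terms):
        occurrences = doc_words.count(term)
        if occurrences == 0:
            continue  # OR semantics: a missing term simply adds nothing
        total += occurrences               # frequency of the term in the document
        if term in title_words:
            total += 5.0                   # search term appears in the title
        total += 2.0 / (1 + doc_words.index(term))  # bonus for appearing early
    return total

# Two toy documents; the one about bird migration should rank first.
docs = {
    "A": ("Bird migration routes", "migration of arctic birds follows fixed routes"),
    "B": ("Garden visitors", "some birds visit gardens in winter"),
}
terms = ["birds", "migration"]
for name in sorted(docs, key=lambda n: score(docs[n][1], docs[n][0], terms),
                   reverse=True):
    print(name, round(score(docs[name][1], docs[name][0], terms), 2))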
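Pages "configured by the host server to be left alone" are normally declared in the site's robots.txt file, which well-behaved spiders consult before crawling. A minimal sketch using Python's standard urllib.robotparser module, with a hypothetical site URL:

import urllib.robotparser

# Hypothetical site; a real spider would do this for every host it visits.
parser = urllib.robotparser.RobotFileParser()
parser.set_url("http://example.com/robots.txt")
parser.read()  # download and parse the robots.txt rules

# can_fetch() answers: may this user agent request this page?
print(parser.can_fetch("MyCrawler", "http://example.com/private/report.html"))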

5.4.5 Major Search Engines - Feature Guide
A web search engine is designed to search for information on the World Wide Web. The results, often called hits, are generally presented in a list and may consist of web pages, images and other types of files. Some engines also mine data available in databases or open directories. Unlike web directories, which are maintained by human editors, search engines operate algorithmically or as a mixture of algorithmic and human input. The feature guide below compares four major engines: Ask (www.ask.com), Bing (www.bing.com), Google (www.google.com) and Yahoo (www.yahoo.com). For almost all the features indicated, corresponding options are also available on each engine's advanced search page.

Boolean
 Ask: term term (defaults to an AND); -term (for NOT); term OR term
 Bing: term term (defaults to an AND); -term (for NOT); (term OR term)
 Google: term term (defaults to an AND); -term (for NOT); term OR term
 Yahoo: term term (defaults to an AND); -term (for NOT); (term OR term)

Phrase
 Ask: " "
 Bing: " "
 Google: " "; some automatic stemming
 Yahoo: " "; some automatic stemming

Title field
 Ask: intitle:term
 Bing: intitle:term
 Google: intitle:term; allintitle:term1 term2
 Yahoo: intitle:term

Site/URL field
 Ask: inurl:term; site:term
 Bing: site:term
 Google: inurl:term; allinurl:term; site:term
 Yahoo: site:term

"Links to" a URL
 Google: link:term
 Yahoo: for link searching, go to siteexplorer.search.yahoo.com

File type
 Bing: filetype:extension
 Google: filetype:extension
 Yahoo: filetype:extension

Language
 Ask: choice of 6 languages on the Advanced Search page
 Bing: language:xx; choice of 41 languages on the Advanced Search page
 Google: choice of 44 languages on the Advanced Search page
 Yahoo: language:xx; choice of 41 languages on the Advanced Search page

Numeric ranges
 Google: nnnn..mmmm, e.g. 1850..1899

Media searching
 Ask, Bing, Google, Yahoo: Images search; Video search

Similar pages
 Google: yes

Also shown on results pages
 Ask: link to cached page; related searches; shortcuts, etc.
 Bing: link to cached page; related searches; search history; stock quotes options; news headlines
 Google: "Show options"; link to cached page; translations; stock quote option; news headlines; link to definitions, news, video, blog results; addresses/phone #s
 Yahoo: link to cached page; related topics; stock quote option; news headlines

Outstanding special features
 Bing: hasfeed:; inanchor:; ip:; location:
 Google: Maps covers many formats; numeric range search; equation solver; Language Tools
 Yahoo: "My Yahoo"; hasfeed:; inanchor:; ip:; location:

Other searchable databases
 Ask: Maps, News, Shopping, Q&A, TV Listings, Events, Recipes, Blogs
 Bing: Maps, News, Shopping, Travel, Twitter, Visual Search
 Google: Maps, News, Shopping, Books, Groups, Scholar (journals), Blogs, Earth, Patents, Code
 Yahoo: Maps, News, Shopping, Answers, Directory, Creative Commons, Jobs, People, Travel

Today, Google serves as the primary information resource for internet users. Statistics on search engine use over the last four years show that Google and Yahoo! have led the list of top search engines since 2006, with Google first and Yahoo second. Between 2006 and 2008, MSN/Live was the third most used search engine, but in 2009 it gave way to Microsoft's new and successful search engine, Bing. These ranks are determined by the preferences of users.
5.4.6 Functions of Search Engines
There are differences in the ways various search engines work, but they all perform three basic tasks (illustrated in the sketch below):
 They search the Internet using specialized software called a crawler or robot; this software finds web pages by following hyperlinks.
 The crawler sends a cached version of each web page to the repository of the search engine (SE), and the SE keeps an index of the words it finds and where (at which URL) it finds them.
 They allow users to look for words, or combinations of words, found in that index.
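The crawl-and-index cycle can be pictured with a toy inverted index, which maps every word to the URLs of the pages containing it. In this minimal sketch the "crawled" pages are hypothetical in-memory strings rather than documents fetched by following hyperlinks:

from collections import defaultdict

# Stand-ins for crawled pages; a real crawler would fetch these by following links.
pages = {
    "http://example.com/a": "digital libraries store collections online",
    "http://example.com/b": "search engines index web collections",
}

# Build the index: every word maps to the set of URLs where it occurs.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Answering a query is a lookup in the index rather than a scan of the pages.
print(sorted(index["collections"]))  # -> both URLs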

Significance of Search Engines
 Users want the right information at the right time, for the right cause, in the right format, at the right place.
 Search engines are the easiest way to group together and present the numerous types of information and services available to all kinds of users.
 They improve user-friendliness and enable convenient access to the different kinds of information and services mounted on the web.

References:
1. Babu, V. Ramesh, & Kavitha, A.B. (2008). Challenges of digital library initiatives and management. National Conference on Changing Dimensions in Library Resources and Services in the Digital Era, Kattankulathur, 333-336.

2. Bearman, D. (2007). Digital libraries. Annual Review of Information Science and Technology, 41, 223-272.

3. Bhattacharya, P. (2004). Advances in digital library initiatives: a developing country perspective. The International Information & Library Review, 36(3), 165–175.

4. Capra, R., & Pérez-Quiñones, M.A. (2005). Using web search engines to find and refind information. IEEE Computer, 38(10), 36-42.

5. Choudhury, G.G. (2004). Access to information in digital libraries: users and digital divide. International Conference on Digital Libraries: Knowledge Creation, Preservation, Access and Management, New Delhi, (1), 56-64.

6. Das, T.K., Sharma, A.K., & Gurey, P. (2009). Digitization, strategies & issues of digital preservation: an insight view to Visva-Bharati Library. 7th International CALIBER, Pondicherry University, Puducherry: INFLIBNET Centre, Ahmedabad.

7. Das, U. (2006). Importance of preserving historical materials of Assam. New Spectrum, 2(1), 15-24.

8. Ganzha, M., et al. (2010). Combining information from multiple search engines - preliminary comparison. Information Sciences, 180(10), 1908-1923. Retrieved on December 14, 2012 from http://dx.doi.org/10.1016/j.ins.2010.01.010

9. Gupta, S., & Singh, G. (2006). Management of digital libraries: issues and strategies. Journal of Library and Information Science, 13(1).

10. ICPSR. Digital preservation strategies. Retrieved on December 10, 2012 from http://www.icpsr.umich.edu/dpm/dpm-eng/terminology/strategies.html

11. Jain, P.K., & Babbar, P. (2006). Digital libraries initiatives in India. The International Information & Library Review, 38, 161-169.

12. Jones, C., Zenios, M., & Markland, M. (2003). Digital Resources in Higher Education: Pedagogy and Approaches to the use of digital resources in Teaching and Learning, Northern Ireland. Retrieved from http://www.ifets.info/journals/10_1/6.pdf.


13. Jones, C., Zenios, M., & Griffiths, J. (2004). Academic use of digital resources: Disciplinary differences and the issue of progression. In Banks, S., Goodyear, P., Hodgson, V., Jones, C., Lally, V., McConnell, D. & Steeples, C. (Eds.), Networked Learning 2004: Proceedings of the Fourth International Conference on Networked Learning 2004, Lancaster: Lancaster University and University of Sheffield, 222-229, retrieved on December 13, 2012, from http://www.networkedlearningconference.org.uk/.

14. Kaur, P., & Singh, S. (2005). Transformation of traditional libraries into digital libraries: a study in the Indian context. Herald of Library Science, 44(1–2), 33–39.

15. Kemp, B., & Jones, C. (2007). Academic Use of Digital Resources: Disciplinary Differences and the Issue of Progression revisited. Educational Technology & Society, 10 (1), 52-60.

16. Malaghan, C.A., & Chowdappa. (2012). Tools and techniques to explore search engines and databases. National Conference on Trends in Developing & Managing E-resources in Libraries, 184-185.

17. Mahesh, G., & Mittal, R. (2008). Digital libraries in India: a review. National Institute of Science Communication and Information Resources, New Delhi, 58, 15–24.

18. Meitei, L.S., & Devi, P. (2009). Open source initiative in digital preservation: the need for an open source digital repository and preservation system. 7th International CALIBER, Pondicherry University, Puducherry: INFLIBNET Centre.

19. Parekh, Y.R., & Parekh, P. (2009). Planning for digital preservation of special collections in Gujarat University Library. 7th International CALIBER, Pondicherry University, Puducherry: INFLIBNET Centre.

20. Praveen Singh, C. (2008). Digital Libraries Tool & Techniques. Alfa Publication, New Delhi.


21. Sanjay Kumar, S. (2006). Microfilm: a digital archiving method of preservation. New Spectrum, 2(1), 7-14.

22. Saravana, C.G., & Anjaiah, M. (2012). Digital preservation: issues and concerns. National Conference on Trends in Developing & Managing E-resources in Libraries, Karnataka, 278-282.

23. Arms, William Y. (2005). Digital Libraries. Ane Books, New Delhi.