Draft: - A. Gold (UCSD Libraries), K. Baker (SIO/LTER)

CDS@SDSC: A Prototype Document Management System (DMS) at SDSC Using CERN’s CERN Document System (CDS)

DEVELOPMENT PLAN OVERVIEW January 10, 2002

INTRODUCTION:

SDSC’s Integrative BioSciences (IBS) program is planning to implement an integrated document management system (DMS) based on the CERN document system (CDS). The IBS DMS will be an open, searchable, extensible, customizable repository of citations/records. It will provide a means to gather, extract, share, and discover documents carrying information about individuals and research associated with biology and bioinformatics.

PROJECT WORKING NAME:

 CDS@SDSC

PROJECT TEAM & TENTATIVE ASSIGNMENTS:

• Project Champion: Phil Bourne
• Project Management Team: Kim Baldridge, Dave Hart, Anna Gold, Karen Baker
• System Administrator: Christopher Smith
• Development Team:
  o CERN Team: Jean-Yves LeMeur (lead), Thomas Baron, Tibor Simko
  o SDSC Team: Frank Sudholt (lead), Christopher Smith, Ben Tolo
• Communications: Dave Hart
• Project Management Advisors: Mike Vildibill, Reagan Moore
• Security Issues: Erin Kenneally

PROJECT DEVELOPMENT MODEL:

The proposed development model is iterative and stepwise, cycling between testing and modifying the DMS. This reflects what usually occurs in developing a system, and so provides a more realistic schedule. We anticipate that the analysis, design, development, and quality assurance stages will recur throughout the first several phases or cycles. For example, analysis may occur in months 1, 4, and part of 6, while design occurs in months 2, 3, and 4, and again in months 6-7. Quality assurance also overlaps with the later development stages and ends concurrently with the implementation stage.

During the implementation stage, successful deployments of the software will also identify improvements, and managers will decide at each new overlapping cycle whether adding functionality and resources is feasible. This approach suits the CDS@SDSC project because it works best where funding is limited, parties are new at working together, repository requirements are (potentially) complex, and public exposure is potentially high.

SUPPORTED SCENARIOS (to be developed with SDSC users):

A. Submit/deposit scenario with review
B. Browse scenario
C. Search scenario
D. Upload scenario
E. Report / export scenario
F. Share scenario

DEVELOPMENT GOALS:

The goals outlined below are divided into three cycles: Pre-project development; development cycle I; and development cycle II. These cycles represent an anticipated linear progression from preparation to initial implementation, to enhancement and full implementation. However, activities of feasibility analysis, design, development, management, communication, quality assurance, and implementation will take place in overlapping iterations throughout the development cycles, based on what is learned in the process of pursuing the goals identified below.

Goals of the Pre-Project Development Cycle (July – December 2001)

1. Identify project needs:

• Manage access to documents and metadata in a system that supports the following key functions:
  o Gather structured resource information both in batches and via individual submissions.
  o Share collections of information within research networks.
  o Discover relevant research, partners, updated results, and linkages.
  o Report indicators of research funding outcomes and impacts via analysis of data in the system.

2. Establish solution requirements (see also FLOW presentation, November 9, 2001)

• compliant with standards such as the OAI
• modifiable (open, at least for purposes of modification by developers here)
• suitable for immediate needs and rapid deployment
• interactive with the research community outside the organization
• generalizable - usable for any number of differing "collections" (as CDS appears to be)
• extensible - a system we could build on, e.g., to add services such as citation counting/reporting, summarizing, etc.
• low initial cost
• use existing personnel and expertise
• available now
• rapid prototyping and development

3. Choose solution (CERN CDS selected, November 2001)

• CDS was determined to be:
  o compliant with standards such as the OAI
  o modifiable (open, at least for purposes of modification by developers at SDSC)
  o suitable for immediate needs and rapid deployment
  o interactive with the research community outside the organization
  o generalizable - usable for any number of differing "collections"
  o extensible - a system we can build on, for example, to add services such as citation counting/reporting, summarizing, etc.
  o adaptable to the requirement to manage access to resource descriptions of people
  o reasonably priced ($6000, academic pricing)
  o available now
  o available with technical support

4. Assemble resources (CDS Technology Transfer fee; hardware; programming and systems support; funding for student assistance).

• Resources assembled:
  o "Technology Transfer Workload" agreement (Migration - Training - Testing - Support - Maintenance)
  o On-site system experts to set up the server system, run the basic installations (MySQL, PHP, Apache) and learn how to handle/configure CDS.
  o Initial project planning team

Goals of Development Cycle I (January – July 2002)

1. Implement CDS software, associated protocols, and test all key features, including:

• Gather, defined as the ability to:
  o manage records and files consisting of metadata only, and of metadata plus documents/files to be stored on the DMS server
  o accept manual entry of individual records/documents for a variety of document types
  o accept manual corrections of individual records and documents
  o upload collections of records with or without associated files, from import sources such as EndNote databases or flat files
  o upload records from the SDSC relational database of personnel

• Extract and report, defined as the ability to:
  o export records from the DMS in a print-friendly format
  o export records from the DMS in XML format
  o email a record or selection of records from the DMS
  o add statistics capability; citation reporting?

• Discover, defined as the ability to:
  o support user browsing of the repository hierarchy
  o support users executing simple and advanced searches for items within a single DMS collection
  o support users executing simple and advanced searches for items across multiple DMS collections

• Share, defined as the ability to support:
  o export of single records or groups of records to a file format usable by EndNote
  o document submission workflow with review steps
  o collecting and sharing custom record collections with custom mailing lists

2. Begin to analyze how other SDSC research, such as the SRB, is relevant to the goals and implementation of CDS@SDSC, and establish internal partnerships to ensure compliance of the DMS with research-supported work on implementing distributed research repositories.

3. Identify appropriate audiences and venues for describing progress on the project to interested research communities.

4. Identify resources needed to manage the data resource over the next phase.

5. Review and modify original outline for next cycle of development and resources needed to implement the next cycle.

Goals of Development Cycle II (July 2002 – July 2003)

1. Improve and customize report generation utilities to meet needs of funding agencies.

2. Implement additional CDS functional modules, such as customized “baskets”, email alerting, on-the-fly file conversion, and citation extraction and analysis.

3. Extend use of SDSC repository to all SDSC and NPACI groups.

4. Integrate SRB technology with CDS.

5. Participate in research discussions with other repository developers.

“CARTOON” MODIFIED GANTT CHART

MILESTONES AND TARGET DATES:

Target dates depend on one another; see the ‘cartoon’ modified Gantt chart. Its purpose is to think through the schedule systematically and identify holes; it is not intended as prescriptive.

Pre-Project Cycle Milestones:

1. Identify requirements for a resource repository: completed October 2001
2. Identify and compare candidate technologies and support (OAI and CERN-CDS): completed October 2001
3. Select technology (CDS): approved November 2001
4. Identify existing repositories of data on which CDS@SDSC will depend: completed December 2001 (none for bibliographic / document-like data; SDSC personnel database for people/parties data)
5. Identify candidates for project team: completed December 2001

Cycle I Milestones:

1. Complete technology transfer agreement: January 2002
2. Assemble needed hardware and project team: January 2002
   • hardware on order, expected delivery end of January
3. Complete project plan: January 2002
4. Develop preliminary metadata requirements for documents (resources) and people (parties): January 2002
5. Develop data structure for documents (resources) and people (parties): February 2002
6. Establish initial security measures (identify groups and privileges, SDSC/NPACI security team review): February 2002
7. Complete data structure for documents (resources) and people (parties): February 2002
8. Install CDS software: January - February 2002
9. Implement initial metadata and visible hierarchy for documents: March 2002
10. Implement initial metadata and visible hierarchy for people/parties: March 2002
11. Demonstrate single record input and update for documents (resources) and people (parties): March 2002
12. Demonstrate batch record upload for documents (resources) and people (parties): March 2002
13. Modify CDS@SDSC to accommodate new or revised document data from IBS/SDSC: April 2002
14. Modify CDS@SDSC to accommodate import of people table from SDSC: April 2002
15. Demonstrate administrative review procedure for new records: April 2002
16. Demonstrate/describe project at one UCSD venue: May 2002
17. Report on relationship between CDS@SDSC and SRB
18. Determine feasibility of extending CDS@SDSC to other research networks, e.g. NPACI, Cal(IT)2
19. Describe project in national venue: July 2002 (JCDL, Portland, Ore.)

Proposed Cycle 2: July 2002 – July 2003

1. Demonstrate report generation in support of funding agencies' requests
2. Demonstrate customized “baskets” and email alerting
3. Demonstrate on-the-fly file conversion
4. Demonstrate citation extraction and analysis
5. Extend use of SDSC repository to all SDSC and NPACI groups
6. Integrate SRB technology with CDS
7. Participate in research discussions with other repository developers

TECHNICAL DOCUMENTATION:

1. Report number generation: /export/home/cds/public_html/submit/counters contains all counters for generating document numbers.

2. Document types supported: http://documents.cern.ch/EDS/v2_0/guide/english/documents.php or up-to-date version at:

3. Existing catalogs structure: http://weblib.cern.ch/cataloguemap.php

4. Uploading Procedures:

We wrote a simple script that uploads textual data dumped from the ALEPH commercial librarian system. The script is not generic (yet) but we can probably adapt it to your needs.

5. Metadata Structure:

Note that CDS uses the standard librarian MARC-21 format to store its bibliographic data. So, please send me: (1) an example of the structured text file you would like to upload from; (2) the preferred MARC-21 tags you would like to upload into.

Example:

,---- INPUT: ALEPH sequential format
| CER 0175429 TI L Monolithic charge-to-amplitude converter circuit
| CER 0175429 AU L Ikeda, H
| CER 0175429 AU L Ikeda, M
| CER 0175429 AU L Inaba, S
| CER 0175429 AU L Fujita, Y
| CER 0175429 KW L $$iradiation hardness
`----

,---- OUTPUT: desired MARC-21 format
| 001 175429
| 100 __ $a Tanaka, M
| 245 1_ $a Monolithic charge-to-amplitude converter circuit
| 650 __ $a radiation hardness
| 700 __ $a Ikeda, H $a Ikeda, M $a Inaba, S $a Fujita, Y
`----
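A conversion of this kind could be sketched roughly as follows. This is a minimal illustration in Python, not the CERN upload script. The tag mapping (TI to 245, KW to 650, AU to 100/700) is inferred from the example, and the rule that the first AU line becomes field 100 while the remaining ones are collected in field 700 is an assumption, so the author fields it prints differ from the example output. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical ALEPH field code -> MARC-21 tag mapping, inferred from the example.
TAG_MAP = {"TI": "245 1_", "KW": "650 __"}

# One ALEPH sequential line: base, system number, field code, "L", content.
LINE_RE = re.compile(
    r"^(?P<base>\S+)\s+(?P<sysno>\d+)\s+(?P<code>\S+)\s+L\s+(?P<value>.*)$"
)


def parse_aleph(lines):
    """Group (field code, value) pairs by system number."""
    records = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a data line
        # Drop a leading ALEPH subfield marker such as "$$i".
        value = re.sub(r"^\$\$\w", "", match["value"]).strip()
        records[match["sysno"]].append((match["code"], value))
    return records


def to_marc(sysno, fields):
    """Render one record as MARC-21 text lines."""
    out = [f"001 {int(sysno)}"]  # int() drops the leading zero
    authors = [value for code, value in fields if code == "AU"]
    if authors:
        # Assumption: the first AU line is the main author.
        out.append(f"100 __ $a {authors[0]}")
    for code, value in fields:
        if code in TAG_MAP:
            out.append(f"{TAG_MAP[code]} $a {value}")
    if len(authors) > 1:
        out.append("700 __ " + " ".join(f"$a {a}" for a in authors[1:]))
    return out


def convert(lines):
    """Convert ALEPH sequential lines to a flat list of MARC-21 lines."""
    out = []
    for sysno, fields in parse_aleph(lines).items():
        out.extend(to_marc(sysno, fields))
    return out


SAMPLE = [
    "CER 0175429 TI L Monolithic charge-to-amplitude converter circuit",
    "CER 0175429 AU L Ikeda, H",
    "CER 0175429 AU L Ikeda, M",
    "CER 0175429 AU L Inaba, S",
    "CER 0175429 AU L Fujita, Y",
    "CER 0175429 KW L $$iradiation hardness",
]

if __name__ == "__main__":
    print("\n".join(convert(SAMPLE)))
```

Keeping the mapping in a small table makes it straightforward to swap in whichever MARC-21 tags are preferred for a given collection.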

Various categories have been set up for various catalogues. This is completely catalogue-dependent and can be very easily adapted to special needs (when navigating from weblib.cern.ch within Photos, Preprints or Books you can see some categories). For the descriptive elements, as we are based on MARC21, the list of possible elements is almost unlimited. Let me know if you want me to forward you the MARC codes currently in use at CERN CDS.

All attributes of MARC 21 can be assigned to a document to describe it in the database. The element descriptions are a bit different: they are attributes at the user interface level, describing the fields entered by the document submitter. The distinction is that you often combine several user input fields into a single MARC field (e.g., conference name + conference town); you may also ask for information that is not stored in the database (such as the submitter's email address). So the mapping between element descriptions and MARC fields is not one-to-one.
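The many-to-one relationship can be illustrated with a small Python sketch. The element names (title, conf_name, conf_town, submitter_email) are hypothetical and are not taken from the CDS element list. MARC 21 field 111 (meeting name, with $a for the name and $c for the location) and field 245 $a (title) are standard, but whether CDS stores conference data in 111 is an assumption.

```python
def elements_to_marc(form):
    """Map submission-form elements to MARC fields (illustrative only).

    Returns a dict of MARC tag -> {subfield code: value}.
    """
    marc = {}
    if form.get("title"):
        marc["245"] = {"a": form["title"]}
    # Two user input elements are combined into one MARC field.
    if form.get("conf_name"):
        meeting = {"a": form["conf_name"]}
        if form.get("conf_town"):
            meeting["c"] = form["conf_town"]
        marc["111"] = meeting
    # "submitter_email" is collected on the form but deliberately not stored.
    return marc


if __name__ == "__main__":
    form = {
        "title": "An example submission",
        "conf_name": "Example Conference",
        "conf_town": "San Diego",
        "submitter_email": "someone@example.org",
    }
    print(elements_to_marc(form))
```

Here four form elements produce only two MARC fields, which is why a form definition and a MARC field list cannot simply be compared entry by entry.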

6. Metadata Fields:

Attributes/fields for documents: http://buddha.ucsd.edu/~cds/submit/manager/allElementsEDS.php

7. Network requirements:

• TCP/IP
• Internal bandwidth
• External bandwidth

8. Security requirements:

8. Hardware requirements:

Based on estimated scale of:

• simultaneous users/searches
• number of documents/records

• RAM: 512 MB
• second CPU
• storage space
• Unix (Sun or Linux)
• Hardware acquired:
  o SUN Starfire 280 (2 x 750 MHz UltraSparcIII, 8 MB cache, 4 GB memory, 72 GB disk)

9. Database requirements:

• MySQL (3.23.x, where x>23)
  i. httpd.conf should contain index.php within the DirectoryIndex directive.
  ii. php.ini should contain something like:
      upload_tmp_dir = /tmp
      upload_max_filesize = 40000000
      log_errors = on
      display_errors = off
      expose_php = off
      post_max_size = 40000000
      session.save_handler = user
      session.auto_start = 0
      session.name = SESSIONID
      session.gc_maxlifetime = 60000
      session.cookie_lifetime = 0
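A quick way to compare an installed php.ini against the values listed above might look like the following Python sketch. It is not part of CDS. The helper names and the sample file content are hypothetical, and the parser only handles plain "key = value" lines with ";" comments, which is an assumption about how the file is written.

```python
# Values quoted in this plan; treat them as a starting point.
RECOMMENDED = {
    "upload_tmp_dir": "/tmp",
    "upload_max_filesize": "40000000",
    "log_errors": "on",
    "display_errors": "off",
    "expose_php": "off",
    "post_max_size": "40000000",
    "session.save_handler": "user",
    "session.auto_start": "0",
    "session.name": "SESSIONID",
    "session.gc_maxlifetime": "60000",
    "session.cookie_lifetime": "0",
}


def parse_php_ini(text):
    """Collect key/value pairs, ignoring comments and [section] headers."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop ';' comments
        if not line or line.startswith("[") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def _norm(value):
    """Compare on/off flags case-insensitively; leave other values alone."""
    return value.lower() if value.lower() in {"on", "off"} else value


def find_mismatches(text):
    """Return {key: (found, wanted)} for settings that differ or are missing."""
    actual = parse_php_ini(text)
    return {
        key: (actual.get(key), wanted)
        for key, wanted in RECOMMENDED.items()
        if actual.get(key) is None or _norm(actual[key]) != _norm(wanted)
    }


# Hypothetical file content, used only to exercise the check.
SAMPLE = """\
[PHP]
upload_tmp_dir = /tmp
upload_max_filesize = 2M   ; too small for large documents
log_errors = On
display_errors = Off
expose_php = Off
post_max_size = 40000000
[Session]
session.save_handler = files
session.auto_start = 0
session.name = SESSIONID
session.gc_maxlifetime = 60000
session.cookie_lifetime = 0
"""

if __name__ == "__main__":
    for key, (found, wanted) in find_mismatches(SAMPLE).items():
        print(f"{key}: found {found!r}, plan suggests {wanted!r}")
```

In practice the text would be read from the server's actual php.ini rather than from the sample string.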

10. Software

• uploader:
