E U R O P E A N S O U T H E R N O B S E R V A T ORY Organisation Européenne pour des Recherches Astronomiques dans l'Hémisphère Austral Europäische Organisation für astronomische Forschung in der südlichen Hemisphäre

ESO - EUROPEAN SOUTHERN OBSERVATORY

DFS

NGAS Operations & Troubleshooting Guide

VLT-MAN-ESO-19400-3103

Issue 1 2003-07-28 45 pages

Prepared: J. Knudstrup 2003-07-28

Name Date Signature

Approved: M. Peron/B.Pirenne

Name Date Signature

Released: P. Quinn

Name Date Signature

ESO * TELEPHONE: (089) 3 20 06-0 * http://www.eso.org Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 2 of 45

CHANGE RECORD

Issue Date Affected Paragraphs(s) Reason/Initiation/Remarks 1 2003-07-23 All First version.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 3 of 45

TABLE OF CONTENTS

1. PURPOSE & SCOPE ...... 4 1.1 Reference Documents...... 4 1.2 List of Abbreviations/Acronyms & Terminology ...... 4 1.3 How to Read & Use This Manual ...... 6 1.4 How to Get Help or Report Problems with NGAS or this Manual ...... 6 1.5 NGAS Operators at the NGAS Sites...... 6 2. GENERAL RULES & GUIDELINES ...... 7 2.1 System Description...... 7 2.2 User Accounts “ngas” and “ngasmgr” ...... 7 2.3 Basic Guidelines & Rules ...... 8 3. OPERATIONS PROCEDURES...... 10 3.1 NGAS Host Shut Down/Reboot Procedure ...... 10 3.2 Disk Preparation ...... 10 3.3 Checking System Health During Operation ...... 12 3.4 Handling of Disk Change at the Archiving Sites ...... 13 3.5 Data Consistency Checking ...... 14 3.5.1 Data Inconsistency: Checksum not Available...... 16 3.5.2 Data Inconsistency: Not Registered Files Found ...... 17 3.5.3 Data Inconsistency: Checksum Illegal...... 18 3.5.4 Data Inconsistency: File in DB Not Found on Disk ...... 19 3.5.5 Data Inconsistency: Disk Activity LED Permanently On ...... 19 3.6 Data Migration...... 20 3.7 Creation of the Mondo Rescue/Installation CD...... 20 3.8 Installing NGAS Host with Mondo Rescue Image ...... 21 4. TROUBLESHOOTING PROCEDURES ...... 23 4.1 General Lines of Direction...... 23 4.2 Check that NG/AMS Server is Running Properly...... 24 4.3 Starting/Stopping the NG/AMS Server Manually...... 24 4.4 Command Level Server Interaction...... 25 4.5 Inspection & Interpretation of the NG/AMS Log File...... 26 4.6 Execution of NG/AMS Server in Verbose Mode...... 27 4.7 Inspection of NG/AMS Configuration...... 28 4.8 NGAS DB not Updated in NGAS WEB Interfaces ...... 28 4.9 Disk Activity Lamp Permanently On...... 29 4.10 NGAS WEB Service not Working Properly...... 29 4.11 Error Message Indicates that No Storage Sets are Available ...... 29 4.12 Problems Accessing DB, Data Back-Log Buffered ...... 30 4.13 NAU System/HW Failure ...... 31 4.14 NGAS Host Acoustic HW Alarm Sounding...... 31 5. NGAS WEB SITE ...... 33 5.1 NGAS WEB Site Main Entry Page ...... 33 5.2 NGAS WEB Site Main Page/ESOECF...... 34 5.3 NGAS WEB Site Main Page/OLASLS ...... 35 5.4 Disk Status Tool ...... 35 5.5 NGAS Disk Status Form...... 38 5.6 Host Status Tool ...... 40 5.7 Search for Archived Frames ...... 42 5.8 NGAS Last Night Report...... 44 5.9 Last Frames Archived ...... 45 5.10 Label Printer Tool...... 45 Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 4 of 45

1. PURPOSE & SCOPE

The purpose of this manual is to provide a set of procedures to be applied by the NGAS Operators at the various NGAS Sites in their daily handling of NGAS. In addition procedures are contained to assist the NGAS Operators in resolving problems that may occur during the operation of an NGAS System. Apart from the procedures contained in this document, information about NGAS in general is contained in the document, which may help the operator in getting a better overview of the system.

The basic goal with this manual is to make it possible for the NGAS Operators in general to operate autonomously, i.e., without intervention/assistance from the NGAS Garching Team. If a task cannot be solved or a problem addressed in a straightforward manner without the assistance from the NGAS Team, this is an indication that the manual should be enhanced in this area with an additional procedure or additional information. It is therefore important that the NGAS Operators inform the NGAS Team about missing information in this document. It is probably not possible to cover all kinds of difficulties that might be encountered with NGAS, and in some cases assistance from the NGAS Team will probably always be necessary.

This document is mostly directed towards the people carrying out the operation of NGAS at the various ESO NGAS Sites (Garching, La Silla and Paranal). If NGAS is used elsewhere, still many of the procedures and information is valid, but minor adaptations might be needed.

1.1 Reference Documents

The following documents are referred to in this user guide:

Reference Issue Date Title "DFS Software, Next Generation/Archive Management VLT-MAN-ESO-19400-2739 1 - System", Design Description, J.Knudstrup. “DFS Software, NGAS Acceptance Test Plan & Hands- VLT-PLA-19400-3100 1 - On Tutorial”, J.Knudstrup. “NGAS System Installation & Configuration Manual”, VLT-MAN-19400-3104 1 - J.Knudstrup/A.Piemonte/D.Suchar.

1.2 List of Abbreviations/Acronyms & Terminology

The following abbreviations are used in this document:

Abbreviation Explanation AHU Archive Handling Unit. BIOS Basic Input/Output System. DBMS Database Management System. HQ Head Quarters. HTTP Hypertext Transfer Protocol. HW Hardware. NAU NGAS Archiving Unit. NBU NGAS Buffering Unit. NCU NGAS Cluster Unit. NMU NGAS Master Unit. NG/AMS Next Generation Archive Management System. NGAS Next Generation Archive System. OLAS Online Archive System. Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 5 of 45

OS SW Software. XML Extensible Markup Language.

The following terminology is used in this document:

Term Explanation Data Disks Refers to the removable hard disk drives in an NGAS Host on which data files are stored. Data Production Site Is an NGAS Site where data is produced (typically observatory site). Disk ID A unique identifier, which is generated from the BIOS of a hard disk drive. Logical Name (Disk The Logical Name (or Disk Label) is a name allocated to each disk, which is human readable. Label) This is allocated dynamically for each disk according to a running, global index (number). A Disk Set consists of a Main Disk possibly together with a Replication Disk (Single-Disk Disk Main Disk Set versus Double-Disk Disk Set). The NGAS Disk Info XML Document, is a small file containing the important information about NGAS Disk Info a disk. This is stored in the root area on the disk, e.g. “/NGAS/data1/NgasDiskInfo”. NGAS Folder, The Is a folder containing all the relevant information (documentation) about NGAS. Should always NGAS Folder be at hand for the NGAS Operators to support the daily operation and special support. NGAS Host Is a computer (IBM compatible PC) running the NGAS run-time environment. NGAS Operator Person who takes care of the daily operations of the NGAS system. NGAS Site A site where one or more NGAS Hosts are operated. The NGAS SW is the SW used to handle archiving of data etc. It is the heart of NGAS. This SW NGAS SW package is called NG/AMS. Refers to an NGAS infrastructure installed e.g. at the telescope site. It usually consists of several NGAS System NGAS Hosts which are ‘synchronized’ via the NGAS DB. The NGAS Team is the NGAS Development and Support Team located in ESO Garching. The NGAS Team NGAS Team can be contacted using the email address: [email protected]. On each NGAS Host there must be two user accounts. These are named “ngasmgr” and “ngas”. Former is used to deal with issues related to the configuration of the NGAS Host, whereas latter NGAS User Accounts is the run-time account under which the NGAS SW is running. All data archived, log files and other files produced by NG/AMS, are owned by the user “ngas”. The NGAS WEB Site is used by NGAS Operators to get an overview of the system. It is for NGAS WEB Site instance possible to see which disks are mounted and where, and to see which files are archived on which disks. The main URL is: “http://jewel1.hq.eso.org:8080/NGAS”. The configuration which defines how the NG/AMS Server should operate. This file can be NG/AMS Configuration located following the link: “/etc/ngamsServer.conf”. An NGAS Host (or rather the NG/AMS Server) can be in Offline State. This means that the Offline State NG/AMS Server is running on the host, but it is not ready to carry out normal operation (handle Archive Requests, Retrieve Requests, and other requests. Is the Online Archive System running on the DHS machine. It carries out various basic checks OLAS (System) on the data to archive and distribute the data to various subscribers. An NGAS Host (or rather the NG/AMS Server) can be in Online State. This means that the Online State NG/AMS Server is running on the host, and is ready to carry out normal operation (handle Archive Requests, Retrieve Requests, and other requests). Replication Disk See Main Disk. To shut down an NGAS Host a simple procedure has been chosen whereby the NGAS Operator Shutdown Procedure simply has to press CTRL-ALT-DEL on the keyboard connected to the host to initiate the shutdown procedure. Test Case Is an instruction or a set of instructions to be carried out to test a certain property of the system. Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 6 of 45

Test Suite Is a collection of related Test Cases. Thread Is an independent sub-process running within the context of a given process on the OS. It is possible to run the NG/AMS Server in Verbose Mode whereby logging information is Verbose Mode written to “stdout” according to a level given as input parameter. Verbose Output Log information generated by the NG/AMS Server and written on “stdout”.

1.3 How to Read & Use This Manual

Operations Procedure are presented as follows:

Procedure:

Step 1 Description Step 2 Description … …

Great efforts have been invested in making these as precise as possible. It is therefore essential to follow the steps accurately and in case of misunderstandings, ambiguities or missing information, to report this to the NGAS Team to improve the Operations Procedures.

Additional background information is given in info boxes as shown here:

Information

These contain important/useful information that assists the NGAS Operator in carrying out the tasks/operation and which may help giving the NGAS Operator a more profound understanding of the system and how it works. This may help the operator in solving problems in case something is not working as foreseen.

1.4 How to Get Help or Report Problems with NGAS or this Manual

In case problems are encountered using NGAS, bug/problems reports can be submitted via email to the NGAS Team in Garching:

[email protected]

This also goes for questions and other assistance needed in connection with the operation of the system.

The NGAS WEB Site:

http://jewel1.hq.eso.org:8080/NGAS also provides assistance for the normal operation and for troubleshooting in case problems are encountered.

1.5 NGAS Operators at the NGAS Sites

The following email addresses can be used to contact the NGAS Operators at the various ESO NGAS Sites:

[email protected] – NGAS Operators at HQ.

[email protected] – NGAS Operators at La Silla.

[email protected] – NGAS Operators at Paranal. Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 7 of 45

2. GENERAL RULES & GUIDELINES

This chapter contains some general rules to be applied by the NGAS Operators when managing the NGAS System and some guidelines to assist in this process as well.

2.1 System Description

In the NGAS terminology an NGAS Host is a PC running (RedHat Linux in the case of ESO) with a special configuration on top of the standard Linux installation. On each NGAS Host, the NGAS SW (the NG/AMS Server), must be running. The NG/AMS Server is a process dedicated to handle all the services provided by the NGAS System. This is e.g. archiving and retrieving of data, cloning of disks and files, recycling of disks and more.

The HW solution in use at present, is based on removable hard disk drives whereby up to eight such removable disks can be hosted per NGAS Host.

An NGAS Installation at the data production site (mostly the telescope sites) usually consists of one or more NGAS Archiving Units (NAUs), NGAS Buffering Units (NBUs) and an Archive Handling Unit (AHU). These names do not mean that there are any differences in the basic configuration of the HW, OS or SW of the machines. The names merely refer to the role of each of the NGAS Hosts. This role is determined by the NG/AMS Configuration.

The NAU is the unit handling the actual archiving of data. The NBU is used to keep the data on the Replication Disks permanently online and to check the data on the disk, until the Main Disk arrives at the HQ. The NBU is also used as hot- swappable/stand-by machine for the NAU. If there are HW problems with the NAU, it should be possible to change the NBU into an NAU within 30 minutes to continue the archiving of data.

The AHU is used by the NGAS Operators to prepare disks (at HQ), to create copies of data and disks (data/disk cloning), to carry out Data Consistency Checking on Data Disks, and other tasks that require an NGAS System for doing operation on the disks and on the data stored on these.

The role of an NGAS Host is determined by the NG/AMS configuration parameters” “Allow Archive Requests”, “Allow Retrieve Requests” and “Allow Remove Requests”.

An NAU would usually allow to archive data only. An NBU possibly to retrieve data but not to archive data. An AHU to archive, retrieve and remove data. Note, a fourth parameter in this category of parameters exists, which makes it possible to enable/disable data processing on the NGAS unit.

In an NGAS Cluster configuration two additional types of units are used. These are NGAS Main Unit (NMU) and NGAS Cluster Unit (NCU). Former is the main node and entry point to the cluster, whereas latter are the sub-nodes in the cluster. The NMU usually acts as proxy for the nodes so that clients communicate only with the NMU, which forwards the requests e.g. for data to the sub-nodes and returns the data to the requestor.

2.2 User Accounts “ngas” and “ngasmgr”

On an operating NGAS Host, two user accounts are used to handle the system. These are:

1. User “ngas”: This user is the ‘operational account’. Under this account the NGAS SW is running. All data on the NGAS Data Disks, and other data created and updated by NG/AMS are owned by this user.

2. User “ngasmgr”: This account is used to handle the NGAS specific part of the configuration of an NGAS Host.

The NGAS Operator should pay attention to as which user to use when logging into an NGAS Host. Usually (s)he should use the “ngas” account to log in to e.g. check the conditions of the system or to investigate a problem. If as a result of these Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 8 of 45 investigations, it is necessary e.g. to change the NG/AMS Configuration or other NGAS specific configuration, this should be done as “ngasmgr”.

Apart from the users “ngas” and “ngasmgr” there is also a “root” user as on any standard Linux/UNIX system. The “root” user is used, as is common practice, to manage the basic OS installation.

The root password should be available on each NGAS Site, but the usage of this account is discouraged in general, and it is recommended only to use it if a problem in the OS configuration needs to be addressed. This should normally always be coordinated however, with the NGAS Team.

In general, the passwords of the “ngas”, “ngasmgr” and “root” accounts are only known to the NGAS Operators of each site. I.e., not even the NGAS Team control these and are aware of these.

2.3 Basic Guidelines & Rules

In order to maintain a stable system and to keep the overview of an NGAS installation, the NGAS Operators are kindly requested to follow some basic guidelines and rules. These are:

• Data Persistency: The most fundamental rule for an archive system is that no data should be lost ever. The system itself should provide the necessary services and features to prevent from loss of data, but cooperation from the operator’s side to support this and a thorough understanding of the system from the operator’s side, is also very important to make this possible. Each operator is asked to consider carefully before carrying out any action that might compromise this basic rule of avoiding any loss of data.

• Manual Data Handling: It is strictly forbidden to delete, move or modify contents on the NGAS Data Disks on the NGAS Operators own initiative. Only if explicitly indicated to do so by the Operations Procedures or agreed with the NGAS Team to carry out such changes this can be done. In any case, under normal circumstances it is not necessary to touch manually the information on the Data Disks. This is only necessary in case of problems or abnormal operational conditions.

• Access Restriction: The passwords for the NGAS User Accounts “ngas” and “ngasmgr” should be allocated by the NGAS Operators at each NGAS Site. These should not be given to anybody outside this group of people.

• System Administration: The root password should be known to the NGAS Operators, but only used in case of emergency. The root password could also be known to the system administrators of the site to enable them to assist in handling more ‘complex’ problems with the system. In general, the operator should not change in any way the configuration, OS parameters or set-up of the NGAS User Accounts.

• Usage of NGAS User Accounts: Logging into the NGAS Hosts as user “ngas” or user “ngasmgr” should usually be avoided unless explicitly requested by the Operations Procedure to do so or in case of troubleshooting/problem solving requiring to log on to the host. The NGAS WEB Interfaces should be used first to address a possible problem.

• Manipulating Configuration: Logging into the NGAS Hosts as user “ngasmgr” and introducing changes in the configuration without prior consulting the NGAS Team should be avoided. If changes are introduced, these should be reported to the NGAS Team so that this can be recorded and in the case of e.g. an NG/AMS Configuration, this configuration be updated in the configuration control repository.

• System Back-Up: If changing the configuration is necessary, a new Mondo Installation CD of the new configuration should be created and replace the old one in the “NGAS Folder”.

• Disk Handling: It is forbidden to unlock/lock or remove/insert disks while an NGAS Host is switched on. This should only be done when the host is powered off.

• Host Power On/Off: It is forbidden to power off an NGAS Host just by pressing the power button. Likewise it is forbidden to press the reset button on the front panel while the system is operating. The built-in mechanism based on the CTRL-ALT- DEL keys combination should be used to ensure a clean shutdown. Only if for some reason the machine ‘hangs’ it might be necessary to shut it down by pressing the power button. Before applying this option however, it should be tried to use the ‘reset’ button.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 9 of 45

• Temporary/Work Files/Directories: Temporary files can be put under a “tmp” directory under the NGAS User Accounts i.e. “/home/ngas/tmp” and “/home/ngasmgr/tmp”. It is recommended to avoid ‘contaminating’ these accounts with files and folders and it is recommended to clean up after having created such temporary files.

In general, in case of doubts when it comes to carry out maintenance or troubleshooting actions it is always good to first consult the NGAS Team. If this is not possible, it goes without saying, that the most important ‘rule’ is to keep the system operating. Just remember to inform the NGAS Team about the actions carried out. Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 10 of 45

3. OPERATIONS PROCEDURES

The following sections contain descriptions of how to carry out the most common procedures for the normal operation of the NGAS System. If new cases are discovered by the NGAS Operators, these should be reported to the NGAS Team to update the procedures accordingly. Procedures for recovering NGAS Hosts/an NGAS installation from a problem are contained in the Troubleshooting Chapter (chapter 4) of this manual.

3.1 NGAS Host Shut Down/Reboot Procedure

One of the operations, which an NGAS Operator has to carry out quite frequently, is to shut down an NGAS Host or to reboot it. This is done when disks are inserted and removed from the host, as well as an easy way to restart the server SW running on the host. The Reboot Procedure is as follows:

Procedure: Reboot NGAS Host

To shut down an NGAS Host, press the keys CTRL-ALT-DEL simultaneously. Shut Down NGAS Host The host will then subsequently perform a clean shut down and will turn itself off after completing the shutdown procedure. To switch on an NGAS Host simply press the power button.

The OS on the NGAS Host, subsequently prepares the host for operation, and starts up various servers, install drivers etc. as is normally done when booting up a Linux (UNIX) system. During this phase also the NG/AMS Server is started and will go Online automatically.

When the system is ready for operation the login prompt will appear. Switch on NGAS Host To check that an NGAS Host has initialized properly the easiest thing to use is the “Host Status Tool” in the NGAS WEB Site (section 5.6). The entry for the NGAS Host should indicate that the NG/AMS Server is Online and thus ready to carry out its tasks.

To check if the system is operating, it is also possible to issue a STATUS command to the NG/AMS Server, i.e.:

% ngamsCClient –port 7777 –host -status –cmd STATUS

Check that the output written on the terminal indicates that the NG/AMS Server is Online.

A system shutdown with the previous HW + OS configuration takes about 30 s. The same goes for the OS start-up procedure. The subsequent NG/AMS start-up procedure may take some longer since it takes a bit of time to mount the Data Disks.

3.2 Disk Preparation

The disk preparation should be done using the AHU. If Disk Sets are prepared consisting of two disks, a Storage Set defined as consisting of two disks should be used. If single Main Disks are prepared, a Storage Set should be used, which is defined as consisting of only one disk.

In the case of Two-Disk Storage Sets, the slots with odd numbers (1, 3, 5, 7) are normally used for the Main Disks, whereas the even numbered slots (2, 4, 6, 8) host the Replication Disks.

It is not possible at the moment to prepare Replication Disks without an accompanying Main Disk. Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 11 of 45

The steps to carry out to prepare a disk are as follows:

Procedure: Disk Preparation

On the console connected to the NGAS Host, shut down the AHU (section 3.1). Prepare AHU Remove disks already inserted in this unit. Insert Disks to Prepare Insert the disks of the Disk Sets to be prepared. Boot AHU Switch on the AHU and wait until the machine has booted up. If there are no file systems already on the disks, log in as user “ngas” and execute the command:

% sudo ngasDiskFormat

Create file systems on the disks by entering the device name. If a file system already exists on the disk, the “ngasDiskFormat” tool will reject the request.

After the “ngasDiskFormat” tool has been executed, the disk is ready to be prepared by NGAS.

With the present HW, the mapping between the OS device name (“/dev/*”) and the NGAS mount point (“/NGAS/data*”) is as follows: File System Creation Device Name Mount Point /dev/sda1 /NGAS/data1 /dev/sdb1 /NGAS/data2 /dev/sdc1 /NGAS/data3 /dev/sdd1 /NGAS/data4 /dev/sde1 /NGAS/data5 /dev/sdf1 /NGAS/data6 /dev/sdg1 /NGAS/data7 /dev/sdh1 /NGAS/data8

The mapping shown, is valid for a system with 8 disks inserted.

Reboot the AHU (section 3.1). While the NG/AMS Server is going Online, it will register the disks inserted in the AHU in the NGAS DB.

If a disk is already registered in the NGAS DB its already existing entry in the NGAS DB is updated with the new location of the disks (Host Name, Mount Point, Slot ID).

Each disk is identified by a unique ID called the Disk ID. This is generated from information extracted from the BIOS of the disk and is allocated to the disk during its complete lifetime as an ‘NGAS disk’. NGAS Disk Registration When a disk is registered by NG/AMS, a small XML document named “NgasDiskInfo” is created under the Mount Point of the disk, e.g.:

% ls -l /NGAS/data4/NgasDiskInfo -rw-rw-r-- 1 ngas ngas 503 Apr 14 16:12 /NGAS/data4/NgasDiskInfo

This XML document contains the same basic information about the disk as written in the NGAS DB like the Logical Name (Disk Label), Disk ID, Completion Date and more. If the disk is put Online in an NGAS System (NGAS DB), where it was not yet registered, an entry is created in the NGAS DB based on the information in the “NgasDiskInfo” document.

Label Disks After the system has entered Online State, the new disks have been registered as Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 12 of 45

NGAS Data Disks and have been allocated a Logical Name (Disk Label Name). This Disk Label can be checked in the NGAS WEB Site (section 5.4).

A label can be printed in case a label printer is connected to an NGAS Host, using the Disk Label Tool (section 5.10). If a label printer is not available, print the labels manually, using any label printer device and stick them onto the disks. If it is needed to ship off the disks to a remote NGAS Site, this should be done normally in the water proof and shock resistant ‘Peli Cases’ (the black suitcases), which have been bought and have been prepared for this. This is the safest way of transporting them.

Disk Shipment If for some reason such a Peli Case is not available, the disks can be packed in carbon boxes and carefully wrapped into some shock absorbent material to avoid physical damages.

The disks should be sent with the Diplo-Bag Service when sent internally between the various ESO Sites.

Note that disks may also be prepared for local usage at each site.

3.3 Checking System Health During Operation

During operation of an Archiving Unit, it is important to monitor the health and condition of the system to ensure that data arrives at its predefined destination. If an Archiving Unit becomes in-operational, data might accumulate at the location providing data to NGAS. Depending on the buffer capacity of this system, this system may end up being saturated and further operation blocked.

In the case of the NGAS archiving systems at La Silla, Paranal and in Garching, the easiest way to check that the frames are being archived as they should, is to compare the OLAS DB (“observations.data_products”) with the NGAS DB (“ngas.ngas_files”). These two DBs should be in-sync what concerns the type of data archived into NGAS.

WEB based tools have been implemented to facilitate the checking of this. For the moment, these are provided for:

• MIDI: Check only for frames from the instrument MIDI. • VINCI: Check only for frames from the instrument VINCI. • VLTI: Check for frames produced by VLTI (MIDI + VINCI). • WFI: Check for frame produced by WFI.

These tools can be accessed via the NGAS WEB Site Main Page (section 5.2) under “Last Frames Archived”. Clicking this link brings up the “Last Frame Page” (section 5.9), which is updated automatically every 5 minutes.

If a frame is not registered in the NGAS DB but in the OLAS DB, the field with the value for the “NGAS DB” will turn red. This is an indication that the operator needs to intervene. Consult the chapter about troubleshooting to obtain more information, which may help in solving the problem.

Apart from this, no other actions are required during operation to verify the proper functioning of NGAS. Note that it is also important to monitor the condition of OLAS, in case DHS is used as data provider for NGAS. How to monitor the proper functioning of OLAS is another issue not dealt with in this manual.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 13 of 45

3.4 Handling of Disk Change at the Archiving Sites

During operation of an Archiving Unit, NG/AMS continuously monitors the free space on the Data Disks. If a disk is considered full, it will be marked in the NGAS DB as ‘completed’ and the “NgasDiskInfo” XML document on the disk updated accordingly. From that point on, NG/AMS will not try to store more data on that disk.

Before the disk gets completed, the system sends out a Disk Space Notification Message to the Subscribers of this event. Such a message looks as follows, e.g.:

Subject: NGAS-w2p2nau-7777: Notice: Disk Space Running Low Date: Sat, 19 Jul 2003 21:12:52 -0400 (SAT) From: [email protected]

Notification Message:

Disk with ID: IC35L080AVVA07-0-VNC400A4C16RZA - Name: ESO-ARCH-M-000099 - Slot No.: 3 - running low in available space (4996 MB)!

Note: This is just a notice. A message will be send when the disk should be changed.

Note: This is an automatically generated message

This message serves the purpose of reminding the NGAS Operators, that a Disk Set soon will become completed and it should therefore be checked if there is at least one other Disk Set with free disk space available for that Data Stream in the NAU.

When a Disk Set gets completed, a Disk Change Notification Message is sent to the subscribers of this event. Such a message looks as follows, e.g.:

Subject: NGAS-w2p2nau-7777: CHANGE DISK(S) Date: Sun, 20 Jul 2003 07:13:02 -0400 (SAT) From: [email protected]

Notification Message:

PLEASE CHANGE DISK(S):

Main Disk: - Logical Name: ESO-ARCH-M-000099 - Slot ID: 3

Replication Disk: - Logical Name: ESO-ARCH-R-000099 - Slot ID: 4

Note: This is an automatically generated message

This is an indication that this Disk Set is full and should be removed from the NAU and replaced by another set with free disk space. Changing disks, should usually be done during idle time, i.e. when no data is being produced.

When replacing Disk Sets, the following procedure should be applied:

Procedure: Disk Exchange

Shut Down the NAU Shut down the NAU applying the standard procedure (section 3.1). Remove Completed Unlock and remove the completed Disk Set(s) as indicated by the Disk Change Disk Set(s) Notification Message(s). Replace Completed Replace the completed Disk Set(s) in the NAU with new Disk Set(s), which are not yet Disk Set(s) completed. Power on the NAU and check in the Disk Status Tool (section 5.4) that the new disks Switch On the NAU are registered as Online and that the system seems to be operational as such. Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 14 of 45

Shut Down the NBU Shut down the NBU applying the standard procedure. Insert the Replication Disk(s) of the completed Disk Set(s) into the NBU. The Replication Disks contain an “-R-“ in the Disk Label. Insert Replication Disk(s) in NBU If there are no free slots, remove the required number of completed Replication Disks from the NBU choosing the oldest one(s) first. Subsequently insert the new, completed Replication Disk(s) into the NBU. Power on the NBU and check in the NGAS WEB Interface if the disks inserted are Switch On the NBU correctly registered as Online (section 5.4). Place Replication Place the completed Replication Disk(s) possibly removed from the NBU in the Disk(s) in Disk Archive NGAS Disk Archive Storage, which should provide a safe place for storing these. Send Main Disk(s) to Send the completed Main Disk(s) to the Main NGAS Site (ESO HQ) in the Peli Cases HQ provided for this. This should be sent via the ESO ‘diplo-bag’. Check Stock of Data Check the stock of Data Disks at the site. If it seems as though the amount of disks on Disks stock is low, send a reminder to the NGAS Operators at HQ to request more disks.

Such a disk change, is a minor operation, which can usually be carried out within few minutes.

It is possible to use Disk Sets, which consists of disks of different size, e.g. to use 80 GB Main Disk together with 200 GB Replication Disks. If this is done, the number in the Disk Label of the two disks in the set will no longer be aligned. This in is without importance. Using such Disk Sets with differently sized disks, the disks will become completed individually and should be changed only when completed as indicated by the Email Notification Message.

The threshold configuration parameter in the NG/AMS Configuration, indicating when to send out a message indicating that the disk is becoming full, is called “NgamsCfg.Monitor:MinFreeSpaceWarningMb”.

The threshold parameter indicating to NG/AMS when the disk should be considered as full/completed in the NG/AMS Configuration, is called “NgamsCfg.Monitor:- FreeSpaceDiskChangeMb”.

To avoid shortage of disk space at the remote data production sites, it is recommended that HQ Operations and the NGAS Site Operators keep an eye on the remaining disk capacity at the Data Production Sites. This can be done using the NGAS WEB Interfaces (section 5.4), and at the Data Production Sites by checking manually the disks in stock. No special tools are supported for the moment to automatically remind the operators that disks should be requested. Note, it takes some time from ordering new disks from HQ till these disks actually arrive at the remote NGAS Site. This is even more true if it is necessary to purchase new disks. For this reason it should be attempted to have enough disks for at least 6 weeks of operation on stock at the Data Production Sites. It is up to the NGS Operators to estimate that this is the case.

3.5 Data Consistency Checking

The main tasks operating an Archive Facility are:

1. Operate and maintain the online archive facility to ensure that all data gets properly stored and registered.

2. Deliver data to the users/clients requesting this in an efficient manner.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 15 of 45

3. Ensure that the data in the data holdings are kept in a good condition, and are preserved as long as required. In principle it is desirable to keep the data forever, since also historical data is important for many projects.

The Data Consistency Checking feature of NG/AMS addresses point number three in the list above. Although the checking itself is automatic, some manual interventions might be required from the NGAS Operators side in order to carry out the checking. These are:

1. To insert the disks hosting data to be checked if this is not permanently online.

2. To deal with the possible Data Inconsistencies found if required.

It is attempted to reduce the work load produced by the Data Consistency Checking to a minimum, to avoid consuming too many resources for operating the Archive Facility.

For the Main Archive Facility at HQ, until now, all data is permanently online, and a Data Consistency Check is scheduled every two days. I.e., this runs in background and the NGAS Operator in this case only needs to react if Data Inconsistencies are found and a Data Inconsistency Notification Message is sent out.

The Data Consistency Checking is done using the AHU of the NGAS Site, to check data, which is not permanently online. The procedure to carry out the Data Consistency Check is as follows:

Procedure: Data Consistency Checking

Shut Down the AHU Shut down the AHU according to the standard procedure (section 3.1). Insert Disks to be Remove the disks possibly inserted in the AHU, and put these back in their permanent Checked storage location. Insert the next set of disks to be checked in the AHU. Power on the AHU. After the system has booted up and has mounted the disks, NG/AMS will identify that the disks are due for checking and will start to carry out the checking. Switch on AHU

In the present configuration (8 x 80 GB disks) the checking takes approximately 10 hours to carry out. After the checking has been carried out, the system will send out a Data Inconsistency Notification Message in case Data Inconsistencies were encountered. If a Notification Message was produced, the NGAS Operator should handle/resolve the problem(s) reported. Refer to the sub-sections below to learn how to deal with Data Inconsistencies.

Note that for the moment no message is send out when no inconsistencies were found. The operator thus has to wait enough time to ensure that the check is completed. As long as the checking is on-going, it is easy to see that there is a repeated/permanent disk access activity on the disk currently being checked.

Handle Possible Data If a disk activity LED is constantly on, this probably means that there is a problem with Inconsistencies Found that disk (bad blocks). Such bad blocks may mean that the system gets stuck and manual intervention is needed. Consult section 3.5.5 to get information about how to handle this situation.

In one of the next releases it will be possible to configure the system to send out an Email Notification Message also when no Data Inconsistencies were found, to explicitly indicate to the NGAS Operator that the check has finished and the next set of disks can be checked.

A feature to indicate the progress of an on-going checking will (probably) also be supported soon, to make it possible for the NGAS Operator to better control the checking and in case of problems with a storage media blocking the checking to intervene accordingly.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 16 of 45

Re-Check Disks if If Data Inconsistencies were encountered and were re-solved by the NGAS Operator, Problems Found + the disks should be checked again. To do this, simply reboot the AHU (section 3.1) Resolved and wait until the check has been executed. Remove Checked After a Data Consistency Check has been executed and in case no problems were Disks, Insert New encountered, the disks checked can be removed and the next set of disks scheduled for Disks checking inserted in the AHU.

The problems that can be encountered during a Data Consistency Checking and how to handle these problems, are described in the following sub-sections.

A basic rule in connection with the handling of Data Inconsistencies is that it is absolutely forbidden to execute any kind of file or directory deletion commands on the Data Disks (disks mounted under "/NGAS", e.g. "/NGAS/data*").

If is seems necessary to remove a file from a Data Disk, use the "mv" shell command to move to the following destination:

"/home/ngas/NGAS_MOVED_FILES_--

"

- e.g.:

"/home/ngas/NGAS_MOVED_FILES_2003-07-23".

A "README" file in this directory should contain a description of the files in the directory and why they were moved.

Make a back-up of the "/home/ngas/NGAS_MOVED_FILES-" directory to CD- ROM/DVD. Subsequently this directory can be deleted. This scheme is suggested to ensure that no data is lost.

In the future, NG/AMS will provide a better service for dealing with such files.

Since several people may be involved in carrying out the Data Consistency Checking, it is essential to carry out the checking in a systematic way and probably to keep a record of the progress.

To determine the order in which checking the disks should be checked, is best done by using the Disk Labels, i.e., the queue of disks for checking should be based on the FIFO principle so that the ‘oldest’ disks are checked first.

To ensure that the data holding is in a good condition it is desirable to carry out a check of each disk at least every six months.

The configuration parameter “NgamsCfg.FileHandling:DataCheckMinCycle” indicates how often a check should be scheduled on a media. This should be defined as 12 hours for an AHU to ensure that the checking is done immediately also, in the case of a re-checking. NG/AMS keeps track of when the last check was carried out and only schedules the next one when the next cycle is due, according to the configuration parameter.

3.5.1 Data Inconsistency: Checksum not Available

While a file is archived, NG/AMS generates a checksum for the destination file as it is stored on disk. This checksum is stored in the NGAS DB and used for the periodic Data Consistency Checking, to check that the file is still intact. If for some reason this checksum is not available in the DB, NG/AMS will generate it during the Data Consistency Checking and write it in the DB. The Email Notification Message which is sent out, has the following appearance (example):

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 17 of 45

Subject: NGAS-wlsahu-7777: DATA INCONSISTENCY(IES) FOUND Date: Fri, 25 Jul 2003 22:24:24 -0400 (SAT) From: [email protected]

Notification Message:

DATA INCONSISTENY(IES) FOUND IN DATA HOLDING: Date: 2003-07-26T03:34:59.632 NGAS Host: wlsahu Inconsistencies: 1200

Problem Description File ID Version ------NOTICE: Missing checksum - generated WFI.2002-01-01T05:41:42.838 1 NOTICE: Missing checksum - generated WFI.2002-01-01T05:43:28.352 1 NOTICE: Missing checksum - generated WFI.2002-01-01T05:45:14.937 1 ------

Note: This is an automatically generated message

This error should never occur. If this situation occurs this can either be an indication that something is wrong in the DB or that something was not working properly during the handling of the Archive Request. No further action is needed apart from ensuring that the NGAS Team is informed about this.

For the first generation of disks produced with NGAS, the checksum feature was not yet implemented and therefore the message will be generated the first time the disk is checked. However, since the feature has been supported for quite a while by now, it is not likely that it appears anymore.

3.5.2 Data Inconsistency: Not Registered Files Found

If files are found on a Data Disk, which are not registered in the NGAS DB and which are not one of the NGAS files/directories, a message of the following type is generated (example):

Subject: NGAS-ngasdev2-7777: DATA INCONSISTENCY(IES) FOUND Date: Mon, 28 Jul 2003 17:51:34 +0200 (MEST) From: [email protected]

Notification Message:

NOT REGISTERED FILES FOUND ON STORAGE DISKS: Date: 2003-07-28T15:49:22.799 NGAS Host: ngasdev2 Number: 1

Filename: ------/NGAS/data1/ThisFileShouldNotBeHere ------

Note: This is an automatically generated message

In order to deal with this problem the following procedure should be applied:

1. Determine if the file is a garbage file, which can be moved from the disk or not. In case of doubts, consult the NGAS Team. 2. If the file is a garbage file, DO NOT DELETE IT (section 2.3/”Manual Data Handling”). 3. Create a directory with the name "/home/ngas/NGAS_MOVED_FILES_--

", e.g.: "/home/ngas/NGAS_MOVED_FILES_2003-07-23". 4. Move the spurious file(s) into the directory created (use the “mv” shell command). 5. Update the README file in the temporary data directory created with explanations why the various files were moved away form the disk. Possibly Email Notification Messages could be pasted into the README file as well. 6. When other possible spurious files have been moved to the back-up directory, write the contents of this to a CD-ROM. 7. Delete the back-up directory ("/home/ngas/NGAS_MOVED_FILES_--
"). 8. Reboot the NGAS Host to launch the Data Consistency Checking again. 9. If no problems are reported the disks in the AHU are clean and can be moved back to their long-range storage location.

Better tools to support this procedure will be supported in a future release of NGAS.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 18 of 45

A more tricky case is if such a message is generated for something that looks like data files, e.g.:

Subject: NGAS-jewel71-7777: DATA INCONSISTENCY(IES) FOUND Date: Wed, 23 Jul 2003 17:35:05 +0200 (MEST) From: [email protected]

Notification Message:

NOT REGISTERED FILES FOUND ON STORAGE DISKS: Date: 2003-07-23T17:33:01.140 NGAS Host: jewel71 Number: 8

Filename: ------/NGAS/data7/saf/2001-05-08/4/TEST.2001-05-08T15:25:00.123.fits.Z /NGAS/data7/saf/2001-05-08/5/TEST.2001-05-08T15:25:00.123.fits.Z /NGAS/data7/saf/2001-05-08/6/TEST.2001-05-08T15:25:00.123.fits.Z ------

Note: This is an automatically generated message

In this case it is better to contact the NGAS Team to decide what to do. In the case above, apparently the files archived are some kind of test files and probably with no value. However, in order not to risk to loose any data, it is better to double- check/consult the NGAS Team to figure out which actions should be taken to resolve the situation.

3.5.3 Data Inconsistency: Checksum Illegal

If a file gets ‘corrupted’ due to deterioration of the storage material of the Storage Media or if somebody deliberately changes/edits the contents, the checksum of the file becomes invalid, and the following Email Notification Message is sent out (example):

Subject: NGAS-ngasdev2-7777: DATA INCONSISTENCY(IES) FOUND Date: Mon, 28 Jul 2003 11:46:53 +0200 (MEST) From: [email protected]

Notification Message:

DATA INCONSISTENY(IES) FOUND IN DATA HOLDING: Date: 2003-07-28T09:44:42.277 NGAS Host: ngasdev2 Inconsistencies: 1

Problem Description File ID Version Slot ID:Disk ID ------ERROR: Inconsistent checksum found TEST.2001-05-08T15:25:00.123 4 1:IC35L080AVVA07-0-VNC400A4G0KYAA ------

Note: This is an automatically generated message

The action to be carried out to handle this problem is simply to get a copy of the files from another NGAS Site, and to either:

1. Archive the file(s) in question into the NAU. This has the disadvantage of creating two copies of the file since a Main and a Replication File will be created. 2. If a spare disk is available, a Disk Set with a single disk (Main Disk) could be prepared in the AHU and the file archived onto this.

In general option 2 is preferred and each NGAS Site is requested to have such a single-disk Disk Set ready for this purpose.

When this type of inconsistency is encountered, the files will be marked as ‘bad’ in the DB and will not be taking into account by future Data Consistency Checking or when retrieving data files from NGAS.

It is advantageous if the operator also sets the Ignore Flag of the file (“ngas_files.ignore”) to 1 to indicate that is should not be attempted to access this file ever again. No tool exists at the moment to facilitate easy manipulation of the Ignore Flag though. Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 19 of 45

3.5.4 Data Inconsistency: File in DB Not Found on Disk

If a file previously archived file is no-longer found on the disk as indicated in the NGAS DB, the following type of Notification Message is sent out (example):

Subject: NGAS-ngasdev2-7777: DATA INCONSISTENCY(IES) FOUND Date: Mon, 28 Jul 2003 11:57:18 +0200 (MEST) From: [email protected]

Notification Message:

DATA INCONSISTENY(IES) FOUND IN DATA HOLDING: Date: 2003-07-28T09:55:07.174 NGAS Host: ngasdev2 Inconsistencies: 1

Problem Description File ID Version Slot ID:Disk ID ------ERROR: File in DB missing on disk TEST.2001-05-08T15:25:00.123 4 1:IC35L080AVVA07-0-VNC400A4G0KYAA ------

Note: This is an automatically generated message

This problem could be encountered if a file has been deliberately/manually deleted by somebody (should never be the case of course), or if deterioration of the storage material of the Storage Media results e.g. in a ‘lost reference’ to that file.

The action(s) to take to resolve this problem is/are the same as for “Data Inconsistency: Checksum Illegal” (section 3.5.3).

3.5.5 Data Inconsistency: Disk Activity LED Permanently On

A special case is if a Storage Media is malfunctioning and therefore blocks the Data Consistency Checking, possibly the complete NGAS Host. This may happen if e.g. a Data Disk has Bad Blocks on it. Since the Data Consistency checking is a scan of almost the complete storage surface of the Storage Media, it is very likely that this problem will be encountered every now and then. Note that there is a well-estimated probability that this will occur for a disk, in particular as the disk gets more used/older.

The actions to carry out to address this problem are as follows:

1. Find the file that blocks the data checking (the NGAS Host) using the NGAS File Search Tool (section 5.7). The “File Status” field should have the value ‘01%’. If several files are found as a result of this query, this should be investigated as well. 2. Get the File ID + File Version of the files in question and request it from an alternative NGAS Site. 3. Mark the file to be ignored in the NGAS DB. No tool exists for the moment to support this action so it must be done with great care logging into the NGAS DB as user “ngas” and executing the following SQL statements:

begin transaction update ngas_files set ignore=1 where file_id=’’ and file_version= and disk_id=’’ commit transaction

4. Launch the checking of the disk again to see if the problem has been resolved. 5. When the file has been obtained from the alternative NGAS Site, it can either be archived onto the NAU or onto the AHU if this has a Single-Disk Disk Set available for this.

An NGAS tool to make the handling of this situation easier, will be made available for the operator in a future release.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 20 of 45

3.6 Data Migration

As part of the efforts to maintain a data holding in a good condition, all data should be migrated from the current media (disks) to new ones on a regular basis. Basically this consists in re-archiving all the data on to new media. The Clone Service of the NG/AMS Server is used for this purpose. Another purpose of this data migration is to compress data by using storage media with higher density and in this way reduce the number of media.

A migration of data to maintain data consistency, should take place every 3 years.

A procedure to do this, has been defined in Garching and is in use at HQ. This procedure however, is under refinement and a procedure suitable for usage also at the remote NGAS Data Production Sites is not yet available. Before this data migration is due at the Data Production Sites, a new and user friendly procedure will be put in place.

3.7 Creation of the Mondo Rescue/Installation CD

To be able to restore an operating NGAS System it is very important to have a Mondo Installation (Rescue) CD generated for that particular host. All the NGAS Hosts are very similar, but there will always be a few differences like different NG/AMS Configurations for different types of NGAS Hosts, network set-up, and a few other things. A Mondo Installation CD will install the NGAS Host from which it was generated exactly as it is. It is possible to re-install an NGAS Host from the Mondo CD within 15 minutes with very little intervention from the operators side.

The procedure to create the Mondo Image is very simple:

Procedure: Creation of Mondo Rescue CD

Log in as user “root” to the NGAS Host for which it is desirable to create the Mondo Log In to NGAS Host Image. A script is provided to generate the Mondo Image. This should be executed by issuing the command: Execute Mondo Image Creation Script # /opsw/packages/bin/make-mondo-all

- on the Linux shell. When the Mondo Image is ready, the image file can be found at the location:

Write Mondo Image “/root/images/1.iso” on a CD-ROM Transfer the image file to a computer with access to a CD-ROM writer and write it on a CD-ROM.

For further information about the Mondo Tool, refer to: http://www.microwerks.net/~hugo. The Mondo Tool seems to be the best option for the moment to handle such a quick and easy restoring of a system.

The tool provided by the NGAS project to generate Mondo Rescue Images is called “make-mondo-all”. It performs an unmounting of NGAS Data Disks and carries out a clean-up of the NGAS Host. If the NG/AMS Server is running on the host when the script is launched, the server is first shut down and subsequently started again after the creation of the Mondo Rescue CD has terminated.

Don’t ever abort the make-mondo-all script while a Mondo Rescue Image is being generated. This will leave the system in an undefined condition. If this happens, contact

the NAGS Team..

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 21 of 45

3.8 Installing NGAS Host with Mondo Rescue Image

Installing an NGAS Host with a Mondo Installation CD is a simple process, which can be carried out within approximately 15 minutes. The procedure defined for this is:

Procedure: Mondo Rescue CD Installation

Shutdown the Host Shut down the host (section 3.1). Unlock Possible Data Unlock/take out possible NGAS Data Disks which might be inserted in the system. Disks For safety reasons it is recommended to unplug the network connection from the Disconnect NGAS Host NGAS Host before initiating the installation of the Mondo Rescue Image. This is done from Network to avoid that the host e.g. comes up with an undesirable host name. Insert Mondo Rescue Insert the Mondo Rescue CD in the CD-ROM drive in the NGAS Host. To do this, CD in CD-ROM simply boot up the host and insert the CD. Reboot the host (section 3.1). Type “nuke” + enter at the prompt appearing after the Reboot NGAS Host host has booted from the Mondo Rescue CD. Answer Trivial Questions During Answer some trivial questions at the end of the procedure. Installation After the Mondo Rescue installation has been completed, the NGAS Host should be Reboot the Host rebooted. If the NGAS Configuration Assistant prompts for information, feed the proper information to the Configuration Assistant, to set up the system properly. Feed Configuration Assistant with The NGAS Configuration Assistance is a small tool, which queries hostname, network Information Requested parameters and some other parameters for the NGAS host and updates the system configuration files accordingly. This tool will only be launched once, at the first reboot after a new installation.

When the installation and configuration of the host seem to be completed, the network can be plugged into the system again. Plug In Network To see if the system is properly configured try to log into another host from the newly installed host. Try also to log into the newly installed host from another host. If Data Disks were removed from the host at the beginning of the procedure, shut Insert Data Disks down the host and insert the disks again. Subsequently boot up the host again. Check System After the system has booted up, check in the NGAS WEB Interfaces that the NG/AMS Condition Server is running as expected.

The complete installation with the additional step to possibly insert Data Disks, can be carried out in approximately 20 minutes, which is thus the time it takes to e.g. transform an NBU into an NAU.

A Mondo Rescue CD contains a bootable image of the host from which it was generated. In order to ensure that the system boots from the CD, the system BIOS should be configured accordingly, so that it is first tried to boot from the CD-ROM and subsequently from the system disk.

When the re-installation from the Mondo Rescue CD has been initiated, all data on the system disk will be erased. No back-up is created. Make sure that no data, which Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 22 of 45

shouldn’t have been deleted are lost in this way. In general it is recommended not to store any value data on the system disks of the NGAS Hosts; see also section 2.3/“Temporary/Work Files/Directories”.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 23 of 45

4. TROUBLESHOOTING PROCEDURES

The purpose of the Troubleshooting Procedures is to recover an NGAS Host or an NGAS System for a serious problem causing it to malfunction. The Troubleshooting Procedure are mostly directed towards the NGAS operation at the Data Production Sites where it is desirable to enable the NGAS Operator to resolve problems in an autonomous manner and as quickly as possible. Problems encountered operating the NGAS Cluster at HQ are not that critical to solve in a very short time and more resources are available to deal with such problems.

If new cases (the new types of problems) are discovered, not yet covered in this manual, it is very important to report such problems to the NGAS Team to have a proper procedure established to handle these and to have it documented.

If as a result of troubleshooting/problem solving, changes are introduced in the NG/AMS Configuration or in any other part of the configuration of an NGAS Host, it is very important that the NGAS Operator informs the NGAS Team about this change.

In general, no changes should be introduced in the configuration of an NGAS Host without prior consultation with the NGAS Team. Only if an urgent problem needs to be solved to ensure continued operation and if it is not possible to get in contact with the NGAS Team changes can be introduced. The NGAS Team in any case, should be informed about the changes introduced.

4.1 General Lines of Direction

The easiest manner in which to detect a problem in the handling of data archiving, is to use the Last Frames Archived WEB interface (section 5.9). If the NGAS field turns red, it means that there is a problem either with NGAS (the NAU) or a problem for instance in connection with the provider of data to NGAS (e.g. DHS) to deliver frames to NGAS. In general, a problem can be located in one of the following places:

1. NGAS Host (HW or OS). 2. NG/AMS Server. 3. Provider of data to NGAS (e.g. DHS). 4. DBMS. 5. Network.

To identify properly from where the problem origins, may require some experience from the NGAS Operator’s side and is also not easy to document in a manual like this.

In general if an NGAS Host starts to mal-function the first basic action that can be taken to attempt to resolve the problem, is to reboot the NGAS Host (section 3.1).

If this does not resolve the problem, it is necessary to investigate further and to find the origin of the problem. The following sections provide various information, which can be used to address and identify problems.

If rebooting the NGAS Host resolved the problem it is advantageous if the NGAS Operator sends the NG/AMS Log File (section 4.5) to the NGAS Team (section 1.4) to make it possible to investigate the cause of the error.

In general, when a problem has been resolved, it is better if the NGAS Operator can provide log files and other information to the NGAS Team to make it possible to analyze the problem.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 24 of 45

4.2 Check that NG/AMS Server is Running Properly

To check if an NG/AMS Server is running properly on an NGAS Host, either the Host Status Tool (section 5.6) can be used, or in case of doubts it is also possible to log into the NGAS Host and to issue a standard “ps” shell command to verify if the server seems to be running as it should:

[ngas@ngasdev2 ngas]$ ps -efww |grep ngams ngas 1293 1036 0 08:30 pts/5 00:00:01 python ngamsServer.py -v 3 -cfg /tmp/ngamsServer.conf -autoonline -force ngas 1311 1293 0 08:31 pts/5 00:00:00 python ngamsServer.py -v 3 -cfg /tmp/ngamsServer.conf -autoonline -force ngas 1312 1311 0 08:31 pts/5 00:00:05 python ngamsServer.py -v 3 -cfg /tmp/ngamsServer.conf -autoonline -force ngas 1313 1311 0 08:31 pts/5 00:00:08 python ngamsServer.py -v 3 -cfg /tmp/ngamsServer.conf -autoonline -force ngas 1314 1311 0 08:31 pts/5 00:00:00 python ngamsServer.py -v 3 -cfg /tmp/ngamsServer.conf -autoonline -force [ngas@ngasdev2 ngas]$

When executing this command, a number of processes are listed. This is due to the fact that each thread running within the process context of the server is displayed as an individual process. Normally five processes should be displayed if the server is Online/Idle, these are:

1. Main Thread: The main command handling thread of the NG/AMS Server. 2. Python Thread Manager Thread: Internal thread to handle the threading environment in Python. 3. Janitor Thread: The sub-process taking care of cleaning up the NGAS environment and updating DB and DB Snapshots. 4. Data Consistency Checking Thread: The sub-process that carries out the Data Consistency Checking periodically. 5. Data Subscription Thread: The thread taking care of delivering data to possible Subscribers of this

If the server is handling a request from a client, more than five processes will be shown. When the server finishes processing a request, the thread created to handle the request will be removed and subsequently only the five sub-processes listed above will remain if the server is not handling another request.

Other ways to check if the server is running properly, is to issue commands directly to the server and to analyze the response from the NG/AMS Server, see section 4.4.

4.3 Starting/Stopping the NG/AMS Server Manually

The NG/AMS Server is usually started/stopped automatically when an NGAS Host boots up and is shut down. For troubleshooting purposes it might be useful to be able to start/stop the server without having to boot the NGAS Host.

To start the server in background mode, the NG/AMS Server Init Script can be used by issuing the following command as user “ngas”:

% /etc/init.d/ngamsServer start

To stop the server one of the three following methods can be used:

1. Use NG/AMS Start-Up Script: As user “ngas” launch the NG/AMS Init Script:

% /etc/init.d/ngamsServer stop

2. Issue OFFLINE and EXIT Commands: By issuing the NG/AMS commands “OFFLINE –force” and “EXIT” the NG/AMS will terminate execution, e.g.:

[ngas@ngasint ngas]$ ngamsCClient -port 7777 -host ngasdev2 -status -cmd OFFLINE -force [ngas@ngasint ngas]$ ngamsCClient -port 7777 -host ngasdev2 -status -cmd EXIT Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 25 of 45

3. Kill at OS Level: If none of the previous suggested methods to stop the server work, it is possible to kill the NG/AMS Server at OS level by querying the PID and subsequently killing it. First a ‘soft kill’ should be tried (“kill ”), if that does not work a hard kill could be tried (“kill -9 ”). Note, it might be necessary to kill all processes in the context of the NG/AMS Server, but usually it is enough to kill the main process (with the lowest PID). E.g.:

[ngas@ngasdev2 ngas]$ ps -efww |grep ngams ngas 2162 1036 2 13:22 pts/5 00:00:01 python ngamsServer.py -v 3 -cfg /tmp/ngamsServer.conf -autoonline -force ngas 2180 2162 0 13:22 pts/5 00:00:00 python ngamsServer.py -v 3 -cfg /tmp/ngamsServer.conf -autoonline -force ngas 2181 2180 0 13:22 pts/5 00:00:00 python ngamsServer.py -v 3 -cfg /tmp/ngamsServer.conf -autoonline -force ngas 2182 2180 0 13:22 pts/5 00:00:00 python ngamsServer.py -v 3 -cfg /tmp/ngamsServer.conf -autoonline -force ngas 2183 2180 0 13:22 pts/5 00:00:00 python ngamsServer.py -v 3 -cfg /tmp/ngamsServer.conf -autoonline -force [ngas@ngasdev2 ngas]$ kill 2162

After having killed the server the NGAS Host is not operational anymore. After having done such manual stopping and start- up of the server, the NGAS Host should be rebooted to start up in the normal way.

4.4 Command Level Server Interaction

Since the communication protocol of the NG/AMS Server is based on HTTP, it is easy to interact with the server for instance from a standard WEB browser, or by using another tool to issue and handle HTTP requests, e.g. “wget” if installed.

To check explicitly if an NG/AMS Server is capable of handling commands (HTTP requests), a STATUS command can be issued, e.g.1:

1 Note that whether it is possible to display the reply from the NG/AMS Server, which is based on XML, depends on how the WEB browser used is configured to handle mime-types, and the capabilities of the WEB browser. Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 26 of 45

From the output above it can be concluded that the server is capable of handling commands and that it is Online, which means that commands can be handled as such. Of course, there might be other problems blocking a proper handling of a command.

To communicate with the server it is also possible to user the two command utilities provided with the NG/AMS Package. These are the NG/AMS C-Client (ngamsCClient) and the NG/AMS P-Client (ngamsPClient). These two tools and “wget” are installed on all NGAS Hosts. They can be provided also on Solaris/HP-UX if needed, and installed on other hosts to facilitate communication with the NG/AMS Server.

To issue to the STATUS command using the NG/AMS C-Client enter the following command on the shell:

[ngas@ngasint ngas]$ ngamsCClient -port 7777 -host ngasdev2 -status -cmd STATUS

4.5 Inspection & Interpretation of the NG/AMS Log File

A service has been implemented to retrieve the NG/AMS Server Log File remotely via the HTTP command interface. The URL to do this is: “http://:/RETRIEVE?ng_log,“ e.g.:

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 27 of 45

In this way it easy to obtain a copy of the log file for further investigation of a problem. A copy of the log file is also useful for the NGAS Team if it becomes necessary to assist in solving a problem. Having retrieved the log file as indicated above, it is easy subsequently to send it to the NGAS Team (section 1.4). Note that the log file obtained is a snapshot of the log file, made when the command is issued.

For all operational NGAS Hosts, the NG/AMS Log File is archived on the NAU on a daily basis. It is therefore possible to retrieve earlier versions of the log files for investigation of problems if necessary.

It is also possible to inspect the log files directly on the NGAS Host by logging in as user “ngas”, using the command:

% tail –f /NGAS/ngams_staging/log/LogFile.nglog

In this way log entries are printed to the terminal as they are produced by the NG/AMS Server.

If needed it is possible to increase the Log Level, which determines the amount of logs produced by the NG/AMS Server. This is a number in the interval 0 to 5 whereby 0 indicates no logging and 5 is the highest level, which produces a very detailed log output. Changing the Log Level can be done in two ways:

1. Online via an HTTP Request: It is possible to change the Log Level while the server is running by means of the following URL:

http://://CONFIG?log_local_log_level=1

2. Via the NG/AMS Configuration: In this case the parameter “NgamsCfg.Log:LocalLogLevel” should be adjusted. It is strongly recommended not to change this parameter in the configuration since increasing this on a permanent basis may result in a huge amount of logging information produced. Only to maybe monitor the system better for a limited period of time in connection with a problem that may appear randomly, this may be used.

The format of the contents of the log file is defined as follows:

[] [:::<>:]

- e.g.:

2003-07-17T11:31:00.420 [INFO] Received command: OFFLINE [ngamsCmdHandling.py:cmdHandler:42:4421:Thread-5]

The can take the following values: “EMERGENCY”, “ALERT”, “CRITICAL”, “ERROR”, “WARNING”, “NOTICE”, “INFO” and “DEBUG”. The most commonly used ones by NG/AMS, are the ones in bold/italics font.

As no dedicated tool is supported for the moment for browsing the NG/AMS Log File, to scan a log file e.g. for logs of type “ERROR” can most easily be done using the “grep” shell command, e.g.:

[ngas@ngasdev2 ngas]$ grep ERROR /NGAS/ngams_staging/log/LogFile.nglog 2003-07-18T13:20:58.870 [ERROR] Error occurred while bringing the system Online: NGAMS_AL_NO_DISKS_AVAIL:3013:ALERT: ALERT -- NO DISKS AVAILABLE IN THIS NGAS SYSTEM! NGAS ID: NGAS-ngasdev3-7777. Host ID: ngasdev3. Check HW and try again! [ngamsSrvUtils.py:handleOnline:313:690:MainThread] 2003-07-29T08:30:44.160 [ERROR] NGAMS_ER_ILL_CMD:4016:ERROR: Illegal command: cfg received. Rejecting request! [ngamsServer.py:reqCallBack:1035:1292:Thread-1] 2003-07-29T13:53:18.830 [ERROR] Error occurred while sending reply to: ('134.171.27.39', 2242). Error: [Errno 21] Is a directory [ngamsServer.py:reqCallBack:1035:2312:Thread-11]

This is not the most user friendly way to check the log file, but in the near future hopefully better tools can be provided.

4.6 Execution of NG/AMS Server in Verbose Mode

To analyze in detail if an NG/AMS Server is operating properly, it is possible to run it in Verbose Mode and to inspect the output written in the terminal in which the server is launched. In order to do so, the NG/AMS Server should first be stopped if running. Checking if the NG/AMS Server is running is explained in section 4.2. If the server is running it can be stopped by applying the guidelines given in section 4.3.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 28 of 45

When it is ensured that the server is not running, it is possible to start the server in Verbose Mode, e.g.:

[ngas@ngasdev2 ngamsServer]$ python ngamsServer.py -v 1 -cfg /tmp/ngamsServer.conf -autoonline -force 2003-07-29T13:28:26.230:ngamsServer.py:init:1464:2209:MainThread:INFO> NG/AMS Server version: v2.1.4/2003-07- 14T17:16:20 2003-07-29T13:28:26.230:ngamsServer.py:init:1465:2209:MainThread:INFO> Python version: 2.1 (#1, May 17 2001, 10:44:27) [GCC 2.96 20000731 ( Linux 7.1 2.96-81)] 2003-07-29T13:28:26.230:ngamsServer.py:handleStartUp:1527:2209:MainThread:INFO> Loading NG/AMS Configuration: /tmp/ngamsServer.conf ... 2003-07-29T13:28:26.330:ngamsServer.py:handleStartUp:1529:2209:MainThread:INFO> Allow Archiving Requests: 1 2003-07-29T13:28:26.330:ngamsServer.py:handleStartUp:1531:2209:MainThread:INFO> Allow Retrieving Requests: 1 2003-07-29T13:28:26.330:ngamsServer.py:handleStartUp:1533:2209:MainThread:INFO> Allow Processing Requests: 1 2003-07-29T13:28:26.330:ngamsServer.py:handleStartUp:1575:2209:MainThread:INFO> Logging properties defined as: Sys Log: 1 - Sys Log Prefix: DFSLog - Local Log File: /NGAS/ngams_staging/log/LogFile.nglog - Local Log Level: 3 - Verbose Level: 1 2003-07-29T13:28:26.330:ngamsServer.py:handleStartUp:1595:2209:MainThread:INFO> Connecting to DB (DB: TESTSRV - User: ngas) ... 2003-07-29T13:28:26.340:ngamsDb.py:__init__:240:2209:MainThread:INFO> Using Sybase module version: 0.13 … 2003-07-29T13:28:28.650:ngamsServer.py:handleStartUp:1630:2209:MainThread:INFO> Initializing HTTP server ... 2003-07-29T13:28:28.650:ngamsServer.py:serve:1654:2209:MainThread:INFO> NG/AMS HTTP Server ready (Host: ngasdev2 - Port: 7777) ... 2003-07-29T13:28:30.950:ngamsDataCheckThread.py:dataCheckThread:107:2229:DATA-CHECK-THREAD:INFO> Data Check Thread starting iteration ... 2003-07-29T13:28:32.730:ngamsDataCheckThread.py:dataCheckThread:363:2229:DATA-CHECK-THREAD:INFO> NGAMS_INFO_DATA_CHK_STAT:3020:INFO: Number of files checked: 7. Amount of data checked: 0.357 MB. Time for checking: 1.747s.

Note that the Verbose Output shown above have been shortened considerably.

It should be considered probably to use a higher Verbose Level as the one used above (=1) to get more information about the execution. A value os 3 seems often to be appropriate for normal troubleshooting.

When commands are issued or when the NG/AMS Threads are active, this can be monitored in the terminal with the NG/AMS Server running in Verbose Mode.

To terminate a server running in Verbose Mode, the methods described in section 4.3 can be used or, even easier, CTRL-C (^C) can be pressed in the terminal where the server is running.

4.7 Inspection of NG/AMS Configuration

If it is desirable to inspect the NG/AMS Configuration to understand the configuration of the NG/AMS Server, this can be done in two ways:

1. Retrieve via HTTP Interface: The URL to retrieve it is:

http://:/RETRIEVE?cfg

2. Log into NGAS Host: It is also possible to log into the NGAS Host as user “ngas” and to access the file directly. The file is stored at the following location:

/NGAS/ngams_staging/log/LogFile.nglog

If option 2 is used, note that as user “ngas” it is not possible to change the configuration and it should be avoided to log in as user “ngasmgr” to change it.

4.8 NGAS DB not Updated in NGAS WEB Interfaces

If in the NGAS WEB Interfaces, entries for disks (e.g. section 5.4) , archived files (e.g. section 5.7) or information about an NGAS Host (section 5.6) apparently, are not updated properly this means that the information is not updated correctly in the NGAS DB at the HQ. This can be due to

1. NG/AMS has problems accessing the local DB, the local DB server could be down or not working properly. Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 29 of 45

2. In the case of a remote NGAS Site, that the DB replication from the DB of the remote site to the NGAS DB at HQ is not working properly.

To find out which of these two cases apply, the NG/AMS Log File could be scanned for errors accessing the local NGAS DB (section 4.5):

1. If no errors are found indicating such problems and the server seems to be running properly (section 4.2), the problem seems to be located in the HQ NGAS DB or in the replication between the remote site and the HQ NGAS DB. In this case, simply inform the NGAS Team about the problem (section 1.4). 2. If errors are found, which could indicate a problem in accessing the DB, the local DBMS should be checked independently of NGAS to see if it is working properly. The NGAS Host should also be rebooted.

4.9 Disk Activity Lamp Permanently On

If an NGAS System seems not be to be working properly and if one of the LEDs on the Disk Cases (Disk Activity Lamps) seems to be permanently on, this is an indication that there is something wrong with that disk. This is usually due to a bad blocks on the disk, which might block the system in an I/O action on that disk.

To recover from this problem the easiest is to take out the given Disk Set from the machine, in particular for an NAU, to continue normal operation as soon as possible.

If a new Disk Set is inserted in the same slots where the disks that were previously removed were residing, and a disk in the same slot which previously had problems shows the same behavior, the problem could be caused by faulty HW in the NGAS Host, probably related to the internal disk controller controlling the 8 removable disk slots. In this case the two slots of the Disk Set should not be used, and the unit should probably be taken out of operation for maintenance. I.e., in the case of an NAU, the NBU should be converted into an NAU (section 3.8).

In any case the NGAS Team should be reported to agree upon what to do to deal with the problem.

4.10 NGAS WEB Service not Working Properly

If the NGAS WEB Site cannot be accessed successfully, this is normally caused by two things:

1. Network Connection to HQ Not Available: To test this, another service residing at the HQ could be accessed from the remote site to test if there is connection between the site and HQ. If this is not the case, it will be necessary to wait until the connection to HQ has been restored. Possibly the people responsible for the link to HQ should be informed. 2. NGAS WEB Site Server Down: If the problem seems to be that the NGAS WEB Server running at HQ is not running properly or not running at all, the NGAS Team should be informed. A service to carry out a remote restart is not provided.

4.11 Error Message Indicates that No Storage Sets are Available

If the following type of Email Notification Messages are sent out by NGAS:

Subject: NGAS-ngasdev2-7777: NO DISKS AVAILABLE Date: Tue, 29 Jul 2003 16:51:34 +0200 (MEST) From: [email protected]

Notification Message:

NGAMS_AL_NO_STO_SETS:3014:ALERT: No Storage Sets found for mime-type: image/x-fits

Note: This is an automatically generated message

Subject: NGAS-ngasdev2-7777: PROBLEM ARCHIVE HANDLING Date: Tue, 29 Jul 2003 16:51:34 +0200 (MEST) From: [email protected] Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 30 of 45

Notification Message:

NGAMS_ER_ARCHIVE_PUSH_REQ:4011:ERROR: Problem occurred handling Archive Push Request! URI: SmallFile.fits.

Note: This is an automatically generated message

- and/or if the following type of log entry is found in the NG/AMS Log File (and in the log file of “ngamsIngest” running on the DHS host, if this is the data provider for NGAS):

NGAMS_WA_BUF_DATA:4022:WARNING: Problems occurred while handling file with URI: SmallFile.fits. Data will be Back-Log Buffered, and attempted archived at a later stage. Previous error message: NGAMS_AL_NO_STO_SETS:3014:ALERT: No Storage Sets found for mime-type: image/x-fits. [ngamsServer.py:reqCallBack:1035:2363:Thread-1

- this is an indication that the NG/AMS Server does not find an available set of disks to handle the Archive Request. In this case, the file will be Back-Log Buffered and it will be attempted to handle the files at a later stage. It goes without saying, that to resolve this problem, new Disk Sets should be inserted in the NAU (section 3.4).

It important to react promptly to this situation, since the buffer capacity on NGAS Host for Back-Log Buffering data is limited.

If the Back-Log Buffer on the NGAS Host is saturated, the data will be buffered on the host hosting the provider of data to NGAS (e.g. DHS).

For the moment, the Email Notification Messages will be send out each time it is tried to archive a file. Also for this reason it is important to resolve this problem as quickly as possible.

In a future release the repeated distribution of this message could be suppressed in this case to avoid to overflow the mail system/mail boxes with this kind of email messages.

In addition, probably only one of the two messages above should be distributed.

4.12 Problems Accessing DB, Data Back-Log Buffered

If the following type of Email Notification Message is sent out by NGAS:

Subject: NGAS-ngasdev2-7777: PROBLEM ARCHIVE HANDLING Date: Tue, 29 Jul 2003 17:26:29 +0200 (MEST) From: [email protected]

Notification Message:

NGAMS_ER_ARCHIVE_PUSH_REQ:4011:ERROR: Problem occurred handling Archive Push Request! URI: SmallFile.fits.

Note: This is an automatically generated message

- and/or if the following type of log entry is found in the NG/AMS Log File (and in the log file of “ngamsIngest” running on the DHS host if this is the data provider for NGAS):

NGAMS_WA_BUF_DATA:4022:WARNING: Problems occurred while handling file with URI: SmallFile.fits. Data will be Back-Log Buffered, and attempted archived at a later stage. Previous error message: NGAMS_ER_DB_COM:2002:ERROR: Problems communicating with the DB: Error: connection is not open.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 31 of 45

- this is an indication that the NG/AMS Server has problem accessing the NGAS DB. In this case the procedure to recover is as follows:

1. Verify that NGAS DB Accessible: Verify by connecting directly to the DB Server in use (e.g. by using “isql”) that the NGAS DB is accessible. If this is not the case, contact the responsible for the operation of the DBMS on the given site. 2. Reboot NGAS Host: Reboot always the NGAS Host to see if this resolves the problem.

Read also the consideration with respect to the NGAS Back-Log Buffer described in section 4.11.

4.13 NAU System/HW Failure

If the NAU seems to have severe problems causing it to malfunction either due to HW or due to the system (OS), the following can be tried out

1. Swap NAU with NBU: Re-install the NBU with the Mondo Rescue CD generated for the NAU to convert it into an NAU (section 3.8). 2. Analyze Problem: Take the NAU out of operation, and convert the NAU into an NBU by installing it with the Mondo Rescue CD of the ‘normal’ NBU. The problem can now be analyzed offline with interfering with the operation/archiving of data.

If it seems as though the problems with the faulty unit persist, it may be necessary to do HW intervention. In any case the NGAS Team should be informed and will indicate how to address the matter.

If the previous faulty host does not give any problems, re-installing the machine may have been the solution. It could be tried to swap back the machines at a later stage.

4.14 NGAS Host Acoustic HW Alarm Sounding

If an NGAS Hosts starts ‘beeping’ this is an indication that there is a problem with the HW. The HW alarm can be triggered by two main events:

1. Power Supply Faulty or Off: On the back-plane of an NGAS Host, two power supply units can be seen. If one of these is switched off (should not be the case!), the system will issue the acoustic alarm. Simply switch it on again. If both power supplies seem to be switched on, it might be that one of the power supplies is mal-functioning. This unit should be removed and replaced by a working unit. Note, that an NGAS Host may be able to continue to operate with only one power supply. To determine if the alarm is related to a power supply, push the red button above the power supply units. If the alarm stops the error origins from one of the power supplies. 2. Internal Ventilator Faulty: If one of the internal ventilators in an NGAS Host starts to mal-function, e.g. if the rotations per time unit is too low, the alarm will also become active. It is important to switch off the host immediately after having identified which ventilator is mal-functioning, and replace this unit before the HW might be damaged. Try to identify the faulty unit by opening the NGAS Host and inspecting the various ventilators.

It is suggested to have some spare power supplies and spare ventilators to be able to fix simple HW failures on the site without having to send the faulty NGAS Host for repair to HQ.

Normally switching off the alarm using the ‘red buttons’ on the back plane, should be exercised with great care, as if the alarm is really indicating a severe problem, this might lead to severe damages in the HW of the NGAS Host.

If one of the internal temperature sensors indicate that an essential temperature is too high, the mother board of the NGAS Host may switch off automatically. It could be tried to switch on the unit after leaving it for a while to cool. If it switches off again, there might be a serious HW problem with the unit.

In case it seems difficult to handle the HW problem on the remote NGAS Site, the NGAS Team should be consulted to decide whether to send back the unit to HQ for repair.

If the faulty unit is an NAU, this should be replaced by another unit. See section 3.8 for a Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 32 of 45

description how to do this.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 33 of 45

5. NGAS WEB SITE

The purpose of the NGAS WEB Site is to support the NGAS Operators carrying out the day-to-day operations but are also used for troubleshooting to check the state of the system.

The main NGAS WEB server can be accessed following the URL:

http://jewel1.hq.eso.org:8080/NGAS

Note, this site is only accessible for internal ESO users.

In the following sections the various tools provided are briefly described.

5.1 NGAS WEB Site Main Entry Page

When issuing the URL of the NGAS Zope WEB Server, the following page is displayed:

The most significant fields/links in this page are:

Field/Link Description ESOECF Garching Link to the part of the interfaces accessing the NGAS DB at HQ. OLASLS La Silla Link to the part of the interfaces accessing the NGAS DB at La Silla.2 Zope Backup Servers Link to alternative NGAS WEB Servers. This service is not available for the moment

2 Note, there is no support for accessing directly the NGAS DB at Paranal. Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 34 of 45

but may be re-enabled later.

5.2 NGAS WEB Site Main Page/ESOECF

The services provided by the NGAS WEB Interface accessing the NGAS DB at HQ (“ESOECF”), has the following appearance3:

3 Only the actual contents of this page is shown here to save space. Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 35 of 45

The links in this page should be self-explanatory if the descriptions in the page are read.

5.3 NGAS WEB Site Main Page/OLASLS

This interface is almost the same as the ESOECF one. Only difference is, that the DB queries are carried out in OLASLS and not in ESOECF in Garching. See description of ESOECF site in the section 5.2.

5.4 Disk Status Tool

The Disk Status Tool is used to query information about the various NGAS Disks known to the system:

The most significant fields and links in this page are:

Field/Link Description Tool Bar: Define new Query Using this tool it is possible to refine the query to restrict the disks shown to e.g. only Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 36 of 45

show disks from a specific host. All Disks Show all disk. This is the default when entering the Disk Status Tool. All Main Disks Show all Main Disk known to this NGAS installation. All Replication Disks Show all Replication Disks known to this NGAS installation. Show all disks currently mounted in an NGAS Host. I.e., disks that are offline are not Mounted Disks shown. Show disks stored at La Silla. This goes both for free Disk Sets and for Replication La Silla Storage Disks which are kept at La Silla. Show disks stored at Garching. This goes both for free Disk Sets and for Garching Storage Main/Replication Disks which are kept at HQ. Show disks stored at Paranal. This goes both for free Disk Sets and for Replication Paranal Storage Disks which are kept at Paranal. Traveling Indicates that the disk it underway from the HQ to one of the NGAS Sites. It is not known to NGAS where the disk is.

Unknown A disk should never be permanently in this state. If there such entries are contained in the list, it should be investigated where these disks are located

and why they are not properly registered.

Disk Table: The Disk Label (also referred to as Logical Name) of the disk. By clicking on this link, Label the complete information stored in the NGAS DB for that disk is shown. Full Indicates to which degree the disk is full. This is given in percentage. Number of file stored on the disk. By clicking on this number, a page is displayed Number of Files listing information for all the files stored on the given disk. This field is 0 is the disk is not yet completed and 1 if NGAS has identified as being Completed completed. Slot The slot in the NGAS Host in which the disk is residing. This field gives the NGAS Host name and the mount point in case the disk is Online. In Status case the disk is not Online, this field indicates the location. Pressing the button it is possible to set the status/location of the disk manually. The page to update this field is shown further on in this section. Update This field must always be updated manually if the disk is kept offline for long-range storage or if it is shipped between the NGAS Sites.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 37 of 45

The page to update the disk status/location manually has the following lay-out:

The user should choose the proper value describing the status of the disk and press the “Update” button to update the value in the NGAS DB.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 38 of 45

5.5 NGAS Disk Status Form

The Disk Status Form is used to query and display information in an easy manner for disks inserted in a specific NGAS Host:

In the option menu the host should be chosen and the “Submit Query” button pressed to initiate the query. An example of the resulting page is shown in the following.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 39 of 45

Example of the Disk Status Table showing the disks currently Online in a specific NGAS Host:

Clicking on the Label link, more detailed information about the disk is displayed. Clicking on the link on the number of files, a table is displayed with more information about the files stored on the given disk.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 40 of 45

5.6 Host Status Tool

The Host Status Tool is used to display the status for each NGAS Host within the NGAS Site. It indicates amongst others, the state of the NG/AMS Server and the services provided by the NGAS Host:

The various fields/links in this page are:

Field/Link Description The host ID allocated to the NGAS Host. By clicking on the link, the complete Host ID information about the host is displayed. Domain Domain name. NG/AMS Port Port number used by the NG/AMS Server for receiving commands (HTTP requests). Cluster Name Name of the cluster within which context the host is operating. For individual hosts like Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 41 of 45

NAU, NBU and AHU, this is the name of the host itself. For other NGAS Hosts operating within a cluster, the Cluster Name is the name of the unit to be contacted to access the given host. Version of the NG/AMS SW. This field is updated when the NG/AMS Server goes NG/AMS Version Online. Indicates if the NGAS Host accepts Archive Requests. This field is updated when the NG/AMS Archive NG/AMS Server goes Online. Indicates if the NGAS Host accepts Retrieve Requests. This field is updated when the NG/AMS Retrieve NG/AMS Server goes Online. Indicates if the NGAS Host accepts Processing Requests. This field is updated when the NG/AMS Process NG/AMS Server goes Online. If a Data Consistency Checking is currently being executed on the NGAS Host this flag NG/AMS Checking is 1, otherwise if the Data Consistency Checking is inactive, this is 0. State of the NG/AMS Server. The server is operational if the state is Online, this should NG/AMS State be the normal value displayed. If the NGAS Server is suspended (to save power + spare the HW), this field is set to 1. Host Suspended Otherwise it is 0 (or empty). If the NGAS Host is suspended, this field indicates which NGAS Host is requested to Wake-Up Server wake up this host in case of incoming requests or if the requested Wake-Up Time is meat. Time a suspended NGAS Host ‘would like’ to be woken up by the given Wake-Up Wake-Up Time Server.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 42 of 45

5.7 Search for Archived Frames

The Search for Archived Frames Tool, makes it possible for the operator to look up information for specific files or sets of files:

To specify wildcards “%” should be done as indicated above.

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 43 of 45

The search above produced the following result page when executed:

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 44 of 45

5.8 NGAS Last Night Report

The Last Night Report Tool can be used to generate a report of the files archived within a given context:

The above defined query, generates the following status:

Doc: VLT-MAN-ESO-19400-3103 NGAS Operations & Troubleshooting Guide Issue: 1 ESO Date: 2003-07-28 Page: 45 of 45

By clicking on the link on the number of files, the list of files produced are shown and more detailed information can be obtained.

5.9 Last Frames Archived

This tool is very essential to check that NGAS is working properly during operation:

The explanation in the page should explain well enough how to interpret the fields below. The page is update automatically every 5 minutes,

It is extremely important to keep an eye on this page during operation every now and then.

This tool is only relevant if the data provider for NGAS is DHS since the tool is based on having the OLAS DB updated with the information about the files produced and archived.

5.10 Label Printer Tool

If it is needed to generate labels at the ESO HQ, please contact the NGAS Team for the link to the WEB page allowing to carry out this action.