S1219 NRCS: Disconnect Recovery -- Dual System, A Disconnected (written 13Jun01)

Problem Description

I have a dual system (AB). The A computer has disconnected because of network troubles or a power failure. How do I recover and remirror my system?

Comments and Workaround

The following steps outline, in a nutshell, how to recover from a disconnect of the A computer, after the system has been reconfigured to run single on B and all users are running off B.

1) Select the A computer ONLY on the control console (probably the F2 or F3 key)

2) Become superuser (if you're not already)

3) Reboot the computer with the following command:

init 6

(You may be prompted to enter a reason for the reboot; you can choose something like "administrative".)

The computer will go through an automatic reboot and eventually wind up at the login prompt.

4) Log in as "so" (system operator). You will be at a question mark-colon prompt.

5) Become superuser

6) Since the database on the A computer is now out of mirror with the other server, it must be wiped clean before A can be reconnected to B. Initiate a diskclear on A ONLY:

diskclear -

You will be asked whether you wish to SAVE the database; the answer is No. You will then be asked if you are sure you want to clear it; the answer is Yes.

The diskclear will print a series of dots and numbers until the entire disk is cleared. This may take 30-60 minutes or so. Once the diskclear completes, you can proceed with the recovery process.

7) Select BOTH A and B on the console

8) Reconnect them to each other with the following command on both:

reconnect a master=b =ab

It will take a moment and then A will get its normal, named prompt.

9) Select the A computer ONLY

10) Kick off a slow diskcopy on A to copy the database back over to the diskcleared machine:

diskcopy -s

The diskcopy will run in the background as users continue working off B unaffected. It will probably take a number of hours to complete and will pop progress messages to the console as it goes.

The "status" command on A will report that the disk status is UNKNOWN. The disk status on A will stay UNKNOWN until the diskcopy completes, at which point it will change to OK.
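If the console output is being captured somewhere (for example, in a console log), the wait for the disk status to change can be sketched in Python as below. This is only an illustration: the helper names, the read_status function you supply, and the assumption that the status output contains the words "disk status is" followed by UNKNOWN or OK are all hypothetical and should be checked against what your own console prints.

```python
import time


def disk_status(status_text):
    """Return the word that follows 'disk status is' in captured status
    output, upper-cased, or None if no such line is found.
    The wording of the line is an assumption based on this article."""
    marker = "disk status is"
    for line in status_text.splitlines():
        lowered = line.lower()
        if marker in lowered:
            rest = lowered.split(marker, 1)[1].strip(" .")
            return rest.split()[0].upper() if rest else None
    return None


def wait_for_disk_ok(read_status, interval_s=600, max_checks=72, sleep=time.sleep):
    """Call read_status() (a caller-supplied, hypothetical function that
    returns the latest captured 'status' output from A) until the disk
    status is OK. Returns True on OK, False if max_checks is exhausted."""
    for attempt in range(max_checks):
        if disk_status(read_status()) == "OK":
            return True
        if attempt < max_checks - 1:
            sleep(interval_s)
    return False
```

With the defaults shown, the sketch checks every 10 minutes for up to 12 hours, which fits a diskcopy that takes a number of hours. It never sends anything to the console; it only reads text you hand it.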

Once it wraps up, the only step left to get everything back to full normalcy is to "split the load" and get the A devices, sessions, and servers ("processes") running back on the A machine, since all the processes are currently running off the B computer.

The EASIEST way to split the load is to wait until the diskcopy is done and then schedule a time when the users can be popped off and all devices stopped. This can be done tomorrow, or whenever convenient. Here are the steps:

11) Wait for the diskcopy to complete

12) Check the status on all computers:

status

They should all agree that:

System is AB

This indicates they are still connected to each other and operating as a dual system. Both should report that Disk Status is OK.
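The check in step 12 can also be written down as a small Python sketch that looks at the status output captured from each computer and lists anything that does not match. The function name and the exact wording it searches for ("System is AB", "Disk Status is OK") are assumptions taken from the text above, not a documented output format.

```python
import re


def check_dual_status(status_by_computer, expected_system="AB"):
    """Return a list of problems found in captured 'status' output.
    status_by_computer maps a computer name (e.g. "A") to the text that
    'status' printed on it. An empty list means every computer agrees."""
    system_re = re.compile(r"\bsystem is " + re.escape(expected_system.lower()) + r"\b")
    disk_re = re.compile(r"\bdisk status is ok\b")
    problems = []
    for name, text in sorted(status_by_computer.items()):
        # Lower-case and collapse whitespace so spacing and case do not matter.
        normalized = " ".join(text.lower().split())
        if not system_re.search(normalized):
            problems.append(f"{name}: does not report System is {expected_system}")
        if not disk_re.search(normalized):
            problems.append(f"{name}: Disk Status is not OK")
    return problems
```

If the returned list is not empty, do not go on to split the load; wait for the diskcopy to finish or investigate the connection first.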

13) Select the B computer ONLY (the following steps, 14 through 21, are all run on B)

14) Become superuser if your prompt does not already end in a pound sign

15) Take the system offline to prevent new logins:

offline

16) Let the users know it's going down with a broadcast:

broadcast System maintenance in 30 seconds. Logout IMMEDIATELY!

17) Log the users out:

logout all

18) Stop all the devices:

stop all

19) Reconfigure the system with the following command:

configure

Wait until you see the "system being configured" messages.

20) Restart all the devices on B:

restart all

21) Place B back online:

online

You can hit a return at any time to test for the prompt on B to see if you can put it online. Once you see the Hot-to-go messages scrolling, the prompt should be there.

22) Select the diskcleared/diskcopied machine ONLY (the A computer ONLY)

23) Start up the A computer:

startup

The A has not been started up since it came up from its reboot. This will restart all the devices on A and place it online.

That's it, you're back to normal.

It is not strictly necessary to log out everyone and stop everything at steps 17 and 18, since you only need to log out and stop the A computer's processes. On a very large system, however, it is usually quicker to get everyone off and stop all the devices than to hunt and peck just the A processes off of B.
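For a scheduled maintenance window, it may help to have the split-the-load sequence (steps 13 through 23) on a printed checklist. The Python sketch below only formats such a checklist; it does not talk to the console. The commands and their order are copied from the steps above, while the names SPLIT_LOAD_PLAN and render_checklist are made up for this illustration.

```python
# Each entry is (computer to select on the console, command to type there).
SPLIT_LOAD_PLAN = [
    ("B", "offline"),
    ("B", "broadcast System maintenance in 30 seconds. Logout IMMEDIATELY!"),
    ("B", "logout all"),
    ("B", "stop all"),
    ("B", "configure"),
    ("B", "restart all"),
    ("B", "online"),
    ("A", "startup"),
]


def render_checklist(plan):
    """Return the plan as text, with a 'Select ...' line each time the
    selected computer changes and a tick box in front of every command."""
    lines = []
    current = None
    for computer, command in plan:
        if computer != current:
            lines.append(f"Select the {computer} computer ONLY")
            current = computer
        lines.append(f"  [ ] {command}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_checklist(SPLIT_LOAD_PLAN))
```

Keeping the computer name next to each command makes it harder to type a B command on A by mistake, which is the main risk in this part of the procedure.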