Fault Tolerance Protection and RAID Technology for Networks: A Primer

51-30-25 DATA COMMUNICATIONS MANAGEMENT
Jeff Leventhal
10/97 Auerbach Publications © 1997 CRC Press LLC

PAYOFF IDEA
What can an organization do to increase server uptime and reduce, or even eliminate, network downtime? In many cases, a RAID system (a collection of disks in which data is copied onto multiple drives) is added to a network to speed access to mission-critical data and protect it in the event of a hard disk crash. This article discusses RAID technology and the use of fault-tolerance protection to preserve the availability and integrity of data stored on network servers.

INTRODUCTION
According to a recent Computer Reseller News/Gallup poll, most networks are down for at least 2 hours per week. The situation has not gotten any better for most companies in the past 3 years. If an organization has 1000 users per network, this equals one man-year per week of lost productivity. Even if a network is a fraction of that size, this number is imposing. For nearly a decade, many companies responded by deploying expensive fault-tolerant servers and peripherals.

Until the early 1990s, the fault-tolerant label was generally affixed to expensive and proprietary hardware systems for mainframes and minicomputers where the losses associated with a system’s downtime were costly. The advent of client/server computing created a market for similar products created for local area networks (LANs) because the cost of network downtime can similarly be economically devastating.

Network downtime can be caused by anything from a bad network card or a failed communication gateway to a tape drive failure or loss of a tape used for backing up critical data. The chances that a LAN may fail increase as more software applications, hardware components, and users are added to the network.

This article describes products that offer fault tolerance at the system hardware level and those that use fault-tolerant methods to protect the integrity of data stored on network servers. The discussion concludes with a set of guidelines to help communications managers select the right type of fault-tolerant solution for their network. This article also discusses RAID (redundant array of independent [formerly “inexpensive”] disks) technology, which is used to coordinate multiple disk drives to protect against loss of data availability if one of the drives fails.

DEFINING FAULT TOLERANCE
PC Week columnist Peter Coffee noted the proliferation of fault tolerance in vendor advertising and compiled a list of seven factors that define fault tolerance. Coffee’s list included safety, reliability, confidentiality, integrity, availability, trustworthiness, and correctness. Two of the factors, integrity and availability, can be defined as follows:

• Availability is expressed as the percentage of uptime and is related to reliability (which Coffee defined to be mean times between failures) because infinite time between failure would mean 100% availability. But when the inevitable occurs, and a failure does happen, how long does it take to get service back to normal?
• Integrity refers to keeping data intact (as opposed to keeping data secret). Fault tolerance may mean rigorous logging of transactions, or the capacity to reverse any action so that data can always be returned to a known good state.

This article uses Coffee’s descriptions of availability and integrity to distinguish between products that offer fault tolerance at the system hardware level and those that use fault-tolerant methods to protect the data stored on the network servers.

Availability
The proliferation of hardware products with fault-tolerant features may be attributable to the ease with which a vendor can package two or more copies of a hardware component in a system. Network servers are an example of this phenomenon.
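
As a rough illustration of the availability and downtime figures cited earlier, the following Python sketch works through the arithmetic. The function names, the 168-hour week, and the MTBF/MTTR ratio (a commonly used relation between mean time between failures and mean time to repair) are assumptions introduced here for illustration; the article itself gives only the 2-hours-per-week downtime figure and the 1000-user example.

```python
# Illustrative sketch only: names and the 168-hour week are assumptions,
# not figures taken from the poll cited in the article.
HOURS_PER_WEEK = 7 * 24  # assumes the network is expected to be up around the clock


def availability(uptime_hours, downtime_hours):
    """Availability as the fraction of total time the network was up."""
    return uptime_hours / (uptime_hours + downtime_hours)


def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Availability from mean time between failures and mean time to repair.

    Written as 1 / (1 + MTTR/MTBF) so that an infinite MTBF yields exactly 1.0,
    matching the remark that infinite time between failures means 100%.
    """
    return 1.0 / (1.0 + mttr_hours / mtbf_hours)


def lost_person_hours(users, downtime_hours):
    """Productivity lost when every user is idle for the whole outage."""
    return users * downtime_hours


downtime = 2.0  # hours per week, from the poll figure quoted in the article
weekly = availability(HOURS_PER_WEEK - downtime, downtime)
print(f"{weekly:.2%}")                    # 98.81%
print(lost_person_hours(1000, downtime))  # 2000.0 person-hours per week
```

The 2000 person-hours lost per week is roughly one working year for a single employee, which is where the one man-year per week figure comes from.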
Supercharged personal computers equipped with multiple power supplies, processors, and input/output (I/O) buses provide greater dependability in the event that one power supply, processor, or I/O controller fails. In this case, it is relatively easy to synchronize multiple copies of each component so that one mechanism takes over if its twin fails.

Cubix’s ERS/FT II. For example, Cubix’s ERS/FT II communications server has redundant, load-bearing, hot-swappable power supplies; multiple cooling fans; and failure alerts that notify the administrator audibly and through management software. The product’s Intelligent Environmental Sensor tracks fluctuations in voltage and temperature and transmits an alert if conditions exceed a safe operating range. A hung or failed system will not adversely affect any of the other processors in the system.

Vinca Corp.’s StandbyServer. Vinca Corp. has taken this supercharged PC/network server one step further by offering machines that duplicate any server on the network; if one crashes, an organization simply moves all its users to its twin. Vinca’s StandbyServer exemplifies this process, known as mirroring. However, mirroring has a significant drawback: if a software bug causes the primary server to crash, the same bug is likely to cause the secondary (mirrored) server also to crash. (Mirroring is an iteration of RAID technology, which is explained in greater detail later in this article.)

Network Integrity, Inc.’s LANtegrity. An innovative twist on the mirrored server, without its bug-sensitivity drawback, is Network Integrity’s LANtegrity product in which hard disks are not directly mirrored. Instead, there is a many-to-one relationship, similar to a RAID system, which has the advantage of lower hardware cost. LANtegrity handles backup by maintaining current and previous versions of all files in its Intelligent Data Vault.
The vault manages the most active files in disk storage and offloads the rest to the tape autoloader. Copies of files that were changed are made when LANtegrity polls the server every few minutes, and any file can be retrieved as needed. If the primary server fails, the system can be smoothly running again in about 15 seconds without rebooting. Because not all the software is replicated, any bugs that caused the first server to crash should not affect the second server.

NetFRAME Servers. The fault tolerance built into NetFRAME’s servers is attributable to its distributed, parallel software architecture. This fault tolerance allows the adding and changing of peripherals to be done without shutting down the server, allows for dynamic isolation and correction of I/O problems (which are prime downtime culprits), distributes the processing load between the I/O server and the central processing unit (CPU), and prevents driver failures from bringing down the CPU.

Compaq’s SMART. Many of Compaq’s PCs feature its SMART (Self-Monitoring Analysis and Reporting Technology) client technology, although it is limited to client hard drives. If a SMART client believes that a crash may occur on a hard disk drive, it begins backing up the hard drive to the NetWare file server backup device. The downside is that the software cannot predict disk failures that give off no warning signals or failures caused by the computer itself.

DIAL RAID FOR INTEGRITY
In each of the previous examples, the fault tolerance built into the systems is generally designed to preserve the availability of the hardware system. RAID is a technology that is probably the most popular means of ensuring the integrity of corporate data.

RAID (redundant arrays of independent disks) is a way of coordinating multiple disk drives to protect against loss of data availability if one of the drives fails.
RAID software:

• Presents the array’s storage capacity to the host computer as one or more virtual disks with the desired balance of cost, data availability, and I/O performance.
• Masks the array’s internal complexity from the host computer by transparently mapping its available storage capacity onto its member disks and converting I/O requests directed to virtual disks into operations on member disks.
• Recovers data from disk and path failures and provides continuous I/O service to the host computer.

RAID technology is based on work that originated at the University of California at Berkeley in the late 1980s. Researchers analyzed various performance, throughput, and data protection aspects of the different arrangements of disk drives and different redundancy algorithms. The following table describes the various RAID levels recognized by the RAID Advisory Board (RAB), which sets standards for the industry.

| RAID Level | Description | Benefits | Disadvantages |
|---|---|---|---|
| RAID 0 | Disk striping: data is written across multiple disk drives | Storage is maximized across all drives; features good performance and low price | Has virtually no fault tolerance |
| RAID 1 | Disk mirroring: data is copied from one drive to the next | Data redundancy is increased 100%; has fast read performance | Slower write performance; requires twice the disk drive capacity; more expensive |
| RAID 2 | Spreads redundant data across multiple disks; includes bit and parity data checking | Has no physical benefits | Has high overhead with no significant reliability |
| RAID 3 | Data striping at a bit level; requires a dedicated parity drive | Has increased fault tolerance and fast performance | Is limited to one write at a time |
| RAID 4 | Disk striping of data blocks; requires a dedicated parity drive | Has increased fault tolerance and fast read performance | Slower write performance; not used very much |
| RAID 5 | Disk striping of both data and parity information | Features increased fault tolerance and efficient performance; is very common | Write performance is slow |

The redundancy in RAID is achieved by dedicating parts of an array’s storage capacity to check data. Check data can be used to regenerate individual blocks of data from a failed disk as they are requested by the applications, or to reconstruct the entire contents of a failed disk to restore data protection after a failure.
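
The two uses of check data described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the check data is XOR parity held on a dedicated last disk (as in the RAID 3 and RAID 4 rows of the table); the article does not specify the algorithm, and the function names and the in-memory layout are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity (check data) block.

    Assumption: member disk i holds block i of every stripe, and the last
    member disk is a dedicated parity disk.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]


def regenerate(stripe, failed_disk):
    """Regenerate a single block of a failed disk from the surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_disk]
    return xor_blocks(survivors)


def reconstruct_disk(stripes, failed_disk):
    """Reconstruct the entire contents of a failed disk, stripe by stripe."""
    return [regenerate(stripe, failed_disk) for stripe in stripes]


# Three data disks plus one parity disk, two stripes deep.
stripes = [
    make_stripe([b"AAAA", b"BBBB", b"CCCC"]),
    make_stripe([b"DDDD", b"EEEE", b"FFFF"]),
]

# Disk 1 fails: serve one requested block on demand, then rebuild the disk.
print(regenerate(stripes[0], 1))     # b'BBBB'
print(reconstruct_disk(stripes, 1))  # [b'BBBB', b'EEEE']
```

Because XOR is its own inverse, the same routine rebuilds either a lost data block or a lost parity block. The sketch tolerates only one failed disk per stripe, which is the single-drive failure case the article discusses.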
