Mission Critical Systems - RAID & Clusters
Presenter:  Peter M DeVita, P.Eng. of DeVita Associates,
(with support from Sanjay Sehgal, Prod Mgr of American Megatrends)
Phone:          905-784-4400



Abstract:
RAID - a new kind of bug spray for computers or what?
Given the complexity of today's computers, it is a wonder that they run at all.  Yet users require and demand flawless operation.  The nature of hardware is that it WILL fail at some point in time.  (Even software has bugs, known and unknown.)  These are facts that cannot be changed; however, we can do something to mitigate the damage.
Mechanical devices like hard discs are the most prone to failure.  Hence, a system's up time can be significantly increased by concentrating on what to do when there is a failure.  This is the essence of a RAID array of discs (Redundant Array of Independent Discs).  There are various levels of RAID, but essentially the idea is to write redundant data to two or more discs.  If data is lost in one location, the second can be used to recover the original data.  RAID 0, 1, 5 and 50 will be discussed as some of the more popular arrangements.  AMI's RAID controllers will be introduced as one of the industry standard products used by many OEMs today.
Clustered systems build on RAID technology and apply these ideas to their logical extension: a whole system acting as a redundant partner to a second system.  The latest AMI products in this field will be discussed as examples of Clustering solutions in a Quad Pentium II system for high end PC, mission critical applications.
 

Contents:

Abstract:
1. Why RAID it?
2. RAID Levels - Level 1
3. RAID-0: Striping
4. RAID-5 
5. RAID-50 
6) Uptime and MTTR
7) System Improvement
    A) Hot Swap (Disc)
    B) Hot Spare
    C) Duplexing 
    D) Battery Backed-UP Cache
    E) Redundant Power Supply 
    F) Hot Swap Power
    G) UPS
    H) Tape Backup
    I)  Hot Swap Fans.
    J) System Health Monitoring. 
8) "Remote Assistant Components"
9) Clustering
    Hardware Considerations
    AMI Components 
    DeVita Associates System - Rack Mount
10) Summary/Conclusions 
11) About the Author and Company:

 
1. Why RAID it?

Have you ever had a system crash?  How about a power failure in the middle of saving an important piece of work?

Weaknesses in computer hardware, environmental stresses or just plain software bugs can cause loss of important data.  The consequences vary enormously.  The home user may be annoyed at needing to re-enter a few hours of work.  But the bank that has just electronically transferred a billion dollars is going to have a very feverish concern about a failure during the transfer.  There are many examples dealing with funds, private information like medical records, or security data in the military, in which the sudden loss of data is catastrophic.

There is no doubt that in time, as costs decrease, every computer sold will have some level of RAID.  It will, unnoticed by the user, simply correct and recover data.  For now, only the more "mission critical" applications can justify the expense.

Our objective is to provide data protection.  This is usually talked about in terms of recovering from any single point of failure.
 

 
2. RAID Levels - Level 1

Since the hard disc is the main data storage device, it only makes sense to look at how we can make this component of the computer more reliable.  The manufacturers of hard disc drives have certainly assisted in this mission by improving their production techniques.  It is now common to see MTBF (Mean Time Between Failures) at the 1 million hour level, compared to 20,000 hours at the beginning of the PC era.  Nevertheless, 'mean times' are based on statistics and the probability of failure.  For any single disc, failure in all or part of the disc is a random event.  What happens when a failure occurs?

RAID (Redundant Array of Independent Discs) provides the answer to this question.  In a RAID 1 array (level 1) two discs mirror each other.  In essence, data is written  to both discs at the same time.  If one disc crashes, the other disc can be used to recover and thereby reconstruct the data on the failed disc.
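
To make the mirroring mechanics concrete, here is a minimal sketch in Python (purely illustrative; the in-memory "discs" and method names are our own invention, not any controller's interface):

    # RAID 1 sketch: two mirrored in-memory "discs".
    class Raid1:
        def __init__(self):
            self.discs = [{}, {}]          # block number -> data, mirrored
            self.failed = [False, False]   # simulated failure flags

        def write(self, block, data):
            for d in range(2):
                if not self.failed[d]:
                    self.discs[d][block] = data   # same data to both discs

        def read(self, block):
            for d in range(2):
                if not self.failed[d] and block in self.discs[d]:
                    return self.discs[d][block]   # surviving mirror answers
            raise IOError("both mirrors lost")

        def rebuild(self, d):
            # Reconstruct a replaced disc from its healthy mirror.
            self.discs[d] = dict(self.discs[1 - d])
            self.failed[d] = False

A failed disc is simply marked bad; reads continue from the mirror until a rebuild copies the data back.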

RAID arrays are typically built with SCSI controllers.  The firmware on the controller can detect the occurrence of a failure and take some action, like sounding an audible alarm, to notify operators.  The MegaRAID controller from AMI behaves in this manner.  We will use this device to illustrate practical implementations of the concepts.

There are other RAID levels.  We will discuss levels 0, 5 and 50.

 
3. RAID-0: Striping

RAID 0 provides a method for striping data across an array of discs.  Assuming an array of 4 discs, the controller in RAID 0 operation will split the data into four sections and write the data across all four discs.  Striping does not improve the reliability of the data, but it does improve performance.  This is due to the fact that semiconductor speeds are generally several orders of magnitude faster than a mechanical device like a hard disc.  Hence, the SCSI controller can easily cycle a write request to each hard disc and be back to the first disc before the disc heads are aligned.  At a system level, it appears that the striped data is written to all four discs in parallel.  Of course there are limits.

For example, a MegaRAID 3-channel controller can handle up to 45 hard discs on its own.  [As an aside, up to 6 such controllers can be installed in a system, which opens the door to very large and very fast disc arrays.]  With Ultra-2 SCSI, each channel can stream data at a maximum of 80 MB/s.  The PCI (V2.1) bus has a capacity of 266 MB/s, so the three channels of a single controller (up to 240 MB/s combined) can come close to saturating the bus.

The option of RAID 0 opens the possibility of tuning the performance to an optimum in which bandwidth is matched from disc controller to PCI bus to memory.
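
To make the striping mechanics concrete, here is a minimal Python sketch (the 4 KB stripe unit and the in-memory "discs" are our own illustrative assumptions, not MegaRAID defaults):

    # RAID 0 sketch: deal fixed-size chunks round-robin across four "discs".
    STRIPE = 4096  # assumed stripe-unit size in bytes

    def stripe_write(data, discs):
        """Deal successive chunks of data across the discs in turn."""
        chunks = [data[i:i + STRIPE] for i in range(0, len(data), STRIPE)]
        for n, chunk in enumerate(chunks):
            discs[n % len(discs)].append(chunk)   # chunk n lands on disc n % 4

    def stripe_read(discs):
        """Reassemble by pulling chunks back in the same round-robin order."""
        total = sum(len(d) for d in discs)
        return b"".join(discs[n % len(discs)][n // len(discs)]
                        for n in range(total))

    discs = [[], [], [], []]                  # four empty "discs"
    stripe_write(b"x" * 20000, discs)         # 5 chunks spread over 4 discs
    assert stripe_read(discs) == b"x" * 20000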
 
4. RAID-5

This level of RAID is the most popular for 'transaction' processing (many small bursts of data versus one long continuous stream), which is the common data exchange found in a commercial office.  In a RAID 5 array data is striped, and a parity block is computed for each stripe.  With four discs, the data would be striped across three discs and the parity stored on the fourth (RAID 5 rotates which disc holds the parity from stripe to stripe).  For each write of data, the parity is recomputed and saved.

If a data track or an entire disc fails, the controller will detect the failure on reading back the data and checking it against the parity.  Lost data is reconstructed from the surviving discs and saved properly.  All this happens transparently to the user.

Of course there are more complexities.  In rewriting an existing file the striped data must first be read, modified with the new data, a new parity computed, and then written back to the discs.  This takes more time than a straight write.  Fortunately, the effects of striping help keep RAID 5 arrays at high speeds.  Reads will be faster than on a single disc, and since typical workloads have more reads than writes, the RAID 5 array generally shows improved performance.  Fundamentally, RAID 5 provides high reliability by allowing continued operation after a single point of failure (one hard disc).
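
The parity arithmetic itself is just a byte-wise XOR over the blocks of a stripe; the following Python sketch (block contents are illustrative) shows why any single lost block can be rebuilt from the survivors:

    # RAID 5 sketch: XOR parity over one stripe of three data blocks.
    def xor_blocks(blocks):
        """Byte-wise XOR of equal-length blocks."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC"]   # one stripe across three data discs
    parity = xor_blocks(data)            # stored on the fourth disc

    # Lose disc 1; XOR of the survivors and the parity recovers its block.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]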

 
5. RAID-50

RAID 5 arrays can be made very large.  Six controllers with 45 discs each allows up to 270 discs!  The chances of a single failure increase as the number of discs increases.  A RAID 50 array allows an improvement in both performance and reliability.

Under RAID 50, the bank of hard discs is divided into 2 groups of RAID 5 arrays.  For example, 8 discs would be divided into 2 groups of 4 discs, with each sub-group logically organized as a RAID 5 array.  Data is then striped across the 2 banks.

This organization delivers a performance boost from the Striping.  Since we now have 2 RAID 5 arrays, we can tolerate a single point of failure in each array.

RAID 50 begins to show the complexities of the firmware in a RAID controller.  The entire bank of drives is seen as one logical drive by the user.  The RAID firmware provides the system manager with the ability to manipulate physical drives into the logical configuration desired.  It is also the firmware's responsibility to notify the system manager when a disc error occurs.
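
How a logical block is routed under RAID 50 can be pictured as two mapping steps, sketched below in Python (the group count matches the 8-disc example above, but the mapping itself is our simplification, not the firmware's actual algorithm):

    # RAID 50 sketch: stripe between two RAID 5 groups, then let each
    # group place the block internally.
    GROUPS = 2            # two RAID 5 sub-arrays
    DISCS_PER_GROUP = 4   # 3 data + 1 parity each, as in the 8-disc example

    def locate(logical_block):
        """Return (group, block-within-group) for a logical block number."""
        group = logical_block % GROUPS    # RAID 0 striping between groups
        inner = logical_block // GROUPS   # handled by that group's RAID 5
        return group, inner

    for blk in range(6):
        print(blk, "->", locate(blk))     # alternates between the two groups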
 
6) Uptime and MTTR
Now that basic RAID functions are covered, we look to the design of systems to improve 'Up time'.  This is the time that the system is not down or off.  (Ideally, 24 hours/day, 7 days/week.)  Even though a RAID array (1 or 5) can give us extra protection for our data, we eventually need to stop the system to replace a failed disc.  The time taken to replace the disc becomes part of our 'Mean Time To Repair' (MTTR).  The longer the MTTR, the longer the user is without computer service.
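
The trade-off can be made explicit with the standard availability relationship, Availability = MTBF / (MTBF + MTTR).  A small Python illustration (the repair times are assumed figures; the 1,000,000 hour MTBF is from Section 2):

    def availability(mtbf_hours, mttr_hours):
        # Fraction of time the system is up, given mean failure/repair times.
        return mtbf_hours / (mtbf_hours + mttr_hours)

    print(availability(1_000_000, 4.0))   # 4 hr disc swap  -> ~0.999996
    print(availability(1_000_000, 0.1))   # 6 min hot swap  -> ~0.9999999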

The balance of this paper is aimed at providing system level improvements that assist in the reliability and up time of a system (usually a server).
 
7) System Improvement
Below we discuss the following elements:
a) Hot swap
b) Hot spare
c) Duplexing
d) Battery backed up cache
e) Redundant power supplies
f) Hot swap power supplies
g) UPS
h) Tape backup
i) Hot swap fans
j) System health monitoring

We will then discuss Clustering and, finally, the concept of remote analysis as implemented in the MegaRAC product.

 
A) Hot Swap (Disc)
The MegaRAID controller will support the removal of a failed hard disc and the installation of a new disc.  The firmware will allow the rebuild of the array.  With a software product called MegaFlex, the array configuration and rebuild can be controlled by the system administrator while Windows NT is running.  These changes can all be done while servicing the users.

At a system level, in hardware, one cannot simply open the system and disconnect the power and SCSI cables without problems.  Hence, discs must be housed in 'docking kits' which provide a key-switch power disconnect.  We also like to use docking kits with SCSI ID setting control.  The SCSI cable should be terminated with an external terminator on the cable, not with the last disc in the chain.  This avoids the problem of removing the disc that carries the SCSI termination.

With a docking tray the MTTR is very short, in that the system need not be opened.  The docking tray is simply unplugged from the front and a new drive inserted.

A spare disc and docking tray will improve the swap time even more.
 
B) Hot Spare
If you have gone to the expense of a spare disc and tray, this can be kept in the system with power but no activity.  The MegaRAID firmware will identify the spare as a Hot Spare.  In case of a disc failure, the Hot Spare will be brought on-line and the failed disc removed from the array automatically.  The operator will be notified so that the failed disc can be replaced with a new Hot Spare.

Notice that even though time will be needed to replace the hard disc, the system has never stopped.  The MTTR does not hurt system up time or users' ability to access data.
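
The promotion logic amounts to swapping in the powered, idle spare and starting a rebuild.  A minimal Python sketch (class and function names are illustrative, not MegaRAID firmware calls):

    # Hot-spare sketch: promote a spare when a disc fails, alert the operator.
    class Disc:
        def __init__(self, name):
            self.name, self.failed = name, False

    def on_disc_failure(array, spares, notify=print):
        for i, disc in enumerate(array):
            if disc.failed:
                if not spares:
                    notify("disc %s failed, no hot spare left" % disc.name)
                    continue
                array[i] = spares.pop()   # spare brought on-line automatically
                notify("rebuilding onto %s; replace %s with a new spare"
                       % (array[i].name, disc.name))

    array = [Disc("d0"), Disc("d1"), Disc("d2")]
    spares = [Disc("spare0")]
    array[1].failed = True
    on_disc_failure(array, spares)   # users are served throughout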

 
C) Duplexing
It is possible that other hardware will fail.  The most likely failures are fundamentally associated with high energy, heat and the electric motors that create spikes (i.e., the discs).  Hence the disc controllers and power supplies command our attention next.

Duplexing is the ability for 2 controllers to control the same RAID array in a redundant mode.  Should one controller fail, the second takes over operation.

As with hot swap discs it is possible to also have hot swap controllers. In this case, the motherboard must support the ability to hot swap.  Also, since the controller is at one end of the cable, the SCSI termination needs to be properly handled.
 
D) Battery Backed-UP Cache

RAID controllers like the MegaRAID house cache memory to improve performance (32 to 64 MB typical size).  It is possible for a power failure to occur in the middle of a disc write.  In such a case the operating system will believe that it has written the data successfully; however, the data is in cache memory and has not reached the disc.  By supplying a battery to keep the cache alive, the data is preserved in this memory.

Without this feature, on restart the system will detect corrupt data in the file, and the data will be lost.  With the battery backup, the controller has the smarts to resume operation at the point it was stopped and thereby complete the write operation properly.
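
The mechanism can be pictured as a battery-preserved log of writes that have not yet reached the disc, replayed on restart; a purely illustrative Python sketch:

    # Write-back cache sketch: pending writes survive a power loss and
    # are replayed so the interrupted write completes.
    class WriteBackCache:
        def __init__(self):
            self.pending = []   # battery-backed memory: (block, data) pairs
            self.disc = {}

        def write(self, block, data):
            self.pending.append((block, data))   # OS sees success here

        def flush_one(self):
            if self.pending:
                block, data = self.pending.pop(0)
                self.disc[block] = data          # data actually reaches disc

        def replay_after_power_loss(self):
            # The battery kept self.pending alive; finish interrupted writes.
            while self.pending:
                self.flush_one()

    cache = WriteBackCache()
    cache.write(7, b"payroll")        # power fails before the flush...
    cache.replay_after_power_loss()   # ...yet the write still completes
    assert cache.disc[7] == b"payroll"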
 
E) Redundant Power Supply
This refers to 2 or more power supplies acting in parallel to share a load.  However, one cannot simply tie all the output lines together and expect proper operation.  A modern switching power supply is designed to operate with a load; special circuitry is required under no load to prevent the unit from burning itself out.  Connecting a "negative" load (another power supply) makes this even worse.  Hence, a 'switch' is required to allow these power supplies to connect.  Unfortunately, the switcher is generally far more expensive than the power supplies themselves.
 
F) Hot Swap Power

As well as redundant power, it is useful to have hot swap power.  Hence, if a failure occurs, the failed supply can be replaced without shutting down the system.

The MegaPlex System from AMI shows 3 power supplies in a redundant hot swap arrangement.
 
G) UPS
Power failures are well beyond our individual control.  However, in critical applications, (say a heart monitoring system) a UPS provides a stopgap power bridge.  Typically, from 20 to 60 minutes of time is provided.  This should be enough time to either bring up the generator or to log off all users and (gracefully) shut down.
 
H) Tape Backup

For the sake of completeness, truly critical data should always have a secondary backup.  Tape drives or even read/write CDs are good choices.
 
I) Hot Swap Fans
If a system overheats, permanent damage can occur to the CPU, memory, and other components.  Proper air flow is essential for system cooling.  It is possible (again using the MegaFlex system example) to have hot swap fans which can be replaced, if failed, without a system power down.
 
J) System Health Monitoring
System boards today, supported by the BIOS, can monitor a series of parameters.  Typically:
I) internal case temperature
II) CPU temperature
III) System fans
IV) CPU fan
V) Voltage levels
VI) System intrusion

By monitoring these, system failure can be minimized and preventative maintenance action can be taken early, before a permanent failure and before any user is prevented from doing their work.
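
Such monitoring reduces to simple threshold checks.  A Python sketch (the sensor names and limits are assumptions for illustration, not a particular BIOS interface):

    # Health-monitor sketch: compare readings against warning thresholds.
    LIMITS = {"case_temp_C": 45, "cpu_temp_C": 70,
              "cpu_fan_rpm_min": 2000, "v5_rail_min": 4.75}

    def check(readings):
        """Return warnings so maintenance happens before a hard failure."""
        warnings = []
        if readings["case_temp_C"] > LIMITS["case_temp_C"]:
            warnings.append("case temperature high")
        if readings["cpu_temp_C"] > LIMITS["cpu_temp_C"]:
            warnings.append("CPU temperature high")
        if readings["cpu_fan_rpm"] < LIMITS["cpu_fan_rpm_min"]:
            warnings.append("CPU fan slow or stopped")
        if readings["v5_rail"] < LIMITS["v5_rail_min"]:
            warnings.append("+5V rail sagging")
        return warnings

    print(check({"case_temp_C": 41, "cpu_temp_C": 75,
                 "cpu_fan_rpm": 1500, "v5_rail": 4.9}))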

It is possible to monitor these parameters remotely (using the MegaRAC as an example) so that proper support can be called in as required.
 
8) "Remote Assistant Components"
At this point it is worth briefly mentioning this new trend in computing.  The ability to respond before a failure occurs is the ultimate in preventative maintenance.  The previous section has already discussed key system health parameters that can provide early warning information.

RAID systems will also have discs that most likely fail gradually.  Tracks and sectors begin to have problems before an entire disc is lost.  By monitoring these soft failures we can determine when a disc is going bad.

As one example, the General Alert module, an option with AMI's MegaRAID, provides the ability for the system to initiate a call to inform a system manager about an array problem.  This could be a simple audible alarm or an e-mail with an error message and system ID.
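
An e-mail alert of this kind is straightforward to script.  A minimal Python sketch using the standard smtplib module (the mail host, addresses, and message text are placeholders):

    # Alert sketch: e-mail the system manager an error message and system ID.
    import smtplib

    def send_array_alert(error_msg, system_id):
        body = ("From: raid-monitor@example.com\r\n"
                "To: sysadmin@example.com\r\n"
                "Subject: RAID alert on %s\r\n\r\n%s\r\n"
                % (system_id, error_msg))
        with smtplib.SMTP("mail.example.com") as smtp:
            smtp.sendmail("raid-monitor@example.com",
                          ["sysadmin@example.com"], body)

    # send_array_alert("disc 2 failed, hot spare rebuilding", "server-01")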

Another example is the new MegaRAC controller.  This is an intelligent card based on the Intel i960 RISC processor and a proprietary ASIC created by AMI.  This board is battery backed up, so even during a general power failure it can report remotely.  As well as all the system health parameters, the MegaRAC has built-in diagnostic ability.  The boot process can be monitored remotely (for all POST codes).  The video stream is captured and can be viewed remotely.  Keyboard input can be provided remotely to walk a system through a boot process.  Also, a watchdog timer is provided to reboot a system that has locked up.

In short, system health can be monitored remotely to provide early warning and response.  As well, stop gap measures can be exercised remotely while a technical person travels to the system site.
 
9) Clustering
Our last stop will be the concept of clustering.  In broad terms, this refers to a group of computers.  More precisely, it refers to an old concept used by mainframes and super mini computers, generally supported by UNIX style operating systems.  Clusters can be logically thought of as a group of servers that share data to provide what Microsoft refers to as "availability".  This simply means that reliable data is always available.  A cluster extends the concept of a RAID array of drives to a redundant array of systems.

In the past, clustered systems were very expensive, running into the millions of dollars.  With the introduction of Microsoft's Phase 1 cluster software for Windows NT (Enterprise Edition) and AMI's complementary hardware, we now have the ability to provide cluster servers at the PC level.

Compaq, Digital, HP, IBM, Intel, NCR and Tandem are all involved with Microsoft's initiative.  AMI is the first to have a cluster kit, certified by Microsoft, providing the essential components for a cluster system.

This discussion will be limited to Phase 1 of the cluster software code-named "Wolfpack" by Microsoft.

Microsoft refers to Pfister's definition of a cluster as:
"A parallel or distributed system that consists of a collection of interconnected whole computers, that is utilized as a single, unified computing resource."

Microsoft also states that " the goal of a cluster is to make it possible to share a computing load over several systems without either the users or system administrators needing to know that more than one system is involved."

The Wolfpack Phase 1 cluster uses a "Shared Nothing" organization.

This means that any resource is owned and controlled by only one system in the cluster at any one time.  A system in a cluster is referred to as a Node.  In this type of organization, resources are owned by a node.  When a failure occurs, the failing node loses control of all its resources to the failover node.  In this way, intricate sharing schemes are avoided.
 

When a failure occurs in a Node, a Failover occurs to the remaining node, moving ownership and control of resources from the failed node.  The failure is transparent to the user.

On recovery (repair of the failed node) a Failback occurs, restoring the Cluster.
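
The ownership bookkeeping behind Failover and Failback can be sketched in a few lines of Python (node and resource names are illustrative; this two-node simplification is ours, not Wolfpack's actual implementation):

    # Shared-nothing sketch: each resource is owned by exactly one node.
    class Cluster:
        def __init__(self, nodes, resources):
            self.nodes = set(nodes)
            self.owner = {r: nodes[0] for r in resources}  # initial owner

        def failover(self, failed_node):
            self.nodes.discard(failed_node)
            survivor = next(iter(self.nodes))    # the remaining node
            for res, node in self.owner.items():
                if node == failed_node:
                    self.owner[res] = survivor   # survivor takes ownership

        def failback(self, repaired_node, resources):
            self.nodes.add(repaired_node)
            for res in resources:
                self.owner[res] = repaired_node  # restore the original layout

    c = Cluster(["node_a", "node_b"], ["disc_set", "ip_addr", "file_share"])
    c.failover("node_a")                  # users keep working against node_b
    c.failback("node_a", ["disc_set"])    # after repair, node_a resumes duty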
 
Hardware Considerations

From the hardware point of view, all system components must be redundant so that there are no single points of failure.  The general architecture is shown in the diagram.  In essence, two full systems are connected in a way that lets either control a RAID 1 array of discs.

The RAID 1 array is a mirrored array as previously described and, therefore, has built in redundancy.

Furthermore, the array uses a dual bus architecture so that we can even lose a cable without going down.

Each node must have a RAID controller card.  In our example, the MegaRAID '428' controller is used.  The controller has been modified to be Cluster aware.  Hence, when one controller sees another controller on its SCSI bus, it is able to identify its counterpart in a Cluster.  The system motherboard must also have modifications to be cluster aware.

One important consideration in the RAID Cluster structure is the SCSI bus termination.  The bus must remain properly terminated (to allow reliable access to the data) at all times.  A cable removal, system removal or power failure on one node must not interfere with reliable access.  This is accomplished with the AMI bus extender/terminators.

These are the essential components in a Cluster.  These, of course, require the Windows NT Cluster software to complete the system's operation.

A final note on the hardware configuration: dual UPSs, each tied to a separately fused AC line, should be used.  This completes the full redundancy of the system.

 
AMI Components

AMI has created a cluster kit consisting of a pair of RAID controllers made Cluster ready, as well as the RAID storage unit and components to ensure full redundancy and proper SCSI termination on the failure of any single component.
 

 
DeVita Associates System - Rack Mount
Cluster Systems require cluster ready components, including the motherboard.  Also, key operating software components must be cluster aware; if they are not, a method must be found to make them cluster ready.  Given the complexity of making a Cluster system work, DeVita Associates has decided to offer complete turn-key systems using the AMI components and the operating system from Microsoft.

Below is our Rack Mount Cluster system, which packages a full Cluster in one convenient unit.  The main computers are based on either a dual P2 motherboard or a Quad PPro system.  A Quad P2 system is also available but requires different packaging to accommodate the unique physical features of the Quad P2 board.

Other options include a second storage rack with all the accompanying cabling.  The customer may also choose the size and quantity of hard disc drives required.  As an added point of redundancy on the network, we also offer dual network cards for the global net [plus a NIC for the inter-node (local) LAN connection] along with an Adaptive Switch.  This arrangement maintains connectivity even with a NIC failure.  Indeed, up to 4 extra NICs may be added.  Each NIC will share the total load.  On a NIC failure, the remaining NICs will adapt and share the total load without any loss to the user.
 

Rack layout (top to bottom):
Space
UPS #2
System #2
Cluster Storage Rack #2 (optional)
Space for air flow
Cluster Storage Rack #1
System #1
UPS #1
 

Notes:
1. System options include MegaRUM (dual P2) based systems or Goliath-2 (Quad PPro) based systems.
2. 19" Rack Mount unit comes with front door thermal exchange unit.
 
 

 
10) Summary/Conclusions

This paper has briefly covered developments in PC systems today that deal with reliable, mission critical systems.  More and more applications continue to be added to computers.  The Internet has become a backbone of commerce, and cutting that vital link can effectively shut down an organization.  Computers control the link to the Internet highways.  This is one universal example that will drive the need for 100% up time and, therefore, the desire to have RAID and Cluster systems as essential, commonplace components of an enterprise-wide network of computers.

 
11) About the Author and Company:
Peter M DeVita, MASc, MBA, P.Eng is a graduate of the UofT Faculty of Engineering.  He is a member of the PEO and is currently active on its Council as its elected VP.  Eng. DeVita's MBA is from York U.  Eng. DeVita is part owner of DeVita Associates.  The firm has been associated with AMI since 1986 and represents AMI in Canada as an agent and a value-added distributor for all of AMI's products, including the BIOS, diagnostic tools, motherboards, RAID controllers and Cluster Systems.  As well, DeVita Associates integrates non-desktop solutions using AMI components and components from American Advantech, a supplier of 19" rack mount components and real-time data acquisition products.  DeVita Associates customizes hardware integrated with customers' software of choice.
American Megatrends recently certified Eng. DeVita as a Cluster Specialist on AMI cluster systems and components. 


Peter DeVita
Sept. 27, 1998


DeVita Associates, 250 Harding Blvd, Box: 32228, Richmond Hill, Ontario L4C-9S3
Tel: (905) 784-4400, Toll Free: (888)-523-9105  Fax:  (905)784-4401 Email: