Mission-critical reliability


Availability is a measure of performance for repairable systems that are intended to operate continuously to satisfy a mission. In short, availability is the probability that a system will be operating satisfactorily at some random time in the future when subject to a sequence of "up/down" cycles. Mission-critical availability, then, refers to the availability of a system deemed critical to mission success, an obviously necessary trait for key servers in corporate and government environments.

Since mission-critical elements of a project cannot be delayed without delaying the overall project, servers which support these key elements must have very high availability. A server with 99.9% uptime is on the low end of mission-critical availability. When discussing the availability of servers, the "five nines" rule is often referenced, meaning 99.999% availability. In practical terms this amounts to about five minutes of downtime per year.
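
To make these figures concrete, the downtime budget implied by each availability level can be computed directly. The following is a small Python sketch; the 365.25-day year is simply an assumption of the calculation:

# Annual downtime implied by a given availability level, assuming a 365.25-day year.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.3%} uptime -> {downtime:7.1f} minutes of downtime per year")

At "five nines" this works out to roughly 5.3 minutes per year; at 99.9% it is closer to nine hours.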

Concepts of System Availability

There are several ways system availability can be described. These are:

  • Instantaneous Availability
  • Mission Availability
  • Steady State Availability
  • Intrinsic Availability
  • Achieved Availability

Of these, the last two are of greatest interest to the system owner.

Intrinsic availability is primarily a function of basic system design. It can be viewed as the "best achievable" for a given system and is the version of this metric discussed during system design. Mathematically, intrinsic availability is:

Ai = MTBF / (MTBF + MTTR)
where MTBF = Mean Time Between Failure
      MTTR = Mean Time to Repair

So it is easy to see from this that availability is a function of system reliability and system maintainability. Intrinsic availability does not consider administrative or logistics time associated with repair. Preventive maintenance time is also not considered when discussing intrinsic availability.

Achieved availability is the availability that the system owner experiences in use. Mathematically, achieved availability is:

Aa = 1 - (Downtime/Total Time)

In this case, downtime includes administrative, logistic, and preventive maintenance time.
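
As a quick illustration of the difference between the two metrics, the sketch below computes both. The 8,000-hour MTBF, 2-hour MTTR, and the administrative and preventive-maintenance allowances are hypothetical figures chosen for the example, not values from any particular system:

def intrinsic_availability(mtbf, mttr):
    # Ai = MTBF / (MTBF + MTTR): counts active repair time only
    return mtbf / (mtbf + mttr)

def achieved_availability(downtime, total_time):
    # Aa = 1 - downtime / total_time, where downtime also includes
    # administrative, logistic, and preventive-maintenance time
    return 1 - downtime / total_time

mtbf, mttr = 8000.0, 2.0                      # hours, hypothetical server figures
print("intrinsic:", intrinsic_availability(mtbf, mttr))              # ~0.99975

# Over one ~8766-hour year: one repair (2 h) plus an assumed 4 h of
# administrative/logistic delay and 6 h of scheduled preventive maintenance.
print("achieved: ", achieved_availability(2.0 + 4.0 + 6.0, 8766.0))  # ~0.99863

Achieved availability is always at or below intrinsic availability, since it charges the system for every hour it is not operating, whatever the reason.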

R.F. Drenick showed in 1960 that, in the long run, system failure times tend toward the exponential distribution as system complexity increases. Drenick's Theorem is often used to simplify availability analysis, as it allows the use of a constant MTBF over the life of a system.
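
A rough numerical illustration of this effect: assume fifty components with clearly non-exponential (Weibull) lifetimes and instantaneous repair. The inter-failure times of any single component have a coefficient of variation well below 1, but the pooled failure stream of the whole system has a coefficient of variation close to 1, the signature of an exponential distribution:

import numpy as np

def failure_epochs(n_events, beta, eta, rng):
    # Successive failure times of one component with Weibull(beta, eta) lifetimes
    # and instantaneous repair (an ordinary renewal process).
    return np.cumsum(eta * rng.weibull(beta, n_events))

rng = np.random.default_rng(1)
beta, eta = 3.0, 1000.0                       # strongly non-exponential lifetimes

single = np.diff(failure_epochs(10_000, beta, eta, rng))
print("one component, CV of inter-failure times :", single.std() / single.mean())

merged = np.sort(np.concatenate(
    [failure_epochs(2_000, beta, eta, rng) for _ in range(50)]))
system = np.diff(merged)
print("fifty-component system, CV               :", system.std() / system.mean())

The first figure comes out around 0.36; the second is close to 1, which is why a constant failure rate is often a workable approximation at the system level even when no individual part behaves that way.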

In recent years, as the computational power of computer systems has increased, more complex and more realistic models of system behavior have been introduced. Many of these incorporate the concept of "virtual age," by which repairs "remove" some of the accumulated age.
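
One common formulation is Kijima's Type I model, in which a repair after an operating interval x adds only q·x to the virtual age, with q = 0 meaning "good as new" and q = 1 meaning "bad as old" (minimal repair). The sketch below, which assumes a Weibull time-to-first-failure distribution and illustrative parameter values, simulates failure counts for a few values of q:

import numpy as np

def kijima_type1_failures(beta, eta, q, horizon, rng):
    # Simulate failure times under a Kijima Type I virtual-age model with a
    # Weibull(beta, eta) baseline: after each repair the virtual age grows by
    # q times the preceding inter-failure time.
    t, v = 0.0, 0.0                     # calendar time, virtual age
    failures = []
    while True:
        u = rng.random()
        # Inverse-transform sample of the next inter-failure time,
        # conditional on having survived to virtual age v.
        x = eta * ((v / eta) ** beta - np.log(u)) ** (1.0 / beta) - v
        t += x
        if t > horizon:
            return failures
        failures.append(t)
        v += q * x                      # the repair removes a fraction (1 - q) of the new age

rng = np.random.default_rng(0)
for q in (0.0, 0.5, 1.0):
    n = len(kijima_type1_failures(beta=2.0, eta=1000.0, q=q, horizon=50_000, rng=rng))
    print(f"repair effectiveness q = {q:.1f}: {n} failures in 50,000 hours")

The more effective the repair (smaller q), the more accumulated age it removes and the fewer failures occur over the same horizon; q = 0 recovers an ordinary renewal process, while q = 1 recovers the minimal-repair model with its ever-increasing failure intensity (for a Weibull shape parameter greater than 1).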

Techniques to Improve Availability

Several techniques exist to improve system availability. The most commonly used is redundancy at some lower level of the system. Often, however, a more cost-effective approach is to use more reliable and more maintainable parts.

For example, consider a computer whose "data storage system" consists of two hard drives. Assuming 160GB of storage is required, the system could be constructed from two 80GB hard drives; the failure of either drive is then a system failure.

On the other hand, the system could be constructed of two 160GB hard drives connected in a RAID 1 configuration. In this case, both drives would have to fail to cause system failure.

Assume the mean time to restore a failed drive is two hours. To achieve 0.99999 availability, the first configuration requires hard drives with an MTBF on the order of 200,000 hours; in the RAID 1 configuration, a drive MTBF of roughly 623 hours is sufficient.
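
The arithmetic behind this comparison is easy to check. The sketch below assumes independent drive failures, independent repair of each drive, and steady-state intrinsic availability (no administrative or logistic delays); under those simplifying assumptions the mirrored pair with 623-hour drives lands essentially on the five-nines target, while the two-drive "both required" configuration needs drive MTBFs in the hundreds of thousands of hours to get close:

def drive_availability(mtbf, mttr):
    return mtbf / (mtbf + mttr)

def spanned_availability(a):
    # Two 80GB drives, both required: failure of either drive fails the system.
    return a * a

def mirrored_availability(a):
    # Two 160GB drives in RAID 1: the system fails only if both drives are down.
    return 1 - (1 - a) ** 2

mttr = 2.0                                    # hours to replace and resynchronize a drive
for mtbf in (623.0, 200_000.0):
    a = drive_availability(mtbf, mttr)
    print(f"drive MTBF {mtbf:>9,.0f} h: both-required pair {spanned_availability(a):.7f}, "
          f"mirrored pair {mirrored_availability(a):.7f}")

The mirrored configuration tolerates drives that are individually far less reliable, at the cost of buying twice the raw capacity.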

Availability Modeling

There are several methods for modeling system availability. The most common include state-space (Markov) models, Petri nets, reliability block diagrams, and fault trees. All of these can be evaluated either by direct analytical solution or by simulation.
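
As a small concrete example of the state-space approach, the sketch below models the mirrored-drive pair from the previous section as a three-state continuous-time Markov chain (both drives up, one drive down, both drives down) and solves for the steady-state availability. The per-drive rates and the assumption that each drive is repaired independently are the same simplifications used above:

import numpy as np

mtbf, mttr = 623.0, 2.0                # per-drive figures from the RAID 1 example
lam, mu = 1.0 / mtbf, 1.0 / mttr       # failure and repair rates

# States: 0 = both drives up, 1 = one drive down, 2 = both drives down.
# Generator matrix Q; each row sums to zero, repairs proceed independently.
Q = np.array([
    [-2 * lam,       2 * lam,      0.0],
    [      mu,   -(mu + lam),      lam],
    [     0.0,        2 * mu,  -2 * mu],
])

# The steady-state distribution pi satisfies pi @ Q = 0 with probabilities summing to 1.
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print("state probabilities :", pi)
print("availability        :", pi[0] + pi[1])   # ~0.99999, matching the RAID 1 figure

The same chain can be extended with additional states for administrative delay or preventive maintenance, which is where the state-space method starts to pay off over a simple closed-form formula.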

See the external links for more in-depth information regarding modeling techniques and tools.

External Links

  • System Reliability Center
  • Relex
  • Reliasoft