Overview of Disaster Tolerance at SMU
Disaster Tolerance Research
SMU is currently engaged in a formal research program to advance the development of Disaster Tolerance computing and communications systems. This site serves as the central repository for all research communication, documentation, research papers and relevant information. Research papers can be downloaded, uploaded and viewed at the DT Papers and Documents section.
Disasters
A disaster is a man-made or natural event that can cause system-wide malfunctions through one or more failures within a system. Those failures may stem from a single point of failure or from multiple single points of failure that occur simultaneously or nearly simultaneously. A catastrophe can result from a disaster and may be avoided by using disaster avoidance mechanisms.
Fault-tolerance, or graceful degradation, is a property that enables a system to continue operating properly in the event of the failure of some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively-designed system in which even a small failure can cause total breakdown. Fault-tolerance is particularly sought-after in high-availability or life-critical systems. Significant strides in fault tolerance in information computing systems have been achieved since research in this field began in the 1960s.
Within the scope of an individual system, fault-tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and, in general, aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making it sufficiently reliable is very high, a better solution may be to use some form of duplication.
Common components of fault-tolerance are:
- Replication: Providing multiple identical instances of the same system, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum.
- Redundancy: Providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure (fall-back or backup).
- Diversity: Providing multiple different implementations of the same specification, and using them like replicated systems to cope with errors in a specific implementation.
- Voting Mechanisms: Incorporating redundant system components and a voting mechanism that elects the most viable system as primary.
- Error-Checking: Corrective methods that address faults as they are detected and provide tolerance of defined failure events. Examples include watchdog software and hot-swappable components.
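The replication and redundancy components above can be illustrated with a minimal Python sketch. It is not taken from the SMU research; the function names (quorum_result, replicated_call, failover_call) and the toy replicas are hypothetical, and a replica is assumed to be any callable that either returns a result or raises an exception.

```python
from collections import Counter


def quorum_result(results, quorum):
    """Return the value reported by at least `quorum` replicas, else raise."""
    if not results:
        raise RuntimeError("no replica produced a result")
    value, votes = Counter(results).most_common(1)[0]
    if votes < quorum:
        raise RuntimeError("no result reached the quorum")
    return value


def replicated_call(replicas, request):
    """Replication: send the request to every replica and vote on the answers."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(request))
        except Exception:
            continue  # a crashed replica simply casts no vote
    # Assumption: the quorum is a strict majority of all configured replicas.
    return quorum_result(results, quorum=len(replicas) // 2 + 1)


def failover_call(instances, request):
    """Redundancy: try the primary first, then each backup in order."""
    last_error = None
    for instance in instances:
        try:
            return instance(request)
        except Exception as error:  # a real system would catch narrower errors
            last_error = error
    raise RuntimeError("all redundant instances failed") from last_error


# Hypothetical replicas used only for this demonstration.
def healthy(x):
    return x * 2


def faulty(x):
    return x * 2 + 1  # silently returns a wrong answer


def crashed(x):
    raise ConnectionError("replica down")


print(replicated_call([healthy, healthy, faulty], 21))  # 42
print(failover_call([crashed, healthy], 21))            # 42
```

Voting masks a replica that returns a wrong answer, while failover only helps when a failure is detectable as an error. That difference is one reason the two techniques are often combined.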
BCP & Disaster Recovery
Disaster Recovery (DR) and Business Continuity Planning (BCP), of which DR is a subset, are closely related terms. Both describe methodologies for creating and executing a plan for how an organization will resume partially or completely interrupted information technology (IT), organizational or business-critical functions within a predetermined time after a disaster or disruption has occurred. These efforts are specifically targeted at reducing operational risk and therefore overlap with traditional risk management practices. DR and BCP commonly use IT along with fault-tolerant systems and methodologies to help achieve recovery or continuity, and therefore also overlap with fault-tolerance disciplines.
Disaster Tolerance (DT) in computing and communications systems refers to the ability of IT systems, communications infrastructure, and business or organizational processes to maintain a degree of functionality after a disaster has occurred. Disaster Tolerance provides the ability to continue operations uninterrupted despite a disaster that significantly disrupts normal organizational operations. Specifically, within DT, critical business functions and technologies continue operating, as opposed to being resumed.
The terms “fault tolerance” and “disaster tolerance” must be distinguished. Fault-tolerant system design usually addresses natural failure due to hardware wear and tear, software design errors, and unintentional user errors. Disaster tolerance is a superset of fault tolerance: a disaster may cause rapid, almost simultaneous, multiple points of failure in a system, as well as single points of failure that escalate into wide, catastrophic system failures.
The Disaster Tolerance Problem
Increased Threat
The terrorist attacks of September 11, 2001, the US Northeast power outage of August 2003, and Hurricane Katrina in 2005 are recent examples of devastating man-made disasters and massively destructive natural disasters in the US. They demonstrate a shared corporate and governmental inability to successfully resume or continue normal operations after such events. Businesses have become aware that they cannot rely on government institutions alone to facilitate this process.
Increased Organizational and Technology Complexity
As organizations grow and become more complex, their dependence on IT systems to conduct business processes increases. Within the computing industry, single points of failure have been addressed, and higher levels of IT system uptime have become more readily available through advances in fault tolerance and systems engineering. Organizations that wish to recover after a disaster traditionally do so through Disaster Recovery (DR) and Business Continuity Planning (BCP) methodologies. However, worst-case scenarios for business continuity that were once thought impossible are now more prominent due to threats from man-made disasters such as terrorist attacks. As a result, the need for more robust, reliable and available information and communications systems is increasing.
Lack of Disaster Tolerant Technologies, Systems and Practices
Current practices in DR and BCP allow for some levels of recovery. However, these practices fail to provide adequate recovery capabilities for organizations to survive disasters such as the loss of an entire building, a city block or large portions of a city itself. In these scenarios, cascading, virtually simultaneous, multiple-point-of-failure events occur. Evidence indicates that strategy, priority, management, personnel and technology challenges surrounding DR and BCP render these practices ineffective. This lack of effective DR and BCP methodologies has crippling potential for businesses and government agencies alike. As a result, there is a need for methodologies that allow for continued, uninterrupted business and IT service throughout such an occurrence.
Organizations are challenged by the complexity and difficulty of successfully formulating, implementing and executing traditional DR plans and BCP initiatives. High-availability IT applications and hardware for DR processes are increasingly complex and require significant manual, human interaction. Mounting evidence regarding the execution of DR plans suggests that errors in executive management strategy, personnel management within a crisis, and impact and risk assessment, as well as incorrect assumptions and inadequately tested processes, are common and result in deficient DR and BCP efforts.
The small percentage of firms that have the foresight and resources to effectively weigh the risks and costs of mitigating application downtime and major outages commonly invest in BCP and DR plans involving alternate ‘hot’, ‘warm’ or ‘cold’ failover sites: secondary, geographically separate sites to which business and IT services and applications can be transitioned in the event of a disaster. Unfortunately, such efforts are often made after an IT solution has been designed and implemented, not before, when they could have had the most beneficial effect on IT architecture and implementation.
Lack of Executive Visibility & Awareness
In addition to the complex technology challenges associated with DR, business and organizational executives commonly lack adequate comprehension of the risks, consequences, impacts, costs and probability of disasters. From a business perspective, spending money on products and services that a business does not use is not a prudent business decision. DR has historically carried incrementally higher costs for redundant IT infrastructure, personnel and related processes.
Evidence indicates that a majority of business leaders tend to perceive DR as a poor investment that their business is unlikely to use, as they do not perceive the risk and potential impacts of outages as significant enough to outweigh the costs of implementing DR.
Executive Management Strategy and Investment in DR
Data indicates that disaster preparedness is not a business priority for most US and UK companies, and that a lack of executive visibility, responsibility and investment in corporate DR and BCP is prevalent. Executive leadership and management repeatedly fail to view DR and BCP as a means of protecting investments and revenue streams. As a result, DR and BCP related risk reduction practices commonly remain unacknowledged and unappreciated. They continue to be viewed as a dispensable corporate asset rather than one that would be invaluable if implemented functionally and successfully.
Business concern, risk comprehension and investment in DR and BCP are relatively unaffected by recent man-made and natural disasters. Despite evidence of the high costs of business service interruptions and IT systems downtime, the perceived risk of an outage does not tend to outweigh the costs of DR and BCP for most businesses. As a result, executive corporate management continues to gamble with the stability, profitability and continuity of their businesses.
Lack of Adequate Testing
Inadequate or non-existent testing of DR and BCP plans is prevalent in the majority of businesses in the US and UK. Most disaster recovery plans fail due to lack of testing, a failure that thorough testing of DR and BCP plans might otherwise prevent. Survey data indicates that 24 percent of companies worldwide neglect to test their disaster plan; the figure is higher in the United States, where 34 percent do not test at all. Among those that do not test their plans, 48 percent state that their top barrier is a lack of time to engage in DR testing. Since 9/11, executives state that their companies are in fact less prepared to deal with a disaster than in previous years.
Inadequate Disaster Preparedness
In effect, data indicates that disaster preparedness is not a business priority for most US and UK companies, and that a lack of executive visibility, responsibility and investment in corporate DR and BCP is prevalent.
Summary: Inability of businesses to reach the goal of organizational continuance in the event of a disaster
These factors collectively indicate that businesses are unable to reach the goal of providing organizational continuance in the event of a traumatic disaster. In actuality, a large portion of organizational investment in disaster recovery and business continuity is wasted in the failed recovery process itself, reducing the value of this investment for the small percentage of companies that make it. This lost investment does not produce the desired result: disaster tolerant business processes, IT applications, systems, and related infrastructure.
Potential Solutions in Disaster Tolerant Computing and Communications
Potential solutions for these business process and technology challenges may be found within evolving Disaster Tolerant methodologies. Such efforts would involve complete organizational and executive management commitment to achieving capabilities for disaster tolerance, not merely recovery. This may be achieved through a complement of business investment, executive visibility into risk and impact, and organizational Disaster Tolerance initiatives involving business strategy, people, process, technology and testing. These initiatives would also include provisions for designing and implementing IT systems and business processes that support redundant, geographically disparate sites from the initial business process and IT systems solutions architecture onward.
Sources:
- “Realities of Disaster Recovery: Perceived Business Value, Executive Visibility & Capital Investment”, Lawler, SMU, 2006
- “Disaster Tolerant Computer and Communications Systems”, Nair, Szygenda, Thornton, SMU, 2006

