Software Recovery Tests Overview

Many computer-based systems must recover from faults and resume processing within a pre-specified time. In some cases, a system must be fault-tolerant, meaning that processing faults must not cause the overall system function to stop. In other cases, a system failure must be corrected within a specified period of time or severe damage will occur (economic, health-related, physical, and so on).

To test a system’s capability to handle faults, recovery testing is performed. This testing approach forces the software to fail and verifies that recovery is properly performed. It is essential for any mission-critical system (for example, FDA Class III products and defense systems). Such systems, by their nature, impose a strict protocol for how the system should behave in case of a failure. Other examples include large financial systems, banking, logistics and many more.
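
As a minimal sketch of that idea, the Python below forces a failure in a toy service, polls it until it reports healthy again, and checks the measured recovery time against a deadline. The `FlakyService` class, its `crash` and `is_healthy` methods, and all timing values are hypothetical stand-ins for a real system under test and its fault-injection hooks.

```python
import time


class FlakyService:
    """Hypothetical stand-in for a system under test (not a real API)."""

    def __init__(self, restart_delay_s):
        self.restart_delay_s = restart_delay_s
        self._down_until = 0.0

    def crash(self):
        # Fault injection: the service stays down for restart_delay_s seconds.
        self._down_until = time.monotonic() + self.restart_delay_s

    def is_healthy(self):
        return time.monotonic() >= self._down_until


def measure_recovery(service, deadline_s, poll_s=0.01):
    """Force a failure, then return (recovered_in_time, elapsed_seconds)."""
    start = time.monotonic()
    service.crash()
    while time.monotonic() - start < deadline_s:
        if service.is_healthy():
            return True, time.monotonic() - start
        time.sleep(poll_s)
    return False, time.monotonic() - start


if __name__ == "__main__":
    ok, elapsed = measure_recovery(FlakyService(restart_delay_s=0.05), deadline_s=1.0)
    print(f"recovered={ok} after {elapsed:.3f}s")
```

In a real test the deadline would come from the system's recovery requirement, and the health check would exercise actual functionality rather than a timer.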

Recovery testing is executed to estimate how long recovery takes and how effectively the application returns to normal after a disturbance to proper operation. The disturbances taken into account vary from system to system, according to the nature of the system and the analysis of the potential disturbances that might affect it. Each product and industry faces different recovery-related challenges that affect this analysis, such as:

  • Healthcare & medical devices: Healthcare products are developed under strict regulations, imposed both by the FDA and by the companies themselves. This is especially true of embedded systems, and it applies equally to recovery tests. All parts of both the environment and the tests themselves should be validated and linked directly to the strict requirements. Hence, building the recovery test environment and test plans is demanding under these constraints, on top of identifying the problematic faults during the analysis.
  • Defense industry: Many embedded systems are actually systems of systems, so the integration between the different systems is, on its own, a complex source of recovery-related issues. In addition, many military systems are by nature exposed to intense environmental conditions, and simulating those conditions is a complicated procedure.
  • Cloud-based applications: Large-scale computing and data storage systems, including clusters within Google, Amazon, and elsewhere, are becoming a dominant platform for an increasing variety of applications and services. These “cloud” systems comprise thousands of commodity machines (to take advantage of economies of scale) and thus require sophisticated and often complex distributed software to mask the limited reliability of commodity PCs, disks, and memories; this often makes recovery analysis and execution quite a challenge.

The main question at this stage is how to perform this analysis, since carrying it out amounts to writing a large portion of the test plan itself. The following table presents key points of failure to consider, most of which are relevant to the majority of application recovery tests:

| Fail-over cause | Possible impact | Impact severity (critical / high / medium / low)* | How to simulate |
|---|---|---|---|
| An external server unreachable | I/O from server fails, causing error or crash. | High to critical | Disconnect the server at different points in time. Points should be selected so they simulate the major possible app states. |
| Server reachable yet not responding as expected | Probable error, crash in extreme cases. | Medium to high, depending on the error | Simulate wrong responses on the server side. |
| Power supply failure | Error, up to total shutdown in case of a failure in the auxiliary power source. | Critical | Unplug the power source. Change the power strength suddenly. |
| Wireless network signal loss | I/O from network stops. Error in most cases. | Medium to high | Change network settings in the OS, or shut down the network if it is a local one (if possible). |
| External device not responding | I/O from device stops. Error in most cases. | Medium to high | Shut down / unplug the device, or change a relevant setting (if there is one) that will stop it generating I/O. |
| External device responding in an unexpected way | Probable error, crash in extreme cases. | Medium to high, depending on the error | Simulate wrong responses on the device side. |
| Physical conditions, such as temperature, humidity etc. | Slower response, application stuck, total shutdown. | Critical | Expose the whole environment to adverse physical conditions as far as possible, and run all tests within it. |
| Electrical disturbances from nearby devices | Errors, slower response. | Medium to high | Control / change the proximity of signal-generating devices to one another while tests are run. Use a distinct set of tests with known outcomes to isolate this factor. |
| Service stopped | Error, application stuck. Sometimes no/minimal effect. | Low to high | |
| Missing resources (such as a DLL) | Error / crash. | Medium to high | Remove resources while the application is working. |
| Co-existence (for example, how does Chrome work if you have other browsers installed) | Wrong behavior / error. | Low to medium | Use real-life stations, for example a PC with other apps installed. |
| DB overload | Slow response time / error. | Low to medium | Run a load test of the application with relevant tools. |
| Disconnected network | I/O from network stops. Error in most cases. | Medium to high | Change network settings in the OS, or shut down the network if it is a local one (if possible). |
| Network overload | I/O from the network slows down or stops; can cause errors or a stuck application. | Medium to high | Generate load on the network, specifically on the components that should generate I/O to the AUT. |
| Different network issues such as jitter, packet loss, packet mis-order | Different errors / misbehavior. | Low to medium | Simulate each network issue. |

*Severity is relative to other failover issues, not to the severity definitions of the AUT (application under test).
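
To illustrate the last row of the table, here is a minimal sketch of simulating packet loss, jitter, and mis-ordering on an in-memory packet stream. The `impair` function and its parameters are hypothetical and only model the faults; on a real network, an OS-level traffic shaper or a dedicated network emulator would typically take this role.

```python
import random


def impair(packets, loss=0.0, reorder=0.0, max_jitter_ms=0.0, seed=0):
    """Return (delay_ms, packet) pairs after applying simulated network faults.

    loss and reorder are probabilities in [0, 1]. All parameters are
    illustrative assumptions, not measurements of any real network.
    """
    rng = random.Random(seed)  # seeded so a failing test run can be replayed
    delivered = []
    for packet in packets:
        if rng.random() < loss:
            continue  # packet loss: drop the packet silently
        delay = rng.uniform(0.0, max_jitter_ms)  # jitter: variable delay
        delivered.append((delay, packet))
        if len(delivered) >= 2 and rng.random() < reorder:
            # mis-ordering: swap the two most recently delivered packets
            delivered[-1], delivered[-2] = delivered[-2], delivered[-1]
    return delivered


if __name__ == "__main__":
    for delay, packet in impair(range(8), loss=0.2, reorder=0.3, max_jitter_ms=50):
        print(f"packet {packet} delivered after {delay:.1f} ms")
```

Feeding the impaired stream to the application under test, one fault type at a time, helps attribute each error or misbehavior to a single network issue.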

A software recovery testing engineer verifies not only the recovery method, but also the reliability of the integral parts of that procedure.

The tester must ensure that the following tasks are done before executing a recovery test:

  • Recovery analysis (shown above): This phase might also include planning a failover test, which determines whether a system is able to allocate extra resources, such as additional CPUs or servers, either during critical failures or at the point the system reaches a predetermined threshold.
  • Test plan preparation: Designing the test cases according to the analysis results and the environment.
  • Test environment preparation: Designing the environment depends heavily on the previous stage, since each detail in the prepared environment should link directly to one of the external effects we want to test against the system.
  • Maintaining sufficient backup information, such as various states of the software or a database containing valuable data such as customers’ details.
  • Backing up the data in multiple locations.
  • Allocating and educating recovery personnel.
  • Documenting the recovery techniques that are being followed.
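
The backup tasks in the list above can be checked mechanically before a recovery test starts. The following is a minimal sketch, assuming backups are plain files copied into several directories; the helper names `back_up` and `restore` and the file names are hypothetical, and a real system would more likely use its own backup tooling and remote storage.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(source, destinations):
    """Copy source into every destination directory and verify each copy."""
    expected = sha256_of(source)
    copies = []
    for directory in destinations:
        directory.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(source, directory / source.name))
        if sha256_of(copy) != expected:
            raise IOError(f"backup at {copy} does not match the source")
        copies.append(copy)
    return copies


def restore(copies, expected_sha256, target):
    """Restore from the first copy whose checksum still matches."""
    for copy in copies:
        if copy.exists() and sha256_of(copy) == expected_sha256:
            shutil.copy2(copy, target)
            return copy
    raise IOError("no intact backup found")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "customers.db"  # hypothetical data file
        source.write_bytes(b"valuable data")
        digest = sha256_of(source)
        copies = back_up(source, [root / "site_a", root / "site_b"])
        copies[0].write_bytes(b"corrupted")  # simulate a damaged backup
        used = restore(copies, digest, root / "restored.db")
        print(f"restored from {used.parent.name}")
```

Recording the checksum at backup time is the design choice that matters here: it lets the restore step reject a damaged copy and fall back to another location.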

Testing a recovery process for unpredictable problems is not simple and usually demands considerable resources, but it has the following benefits:

  • Confidence that the disaster recovery plan works, which reduces risk and improves system quality.
  • Staff who are trained in executing tests and in managing an actual disaster recovery.
  • Discovery of problems, mistakes, and errors, and their resolution before these procedures have to go live.
  • Increased awareness of the disaster recovery strategy and its importance for business continuity operations.
  • Recognition by members of the organization, both IT and business, of the need for disaster recovery and of the benefits of planning accordingly.