System Software Support of Hardware Efficiency

by Thomas Kaegi and Igor Schagaev

Today, computer systems are applied in safety-critical areas such as military, aviation, intensive health care, industrial control and space exploration. All these areas demand the highest possible reliability of functional operation. However, the impact of ionized particles and radiation on current semiconductor hardware inevitably leads to faults in the system. Such phenomena are expected to be observed much more often in the future due to the ongoing miniaturisation of hardware structures.

In this book we address the question of how system software should be designed to cope with such faults, and which fault tolerance features it should provide to achieve the highest reliability. We also show how the system software interacts with the hardware to tolerate these faults.

First, we analyse and further develop the theory of fault tolerance to understand the different ways of increasing the reliability of a system. Ultimately, the key is to use redundancy in all its different forms. We revise and further develop the generalized algorithm of fault tolerance (GAFT), with its three main processes (hardware checking, preparation for recovery, and the recovery procedure), as our approach to the design of fault tolerant systems. For each of the three processes, we analyse the requirements and properties theoretically and give possible implementation scenarios.
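To make the three processes more concrete, the following is a minimal sketch of a check / checkpoint / rollback cycle in Python. It is an illustration only and not the GAFT formulation developed in the book: the function `run_with_recovery`, the duplicated `value`/`shadow` fields, and the simulated bit flip are all hypothetical names and assumptions chosen for this example.

```python
import copy

def run_with_recovery(step, state, check, max_retries=3):
    """Run one step under a check / checkpoint / rollback cycle (illustrative)."""
    # Preparation for recovery: keep a known-good copy of the state.
    checkpoint = copy.deepcopy(state)
    for _ in range(max_retries + 1):
        # Always work on a fresh copy so a faulty result can be discarded.
        candidate = step(copy.deepcopy(checkpoint))
        # Checking: accept the result only if the redundancy check passes.
        if check(candidate):
            return candidate
        # Recovery: drop the faulty result and retry from the checkpoint.
    raise RuntimeError("fault persists after retries")

# Hypothetical demo: the value is stored twice (information redundancy),
# and the first execution suffers a simulated transient bit flip.
calls = {"n": 0}

def faulty_increment(s):
    calls["n"] += 1
    s["value"] += 1
    s["shadow"] += 1
    if calls["n"] == 1:
        s["value"] ^= 0x10  # simulated transient fault on the first attempt
    return s

def copies_agree(s):
    return s["value"] == s["shadow"]

def always_corrupt(s):
    s["value"] += 1  # shadow is never updated: models a permanent fault
    return s

result = run_with_recovery(faulty_increment, {"value": 0, "shadow": 0}, copies_agree)
```

In this sketch a transient fault is masked by repeating the step from the checkpoint, whereas a fault that persists across all retries (as with `always_corrupt`) is reported to the caller instead of being silently accepted.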

Based on the theoretical results, we derive an Oberon-based programming language with direct support of the three processes of GAFT.

In the last part of this book, we analyse a simulator-based proof-of-concept implementation of a novel fault tolerant processor architecture (ERRIC) and its newly developed runtime system, in terms of both features and performance.

"the most comprehensive and detailed account of fault tolerance that I've come across … required reading for anyone designing safety-critical embedded systems." - Geoffrey Sharman, Head of the British Computer Society Advanced Programming Group.

Table of Contents:

    Glossary
  1. Introduction
  2. Part I A structured approach to fault tolerance
  3. Hardware faults
  4. Fault tolerance: theory and concepts
  5. Generalized algorithm of fault tolerance (GAFT)
  6. Conclusion for Part I
  7. Part II Approaches to hardware / software implementation of fault tolerance
  8. System software support for hardware deficiency: function and features
  9. Testing and Checking
  10. Recovery preparation
  11. Recovery & recovery monitoring
  12. Conclusion Part II
  13. Part III Implementation
  14. Programming Language aspects for safety critical systems
  15. Proposed runtime system structure
  16. Proposed runtime system vs. existing approaches
  17. The ERRIC architecture
  18. Architecture comparison and evaluation
  19. ERRIC reliability analysis
  20. References