Formale Grundlagen der Fehlertoleranz in verteilten Systemen

Gärtner, Felix Christoph (2001)
Formale Grundlagen der Fehlertoleranz in verteilten Systemen.
Technische Universität Darmstadt
Ph.D. Thesis, Primary publication

Preview

PDF
gaertner.pdf
Copyright Information: In Copyright.
Download (808kB) | Preview

Item Type:

Ph.D. Thesis

Type of entry:

Primary publication

Title:

Formale Grundlagen der Fehlertoleranz in verteilten Systemen

Language:

German

Referees:

Kammerer, Prof.Dr. Peter ; Mattern, Prof.Dr. Friedemann

Advisors:

Kammerer, Prof.Dr. Peter

Date:

23 July 2001

Place of Publication:

Darmstadt

Date of oral examination:

17 May 2001

Abstract:

Fehlertolerante Systeme arbeiten auch dann noch korrekt, wenn während ihrer Ausführung eine bestimmte Anzahl von Fehlern auftreten. Um fehlertolerante Systeme zu entwickeln und zu validieren, benötigt man eine wissenschaftlich gesicherte Methodik. Bestandteile einer solchen Methodik sind Verfahren der Modellbildung, Entwurfsmuster für Fehlertoleranzverfahren und vorgefertigte algorithmische Grundbausteine. Diese Arbeit trägt zu jedem dieser drei Bereiche bei. Im Rahmen einer allgemeinen Modelluntersuchung werden in Kapitel 2 vier wichtige Systemmodelle für fehlertolerante verteilte Systeme vorgestellt und miteinander verglichen: das (partiell) synchrone Modell von Dwork, Lynch und Stockmeyer, das Fehlerdetektormodell von Chandra und Toueg, das zeitbeschränkte asynchrone Modell von Cristian und Fetzer, sowie das quasi-synchrone Modell von Almeida, Verissimo und Casimiro. Kapitel 3 untersucht die grundsätzliche Funktionsweise von Fehlertoleranzverfahren im Rahmen der Fehlertoleranztheorie von Arora und Kulkarni. In dieser Theorie entstehen fehlertolerante Programme aus der Komposition eines fehler-intoleranten Programms mit Fehlertoleranzkomponenten. Kapitel 3 erweitert diese Theorie um eine präzise Begrifflichkeit der Redundanz. Es werden zwei grundlegende Formen von Redundanz identifiziert (Speicherredundanz und Zeitredundanz) und es wird gezeigt, welche Art von Redundanz notwendig ist, um auch im Fehlerfall bestimmte Eigenschaften zu erfüllen. Kapitel 4 beschäftigt sich mit Lösungen des Beobachtungsproblems in verteilten Systemen, in denen Fehler auftreten können. Beobachten heißt hierbei, zu wissen, ob ein globales Prädikat während der Ausführung eines Algorithmus gilt oder nicht. Es werden zwei neue Beobachtungsmodalitäten definiert für die Entdeckung allgemeiner Prädikate in asynchronen Systemen, in denen crash-Fehler auftreten können. Die neuen Modalitäten können als Erweiterungen der aus den fehlerfreien Systemen bekannten Beobachtungsmodalitäten `possibly' und `definitely' von Cooper, Marzullo und Neiger angesehen werden. Abschließend werden Entdeckungsalgorithmen für die neuen Modalitäten vorgestellt und verifiziert.

Alternative Abstract:

Alternative Abstract

Language

A system is fault-tolerant if it maintains some form of correctness in the presence of faults. Fault-tolerant systems are necessary in many application areas of computing, especially where the failure of a computer system may lead to considerable damage. However, the development of fault-tolerant systems is a difficult task and requires rigorous and scientifically sound engineering methodologies. Amoung other topics, these methodologies must cover (1) system and fault modelling, (2) design patterns for fault-tolerance mechanisms and (3) basic building blocks for fault-tolerant algorithms. This thesis contributes to research in all of these three areas. Questions of system and fault modelling are treated in chapter 2 in which the four most prominent system models for fault-tolerant distributed systems are presented and compared: the model of partial synchrony by Dwork, Lynch and Stockmeyer, the failure detector model of Chandra and Toueg, the timed asynchronous system model of Cristian and Fetzer, and the quasi-synchronous model by Almeida, Verissimo and Casimiro. Chapter 3 focusses on the underlying principles of fault-tolerant systems. Starting point is the fault-tolerance theory of Arora and Kulkarni. In this theory, a fault-tolerant program is regarded as the composition of a fault-intolerant program and fault-tolerance components. We analyse the assumptions of the theory and extend it by defining formal notions of redundancy. Two forms of redundancy are identified (redundancy in space and redundancy in time) and we study which forms of redundancy are necessary to maintain which properties in the presence of faults. In chapter 4 we develop algorithmical building blocks for observation in faulty environments, a central problem in fault-tolerant computing which has not enjoyed a lot of research attention yet. Observation here means detecting whether or not a boolean predicate on global states holds during the execution of a distributed system. In the asynchronous system model with crash failures, we introduce two new observation modalities called `negotiably' and `discernibly' (which roughly correspond to the well-known modalities `possibly' and `definitely' by Cooper, Marzullo and Neiger) and present detection algorithms for them under increasingly weak fault assumptions.

English

Uncontrolled Keywords:

verteiltes System, Beobachtungsmodalität, Übereinstimmungsproblem

Alternative keywords: