Efficient and Low-Cost Fault Tolerance for Web-Scale Systems
Technische Universität Darmstadt
Ph.D. Thesis, 2010
Available under Creative Commons Attribution-NonCommercial-NoDerivs 2.5.
|Item Type:||Ph.D. Thesis|
|Title:||Efficient and Low-Cost Fault Tolerance for Web-Scale Systems|
Online Web-scale services are increasingly used to handle critical personal information. The trend towards storing and managing such information in the “cloud” extends the need for dependable services to a growing range of Web applications, from email and calendars to photo storage and finance. This motivates the increased adoption of fault-tolerant replication algorithms in Web-scale systems, ranging from classic, strongly-consistent replication in systems such as Chubby [Bur06] and ZooKeeper [HKJR10] to highly-available, weakly-consistent replication as in Amazon’s Dynamo [DHJ+07] or Yahoo!’s PNUTS [CRS+08].

This thesis proposes novel algorithms to make fault-tolerant replication more efficient, available, and cost-effective. Although the proposed algorithms are generic, their goals are motivated by two major needs of Web-scale systems. The first need is tolerating worst-case failures, called Byzantine failures in the literature after the definition of [LSP82a], in order to handle critical personal information reliably. The second need is identifying proper weak consistency semantics for systems that must maximize availability and minimize performance and replication costs without relaxing consistency unnecessarily.

Byzantine-Fault Tolerance: There has been a recent burst of research on Byzantine-Fault Tolerance (BFT), aimed at making its performance and replication costs feasible and comparable to the fault-tolerance techniques already in use today. BFT is typically achieved through state-machine replication, which implements the abstraction of a single reliable server on top of multiple unreliable replicas [Sch90]. This line of research ultimately aimed at showing the feasibility of this approach for Web-scale systems [CKL+09], protecting these critical systems from catastrophic events such as [Das]. This thesis proposes novel algorithms to reduce the performance and replication costs of BFT.
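The replication costs discussed here rest on the classic lower bounds for state-machine replication: tolerating f crash faults requires 2f + 1 replicas, while tolerating f Byzantine faults with a PBFT-style algorithm requires 3f + 1. A minimal sketch of these well-known bounds (the functions are illustrative, not code from the thesis):

```python
# Classic replica-count lower bounds for state-machine replication.
# These are the standard bounds from the literature, shown for
# illustration; they are not an artifact of this thesis.

def crash_tolerant_replicas(f: int) -> int:
    """Replicas needed to tolerate f crash faults (e.g. Paxos): 2f + 1."""
    return 2 * f + 1

def bft_replicas(f: int) -> int:
    """Replicas needed to tolerate f Byzantine faults (e.g. PBFT): 3f + 1."""
    return 3 * f + 1

if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f"f={f}: crash={crash_tolerant_replicas(f)}, "
              f"byzantine={bft_replicas(f)}")
```

The gap between the two bounds is what makes reducing BFT's replication cost worthwhile in practice.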
First, the thesis shows how to reduce the cost of BFT without assuming trusted components. After the seminal PBFT algorithm [CL99], a number of fast BFT algorithms, for example [MA06; DGV04; KAD+07], have been proposed. These papers show the existence of an inherent tradeoff between optimal redundancy and minimal latency in the presence of faulty replicas. This is problematic in Web-scale systems, where Byzantine faults are very rare but unresponsive (benign) replicas are commonplace. This thesis proposes a novel algorithm, called Scrooge, which reduces the replication costs of fast BFT replication in the presence of unresponsive replicas. Scrooge shows that the additional replication costs needed to be fast in the presence of faulty replicas depend only on the number of tolerated Byzantine faults, not on the number of tolerated crashes. As an implication of this result, Scrooge is optimally resilient when it is configured to tolerate one Byzantine fault and any number of crashes. Such a configuration is quite common, since Byzantine faults are relatively unlikely to happen.

This thesis then explores the advantages of using trusted components. It shows that these can lead to significant latency and redundancy cost reductions in practical asynchronous systems [SS07]. This dispelled the belief that trusted components need to be combined with synchronous links to achieve cost reductions, as hinted by previous work [CNV04; Ver06]. That additional assumption makes previously proposed algorithms impractical in many settings, including Web-scale systems. In three-tiered Web-scale systems, for example, one can leverage the fact that servers in the first tier (the Web servers) are typically more stable, standardized, and less prone to vulnerabilities than application servers. The HeterTrust protocol, which is presented in this thesis, uses trusted components without assuming synchronous links.
It protects data confidentiality using a number of replicas that is linear in the number of tolerated faults, and it has constant time complexity. This is a significant improvement over existing approaches, which do not rely on trusted components but entail quadratic redundancy costs and linear latency [YMV+03]. Furthermore, unlike existing work on confidential BFT, HeterTrust uses only symmetric-key cryptography instead of public-key signatures. HeterTrust features some interesting ideas related to speculation [KAD+07] and tolerance to denial-of-service attacks [ACKL08; CWA+09] that have been further developed by work published immediately after [SS07]. In parallel to this thesis’ work, the use of trusted components in asynchronous systems was also independently explored in [CMSK07].

Weak consistency: Some replicated Web-scale applications cannot afford strong consistency guarantees such as Linearizability [HW90]. The reason is the impossibility of implementing shared objects, such as databases, that remain available in the presence of partitions or asynchrony [GL02]. With few exceptions, however, all these systems relax Linearizability even in periods when there are no partitions and no asynchrony, and thus no relaxation is needed to keep the system available. Since this relaxation is problematic for many applications, recent research is focusing on stronger consistency guarantees that can be combined with high availability.

This thesis introduces a novel consistency property, called Eventual Linearizability, which allows Linearizability to be violated only for finite windows of time. This thesis also describes Aurora, an algorithm ensuring Linearizability in periods when a single leader is present in the system. Aurora is gracefully degrading because it uses a single failure detector and obtains different properties based on the actual strength of this failure detector, which is not known a priori. For Eventual Linearizability, a ⋄S failure detector is needed.
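The redundancy comparison above is asymptotic: HeterTrust's replica count grows linearly in the number of tolerated faults, while the prior confidential-BFT approaches cited entail quadratic redundancy. A toy illustration of how quickly that gap widens (the constant factors below are placeholders, not the actual bounds of HeterTrust or [YMV+03]):

```python
# Toy comparison of the asymptotic redundancy costs mentioned in the
# abstract. Constant factors are hypothetical placeholders; only the
# linear-vs-quadratic growth is taken from the text.

def linear_redundancy(f: int) -> int:
    """O(f) replicas, as with the trusted-component approach."""
    return f + 1  # placeholder constant

def quadratic_redundancy(f: int) -> int:
    """O(f^2) replicas, as with prior confidential BFT."""
    return f * f + 1  # placeholder constant

if __name__ == "__main__":
    for f in (1, 2, 4, 8):
        print(f"f={f}: linear={linear_redundancy(f)}, "
              f"quadratic={quadratic_redundancy(f)}")
```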
In periods of asynchrony, when links are untimely and no single leader is present, Aurora gracefully degrades to Eventual Consistency [FGL+96; Vog09] and Causal Consistency [Lam78]. For these properties, Aurora only relies on a strongly complete failure detector C. In order to complete strong operations, which must always be linearized, a ⋄P failure detector is used. This is stronger than ⋄S, the weakest failure detector needed to implement consensus [CHT96] and thus linearizable shared objects. This thesis shows that there exists an inherent cost in combining Eventual Linearizability with Linearizability.
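Aurora's graceful degradation can be summarized as a mapping from the failure-detector class the single detector actually satisfies at run time to the guarantee obtained. The mapping below follows the abstract; the code itself is an illustrative sketch, not from the thesis:

```python
# Sketch of Aurora's graceful degradation: the guarantee obtained
# depends on which failure-detector class the underlying detector
# actually satisfies (not known a priori). Mapping taken from the
# abstract; code is illustrative only.

GUARANTEES = {
    "diamond-P": "strong operations (always linearized)",
    "diamond-S": "Eventual Linearizability",
    "C (strongly complete)": "Eventual and Causal Consistency",
}

def guarantee(detector_class: str) -> str:
    """Return the consistency guarantee for a failure-detector class."""
    return GUARANTEES.get(detector_class, "unknown class")

if __name__ == "__main__":
    for cls in GUARANTEES:
        print(f"{cls}: {guarantee(cls)}")
```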
|Place of Publication:||Darmstadt|
|Classification DDC:||000 Allgemeines, Informatik, Informationswissenschaft > 004 Informatik|
|Divisions:||20 Department of Computer Science
20 Department of Computer Science > Dependable Embedded Systems & Software
20 Department of Computer Science > Theoretische Informatik
20 Department of Computer Science > Algorithmics|
|Date Deposited:||23 Sep 2010 11:04|
|Last Modified:||07 Dec 2012 11:58|
|Referees:||Suri, Prof. Neeraj and Rodrigues, Prof. Rodrigo|
|Refereed:||16 September 2010|