C O M P I L AT I O N F O R M O R E P R A C T I C A L S E C U R E M U LT I - PA RT Y C O M P U TAT I O N Vom Fachbereich Informatik der Technischen Universität Darmstadt genehmigte Dissertation zur Erlangung des akademischen Grades Doktor-Ingenieur (Dr.-Ing.) von Niklas Büscher, M.Sc. geboren in Münster Referenten: Prof. Dr. Stefan Katzenbeisser Prof. Dr. Florian Kerschbaum Tag der Einreichung: 21.11.2018 Tag der Prüfung: 11.01.2019 D 17 Darmstadt, 2018 Dieses Dokument wird bereitgestellt von tuprints, E-Publishing-Service der TU Darmstadt. http://tuprints.ulb.tu-darmstadt.de tuprints@ulb.tu-darmstadt.de Bitte zitieren Sie dieses Dokument als: URN: urn:nbn:de:tuda-tuprints-84950 URL: https://tuprints.ulb.tu-darmstadt.de/id/eprint/8495 Die Veröffentlichung steht unter folgender Creative Commons Lizenz: Attribution – NonCommercial – NoDerivatives 4.0 International (CC BY-NC-ND 4.0) http://creativecommons.org/licenses/by-nc-nd/4.0/ http://tuprints.ulb.tu-darmstadt.de mailto:tuprints@ulb.tu-darmstadt.de urn:nbn:de:tuda-tuprints-84950 https://tuprints.ulb.tu-darmstadt.de/id/eprint/8495 http://creativecommons.org/licenses/by-nc-nd/4.0/ A B S T R A C T Within the last decade, smartphone applications and online services became universal resources and an integral part of nowadays life. Un- fortunately, the ongoing digitization trend is also a huge risk for the individual’s privacy because users of these interconnected devices and services become more and more transparent and reveal sensi- tive data to an untrusted and possibly unknown service provider. Yet, since the 1980’s it is known that any computation between two or more parties can be evaluated securely such that the parties do not learn more about the inputs of the other parties than they can derive from the output of the computation. For a long time considered to be a purely theoretical concept, in the last fifteen years, this technique known as Secure Multi-Party Computation (MPC), transitioned into a powerful cryptographic tool to build privacy-enhancing technol- ogy. As such MPC could prevent mass surveillance of online services while maintaining the majority of their business use cases. Further- more, MPC could be an enabler for novel business-to-business use cases, where mutually distrusting parties corporate by sharing data without losing control over it. Albeit its potential, the practicality of MPC is hindered by the difficulty to implement applications on top of the underlying cryptographic protocols. This is because their man- ual construction requires expertise in cryptography and hardware de- sign. The latter is required as functionalities in MPC are commonly expressed by Boolean and Arithmetic circuits, whose creation is a complex, error-prone, and time-consuming task. To make MPC accessible to non-domain experts, in this thesis we design, implement, and evaluate multiple compilation techniques that translate the high-level language ANSI C into circuit representations optimized for different classes of MPC protocols. Split in two parts, we focus on Boolean circuit based protocols in the first part of this thesis. We begin with an introduction into compilation and optimiza- tion of circuits with minimal size, which is required for constant round MPC protocols over Boolean circuits, such as Yao’s Garbled Circuits protocol. For this purpose, we identify and evaluate classic logic minimization techniques for their application in compilation for MPC. Then, we present compiler assisted parallelization approaches for Yao’s protocol that distribute the computational workload onto multiple processors, which can allow a faster or possibly more en- ergy efficient protocol evaluation. By extending the protocol, we fur- ther show that parallelization leads to speed-ups even in single-core settings. As not only size minimization is of relevance for MPC, we also propose a compilation chain for the creation of depth-minimized iii Boolean circuits, optimized for their use in multi-round protocols, such as the GMW protocol. For this purpose, we propose and im- plement new hand-optimized building blocks as well as code and circuit minimization techniques. In most cases the presented compil- ers create applications from high-level source code that outperform previous (hand-optimized) work. In the second part, we introduce compilers for two advanced hy- brid MPC protocols. First, we study the creation of MPC applica- tions using multiple (standalone) MPC protocols at once. By com- bining protocols with different paradigms, e.g., Boolean and Arith- metic circuits based protocols, faster applications can be created. For the compilation of these hybrid applications we design and present novel code decomposition and optimization techniques. Moreover, we introduce solutions to the protocol selection problem to efficiently combine multiple protocols. Thus, we are able to present the first compiler that achieves full automatization from source code to hy- brid MPC. Second, we investigate compilation for the combination of Oblivious RAM with MPC, also known as RAM based secure compu- tation (RAM-SC). RAM-SC is required in data intensive applications, where circuit based protocols show limited scalability. A multitude of ORAMs based on different design principles with different trade-offs has been proposed. We explore all these design principles and corre- sponding deployment costs in different scenarios, before introducing a compiler that identifies an optimal selection of ORAM schemes for a given input source code. As such, we present the first fully autom- atized compile chain for RAM-SC programs. In summary, we contribute in making MPC practical by improving both, efficiency and automatized application generation. Z U S A M M E N FA S S U N G Eine der größten technischen Entwicklungen in den letzten 15 Jah- ren ist der Fortschritt des Smartphones, die ständige Verfügbarkeit von Informationen und Konnektivität, sowie die damit verbundene Nutzung von mächtigen Onlinediensten. Leider birgt dieser Digita- lisierungstrend auch große Risiken für die Privatsphäre der Nutzer, denn diese werden immer transparenter für den teilweise unbekann- ten Dienstanbieter. Es ist jedoch bereits seit Mitte der 1980er-Jahre be- kannt, dass jegliche Art von Berechnung oder Interaktion zwischen zwei oder mehreren Parteien auch privatsphärefreundlich durchge- führt werden kann, so dass die teilnehmenden Parteien nicht mehr über die Eingaben der anderen Parteien lernen können, als das was sie sich von der gemeinsame Ausgabe ableiten können. Lange war diese Idee der sicheren Mehrparteienberechnung (MPC) nur ein theo- retisches Konzept, jedoch hat es sich in den letzten 15 Jahren in ein iv sehr praktisches kryptographisches Werkzeug entwickelt, um privat- sphäreschützende Anwendungen zu realisieren. Als solches könnte MPC helfen die Massenüberwachung der Onlinedienste zu unterbin- den, ohne deren Geschäftsmodelle komplett außer Kraft zu setzen. MPC ermöglicht auch neuartige Geschäftsmodelle, so dass beispiels- weise zwei sich nicht vertrauende Unternehmen kooperieren und Daten teilen, ohne dabei die Kontrolle über diese Daten an den Ge- schäftspartner zu verlieren. Trotz des immensen Potentials von MPC findet es kaum Anwendung in der Praxis. Dies liegt vor allem an der Komplexität und dem Aufwand, um Anwendungen für existierende MPC Protokolle zu entwickeln. Die Erstellung dieser Anwendungen benötigt nämlich umfangreiche Kenntnisse der Kryptographie und des Logikdesigns (Schaltungsbau). Letzteres wird benötigt, da Funk- tionen in MPC üblicherweise in Form von logischen oder arithmeti- schen Schaltungen dargestellt werden, deren händische Erzeugung sehr fehleranfällig und zeitaufwendig ist. Um MPC auch Nicht-Experten zugängliche zu machen, haben wir im Rahmen dieser Arbeit verschiedene Übersetzungstechniken (Com- piler) entwickelt, implementiert und evaluiert, die es ermöglichen standard ANSI C Programmcode automatisiert in Schaltungen für verschiedene Klassen von MPC Protokollen zu übersetzen. Diese Ar- beit besteht aus zwei Teilen, wobei sich der erste Teil mit der Erzeu- gung von logischen Schaltkreisen beschäftigt. Wir beginnen mit einer Betrachtung der Konstruktion von Schaltungen mit minimaler Grö- ße. Diese werden üblicherweise in MPC Protokollen mit konstanter Rundenanzahl eingesetzt, wie beispielsweise Yao’s Garbled Circuits Protokoll. Zu diesem Zweck analysieren und adaptieren wir klas- sische Logikminimierungstechniken für MPC. Anschließend zeigen wir, wie Yao’s Protokoll mit Hilfe eines Compilers effektiv paralleli- siert werden kann, um die Protokollausführung auf Mehrprozessor- systemen zu beschleunigen. Weiterhin präsentieren wir eine Protokol- lerweiterung, die selbst auf einem einzigen Rechenkern die Berech- nungszeit durch Parallelisierung reduzieren kann. Neben der Mini- mierung der Schaltungsgröße, ist die Minimierung der Schaltungstie- fe von besondere Relevanz für MPC Protokolle mit variabler Runden- anzahl, wie beispielsweise dem GMW Protokoll. Für diese Art von Protokollen präsentieren wir einen Übersetzungsansatz, bestehend aus neuen handminimierten Schaltungsbausteinen und Schaltungs- minimierungstechniken. Die im Rahmen dieser Arbeit entstandenen Compiler, die die vorgestellten Techniken implementieren, erzeugen Schaltungen aus einer Hochsprache, die meist schneller evaluiert wer- den können als die zuvor vorgestellten und oft händisch optimierten Schaltungen. Im zweiten Teil dieser Arbeit beschäftigen wir uns mit der Schal- tungserzeugung für hybride MPC Protokolle. Hier zeigen wir zum einen die automatisierte Konstruktion von Anwendungen, die ver- v schiedene MPC Protokolle in einer Anwendung kombinieren. Durch die Kombination von Protokollen mit verschiedenen Designprinzipi- en, beispielsweise logische und arithmetische Schaltungen, können Anwendungen weiter beschleunigt werden. Zur Erzeugung dieser hy- briden Anwendungen präsentieren wir neue Codezerlegungs- und Optimierungstechniken. Weiterhin stellen wir Lösungen vor, welche die effizienteste Kombination von Protokollen für eine gegebene An- wendungszerlegung identifizieren kann. Mit diesem Beitrag können wir den ersten Compiler präsentieren, der eine vollständige Auto- matisierung von Programmcode zu hybriden Anwendungen erzielt. Als zweite hybride Protokollklasse betrachten wir die Compilierung von RAM basierten Protokollen. Diese so genannten RAM-SC Pro- tokolle kombinieren klassische MPC Protokolle mit Oblivious RAM (ORAM) Techniken und werden für datenintensive Anwendungen benötigt. Da existierende ORAMs verschiedenste Designprinzipien verwenden, präsentieren wir eine umfangreiche ORAM Kostenana- lyse mit der die Laufzeit von RAM-SC in beliebigen Einsatzszenari- en abgeschätzt werden kann. Diese Abschätzung ermöglicht es uns, den ersten voll automatisierten Compiler für RAM-SC vorzustellen, der sämtliche Speicherzugriffe in einem gegebenen Programmcode erfasst und die effizientesten ORAMs ermitteln kann. Zusammenfassend trägt diese Arbeit dazu bei, MPC praxistaug- licher und zugänglicher für Nicht-Experten zu machen, indem die Entwicklungsprozesse automatisiert und die Effizienz von MPC An- wendungen gesteigert wird. vi E R K L Ä R U N G Hiermit erkläre ich, dass ich die vorliegende Arbeit – abgesehen von den in ihr ausdrücklich genannten Hilfen – selbständig verfasst habe. Niklas Büscher A K A D E M I S C H E R W E R D E G A N G 11/2013 - 10/2018 Technische Universität Darmstadt Wissenschaftlicher Mitarbeiter Promotionsstudium 04/2011 - 10/2013 Technische Universität Darmstadt Studium der Informatik Abschluss: Master of Science 08/2011 - 05/2012 University of Boulder, Colorado, USA Studium der Informatik 04/2008 - 03/2011 Technische Universität Darmstadt Studium der Informatik Abschluss: Bachelor of Science vii P U B L I C AT I O N S This thesis is based on the following publications: 1. Niklas Büscher and Stefan Katzenbeisser. “Faster Secure Compu- tation through Automatic Parallelization”. In: Proceedings of the 24th USENIX Security Symposium, USENIX Security 2015, Washington, D.C., USA. 2. Niklas Büscher, Andreas Holzer, Alina Weber, and Stefan Kat- zenbeisser. “Compiling Low Depth Circuits for Practical Secure Com- putation”. In: Proceedings of the 21st European Symposium on Research in Computer Security, ESORICS 2016, Heraklion, Greece. 3. Niklas Büscher, David Kretzmer, Arnav Jindal, and Stefan Kat- zenbeisser. “Scalable Secure Computation from ANSI-C”. In: Pro- ceedings of the 8th IEEE International Workshop on Informa- tion Forensics and Security, WIFS 2016, Abu Dhabi, United Arab Emirates. 4. Niklas Büscher, Martin Franz, Andreas Holzer, Helmut Veith, and Stefan Katzenbeisser. “On Compiling Boolean Circuits Opti- mized for Secure Multi-Party Computation”. In: Formal Methods in System Design 51(2). 5. Niklas Büscher and Stefan Katzenbeisser, “Compilation for Secure Multi-Party Computation”, Springer Briefs in Computer Science 2017, ISBN 978-3-319-67521-3. 6. Niklas Büscher, Alina Weber, and Stefan Katzenbeisser, “Towards Practical RAM-based Secure Computation”. In: Proceedings of the 23rd European Symposium on Research in Computer Security, ESORICS 2018, Barcelona, Spain. 7. Niklas Büscher, Daniel Demmler, Stefan Katzenbeisser, David Kretzmer, and Thomas Schneider. “HyCC: Compilation of Hybrid Protocols for Practical Secure Computation.” In: Proceedings of the 25th ACM SIGSAC Conference on Computer and Communica- tions Security, CCS 2018, Toronto, Canada. Further peer-reviewed publications published during my doctoral studies: 8. Johannes A. Buchmann, Niklas Büscher, Florian Göpfert, Ste- fan Katzenbeisser, Juliane Krämer, Daniele Micciancio, Sander viii Siim, Christine van Vredendaal, Michael Walter. “Creating Cryp- tographic Challenges Using Multi-Party Computation: The LWE Chal- lenge”. In: Proceedings of the 3rd ACM International Workshop on ASIA Public-Key Cryptography, AsiaPKC@AsiaCCS 2016, Xi’an, China. 9. Niklas Büscher, Stefan Schiffner, Mathias Fischer. “Consumer Pri- vacy on Distributed Energy Markets”. In: Proceedings of the Pri- vacy Technologies and Policy - 4th Annual Privacy Forum, APF 2016, Frankfurt, Germany. 10. Florian Kohnhäuser, Niklas Büscher, Sebastian Gabmeyer, Ste- fan Katzenbeisser. “SCAPI: A Scalable Attestation Protocol to De- tect Software and Physical Attacks”. In: Proceedings of the 10th ACM Conference on Security and Privacy in Wireless and Mo- bile Networks, WiSec 2017, Boston, USA. 11. Niklas Büscher, Spyros Boukoros, Stefan Bauregger, Stefan Kat- zenbeisser. “Two Is Not Enough: Privacy Assessment of Aggregation Schemes in Smart Metering”. In: Proceedings of the 17th Privacy Enhancing Technologies Symposium, PETS 2017, Minneapo- lis, USA. 12. Florian Kohnhäuser, Niklas Büscher, Stefan Katzenbeisser. “SALAD: Secure and Lightweight Attestation of Highly Dynamic and Disruptive Networks”. In: Proceedings of the 13th ACM Asia Con- ference on Computer and Communications Security, AsiaCCS 2018, Incheon, Republic of Korea. ix A C K N O W L E D G M E N T S Being at the final step of the doctorate allows me to thank every- one who supported me along the way. Foremost, I am very grateful to my advisor Stefan Katzenbeisser, for his invaluable guidance in and beyond research, the enjoyable working environment, and espe- cially for the many mountaineering trips to Kleinwalsertal. Further- more, I thank Florian Kerschbaum, for agreeing to be the second re- viewer and even more for going to any length in trying to keep me in academia. My thank also goes to Christian Reuter, Thomas Schneider, and Neeraj Suri for joining the defense committee. I owe my first contacts to research to Marc Fischlin and Michael Goesele, who both were independently willing to push undergradu- ate projects, jointly tackled with Benjamin, into papers and sponsor- ing our first conference visit. During my masters I was very warmly welcomed into the TK security group led by Stefan and Mathias, who gave me a first introduction into privacy research. Then, during my doctoral studies, I had the honor to work with ex- cellent colleagues in the SecEng team. Without revealing any details, e.g., with whom I shared the most coffee breaks, I have to say that it was my pleasure to work with André, Christian, Dominik, Erik, Florian, Heike, Marius, Markus, Nikolaos I., Nikolaos II., Nikolay, Philipp, Sebastian B., Sebastian G., Spyros, Tolga, and Ursula. Sim- ilarly, I would like to thank the ENCRYPTO group, namely, Ágnes, Amos, Christian, Daniel, Michael, Oleksandr, and Thomas for accept- ing me as an alien MPC researcher, sharing many conference visits, lunches, and coffees, which frequently led to fruitful discussions. Likewise, I am very thankful for my co-authors and students, most notably Alina for her passion and endurance in performing circuitry, Andreas for introducing me to CBMC-GC, which provides a founda- tion for the tools developed as part of this thesis, and David for his commitment in compiler development as well as his exceptional ex- pertise in C++ and information flow. Many thanks go to my family for their continuous encouragement, teaching me open-mindedness, and the importance of education. But most of all, I am grateful for my wife Katharina, especially for her never-ending support and patience throughout the past years. x C O N T E N T S 1 introduction 1 1.1 Research Goal and Contribution . . . . . . . . . . . . . 3 1.2 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . 5 i preliminaries 7 2 basic techniques 8 2.1 Digital Logic and Boolean Circuits . . . . . . . . . . . . 8 2.2 Secure Multi-Party Computation . . . . . . . . . . . . . 10 2.3 Oblivious RAM and RAM-SC . . . . . . . . . . . . . . . 17 2.4 Compilers for MPC . . . . . . . . . . . . . . . . . . . . . 22 2.5 Benchmarking Applications for MPC Compilers . . . . 25 3 from ansi c code to boolean circuits 32 3.1 Motivation and Overview . . . . . . . . . . . . . . . . . 32 3.2 CBMC-GC’s Compilation Chain . . . . . . . . . . . . . 34 3.3 Complexity of Operations in MPC . . . . . . . . . . . . 42 ii compilation and optimization for boolean cir- cuit based mpc protocols 45 4 compiling size-optimized circuits for constant- round mpc protocols 46 4.1 Motivation and Overview . . . . . . . . . . . . . . . . . 46 4.2 Circuit Minimization for MPC . . . . . . . . . . . . . . 49 4.3 Building Blocks for Boolean Circuit Based MPC . . . . 50 4.4 Gate-Level Circuit Minimization . . . . . . . . . . . . . 54 4.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . 58 4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . 63 5 compiling parallel circuits 65 5.1 Motivation and Overview . . . . . . . . . . . . . . . . . 65 5.2 Parallel Circuit Evaluation . . . . . . . . . . . . . . . . . 67 5.3 Compiler Assisted Parallelization Heuristics . . . . . . 68 5.4 Inter-Party Parallelization . . . . . . . . . . . . . . . . . 75 5.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . 81 5.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . 92 6 compiling depth-optimized circuits for multi-round mpc protocols 94 6.1 Motivation and Overview . . . . . . . . . . . . . . . . . 94 6.2 Compilation Chain for Low-Depth Circuits . . . . . . . 96 6.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . 106 6.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . 115 xi iii compilation and optimization for hybrid mpc protocols 116 7 compilation of hybrid protocols for practical secure computation 117 7.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 117 7.2 The HyCC MPC Compiler . . . . . . . . . . . . . . . . . 121 7.3 Protocol Selection and Scheduling . . . . . . . . . . . . 131 7.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . 136 7.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . 140 8 compilation for ram-based secure computation 148 8.1 Motivation and Overview . . . . . . . . . . . . . . . . . 148 8.2 Analysis and Optimization of ORAMs for Secure Com- putation . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 8.3 Automatized RAM-SC . . . . . . . . . . . . . . . . . . . 156 8.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . 158 8.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . 165 9 conclusion and future work 167 bibliography 170 xii A C R O N Y M S aig AND-Invert Graph abb Arithmetic Black Box asap As-soon-as-possible ast Abstract Syntax Tree bmc Bounded Model Checking cfg Control Flow Graph cgp Coarse-Grained Parallelization cnf Conjunctive Normal Form cnn Convolutional Neural Network csa Carry-Save Adder csn Carry-Save Network dag Directed Acyclic Graph dnf Disjunctive Normal Form dpf Distributed Point Function dsl Domain Specific Language fa Full Adder fgp Fine-Grained Parallelization fss Function Secret Sharing grr Garbled Row Reduction ha Half Adder ipp Inter-Party Parallelization lsb Least Significant Bit mpc Secure Multi-Party Computation oram Oblivious RAM pet Privacy-Enhancing Technology pir Private Information Retrieval xiii ppa Parallel Prefix Adder prf Pseudo Random Function rca Ripple Carry Adder rtt Round Trip Time se Symbolic Execution simd Single-Instruction-Multiple-Data ssa Single Static Assignment tpc Secure Two-Party Computation xiv “Die [...] Gestaltung von Datenverarbeitungssystemen sind an dem Ziel auszurichten, so wenig personenbezogene Daten wie möglich zu erheben, zu verarbeiten oder zu nutzen [...], soweit dies nach dem Verwendungzweck möglich ist und keinen im Verhältnis zu dem angestrebten Schutz- zweck unverhältnismäßigen Aufwand erfordert.” — Bundesdatenschutzgesetz (BDSG) §3a Datenvermeidung und Datensparsamkeit 1 I N T R O D U C T I O N The continuous grow of data gathering and processing, which is fired by cheap sensors (e.g., in smart phones and wearables), cheap storage costs, and efficient machine learning algorithms enables many use- ful applications and powerful online services. However, this form of heavy data processing, often referred to as ‘Big Data’, is also a huge risk for the individual’s privacy because users of these services be- come more and more transparent and reveal possibly sensitive data to an untrusted and possibly unknown service provider. Even when not assuming any malicious interest of the data collectors, the gath- ered data is often centrally stored, and thus prone to many possible breaches, e.g., hackers exploiting vulnerabilities, maliciously acting employees, requests by state agencies, or insufficient data disposal management. Currently, all these risks are mostly ignored by the service providers as well as the users. The former have (often legitimate) commercial in- terests in their business use cases and the latter are either unaware of the actual value of their data, lack alternatives, or have no possibilities to opt-out if they want to participate in nowadays life. Consequently, we observe a sincere conflict between business interest and the right on privacy when processing personal data. Yet, there is a solution to this dilemma: Namely, since the 1980’s it is (theoretically) known that any computation over sensitive data from multiple parties can be performed securely, such that the participating parties do not learn more about the inputs of the other parties from the computation than they can already derive from the output [Yao82; Yao86]. This form of computation is known as Secure Computation or Secure Multi-Party Computation (MPC) and can be explained by an example application: Consider two parties Alice and Bob that want to schedule a joint meeting. Yet, Alice does not want to reveal her own schedule, i.e., her availability, to Bob, as it might contain sensitive information. For the same reasons, Bob prefers to maintain his schedule secret. As a solu- tion they can involve a third person, e.g., Charlie, whom they both 1 introduction 2 trust. Alice and Bob can send their private schedule, i.e., availabil- ity slots, to Charlie, who identifies a matching time slot and reveals this slot to Alice and Bob. Unfortunately, a third party that is trusted by both parties barely exists in practice. MPC solves this problem by simulating1 such a trusted third party using cryptographic protocols. These protocols are run between the two parties, guarantee correct- ness of the computation and privacy of the users’ data. Consequently, MPC is a powerful Privacy-Enhancing Technology (PET), as the data at the service providers can be processed under encryption, while almost all functionality required for business use cases can be main- tained. Naturally, the questions arises: If such powerful cryptographic tech- niques are available, why are they barely used in practice? The answer is twofold. First, MPC protocols have noticeable deployment costs in computation and communication. Second, creating, i.e., developing efficient applications for MPC by hand, as it had often be done pre- viously, is a tedious and error-prone task. In this thesis, we address both challenges by presenting a multitude of compilation techniques and implement these in compilers that automatically synthesize optimized MPC applications for different deployment scenarios from high-level code. By au- tomatizing the creation of MPC applications we significantly lower the entry barrier to MPC. Our work is of practical relevance because of two main reasons: First, applied cryptography and IT infrastructures in general are get- ting more and more complex. To handle this growing complexity, sim- pler developer interfaces and further automatization are desirable to create secure applications. This situation has been identified by indus- try and academia alike. For example, replacements for standard cryp- tography libraries have been proposed that provide clearer interfaces to developers, e.g., [BLS12], and also dedicated tools have been de- veloped that synthesize secure cryptographic implementations, e.g., [Krü+17]. Thus, automated compilation for MPC contributes in this direction, as it reduces the complexity when designing new PETs. Second, the right to privacy is known for ages in most civiliza- tions and therefore most countries have issued data protection laws that regulate the (commercial) use of data with different scopes and strengths. For example, an excerpt of the German privacy law, which has recently been adapted to the European General Data Protection Regulation (GDPR), is printed at the beginning of this chapter. Loosely translated, this excerpt says that all data processing systems should be developed with the goal to use, store, and process as little per- sonal data as possible, if this is reasonable given the technical efforts. The proof-of-concept compilers developed as part of this thesis illustrate that the technical efforts to implement privacy-preserving computa- tion are reasonable and consequently the use of MPC in highly sensi- 1 Not meant in the cryptography way. 1.1 research goal and contribution 3 tive areas, e.g., genome processing or the handling of health records, should be considered. 1.1 research goal and contribution The goal of this thesis is the design, development, and evaluation of compilation techniques for efficiency improvements of MPC, when compiling from a high-level programming language. In this section, we specify this goal, outline our contributions, and illustrate the con- nection to related work. secure multi-party computation. In MPC, two main re- search directions are distinguished. First, MPC protocols dedicated for specific applications have been developed, e.g., for recommender systems [KP08] or privacy-preserving face recognition [Erk+09]. Sec- ond, generic protocols have been proposed that allow to perform any computation between two or more parties securely2. Dedicated pro- tocols require cryptographers to prove their correctness and security, yet in principal have the possibility to outperform the generic tech- niques. Generic protocols only need to be proven secure once and then offer significantly more versatility because any application can realized on top of these protocols without further security proofs, and thus without expert knowledge in cryptography. Due to these reasons, many generic protocols using different cryptographic prim- itives have been proposed, e.g., [Ara+17; BLW08; Dam+12; GMW87; Yao82], and also many theoretical and practical optimizations, e.g., [Bel+13; KS08; Pin+09; ZRE15] made these protocols ready for prac- tice. The focus of this thesis is the compilation for generic MPC pro- tocols.3 application descriptions and compilation. At the heart of (generic) MPC is the ability to efficiently represent the function- ality to be computed. Most existing protocols require either a for- mulation in terms of a Boolean or an Arithmetic circuit. Circuits are acyclic graphs, where values flow along the edges, i.e., ‘wires’, and are processed at nodes, representing either Boolean functions (in case of Boolean circuits), e.g., logical AND, or basic Arithmetic operations (in case of Arithmetic circuits). Hand-coding these circuits, i.e., effi- ciently formulating the function to be computed in terms of basic Boolean or Arithmetic operations, is practically infeasible even for moderately complex functions, as the manual construction of efficient 2 The term secure is defined more precisely in Section 2.2. 3 The tools developed as part of this thesis are creating applications for MPC with only two parties, commonly referred to as Secure Two-Party Computation (TPC). Nevertheless, we remark that the ideas presented in this work transfer for protocols with any number of parties. Because of this and for the sake of generality we will use the term MPC in the remainder of this thesis. 1.1 research goal and contribution 4 circuits for MPC is a complex and time-consuming task that requires expert knowledge in hardware design. Therefore, the first practical framework (protocol implementation) [Mal+04] for Yao’s Garbled Cir- cuits protocol [Yao86], which is one of the most studied MPC pro- tocols, also provided a rudimentary compiler for a domain specific language. In following works, e.g., [Hol+12; Hua+11b], compilation has been identified as an independent task, separated from MPC pro- tocol implementations. Since then, a multitude of compilers has been proposed. We give a detailed overview and classification of these in Section 2.4. research goal In this thesis, we study compilation approaches that translate a high-level language (ANSI C) into optimized circuits suiting the requirements of different classes of MPC protocols with a major focus on the optimization for Boolean circuit based protocols. As such, we aim at compiling circuit descriptions that are compara- ble or better than previous handmade constructions, given a suitable high-level description. Consequently, we aim at increasing both, us- ability and efficiency of MPC. contributions . A major part of this thesis deals with the com- pilation and optimization for Boolean circuit based MPC protocols. Optimizing the created circuits is necessary, as MPC is still multi- ple orders of magnitude slower than generic computation. Therefore, we first study hand-optimized building blocks and a fixed-point op- timization algorithm that adapts circuit synthesis techniques, known from logic or chip design [Gaj+12; She93], for their use in compilation for MPC. We implement and evaluate these techniques in the ANSI C compiler CBMC-GC by Holzer et al. [Hol+12] and illustrate their capabilities to compile efficient circuits that are capable of outper- forming commercial hardware synthesis tools. As such, we present a compile-chain that creates small circuits from a standard program- ming language. These circuits are useful for all constant round MPC protocols that operate on Boolean circuits, e.g., Yao’s Garbled Circuits protocol. Although creating highly optimized applications, we observed that MPC in general, and Yao’s protocol in particular, still require substan- tial computational resources. Therefore, we study parallelization tech- niques for Yao’s protocol. We propose two parallelization approaches and introduce a new parallel protocol variant that allows to profit from parallelization even in single core settings. Furthermore, we present the first complete and automatized compile chain that cre- ates and partitions circuits from standard ANSI C code that can be evaluated efficiently in parallel MPC frameworks. While working on improving the application efficiency in Yao’s pro- tocol, many further MPC protocols also advanced and became of prac- 1.2 thesis outline 5 tical relevance. Especially multi-round protocols based on Boolean circuits, such as GMW [GMW86] and TinyOT [Nie+12] improved sig- nificantly due to the advancements in Oblivious Transfers [Ash+13]. Also newer and faster protocols, e.g., [Ara+17], following the same multi-round paradigm have been proposed. We address these devel- opments and present an optimizing compiler from ANSI C that cre- ates depth-minimized Boolean circuits for MPC. For this purpose, we present new depth-minimized building blocks and adapt high- and low-level optimization techniques, which allow to automatically gen- erate circuits that outperform previously handmade constructions. Recently also hybrid protocols, i.e., MPC protocols that combine different techniques, e.g., Boolean with Arithmetic circuits or Boolean circuits with Oblivious RAM (ORAM) [GO96], gained noticeable trac- tion [DSZ15; Gor+12; Liu+14]. Therefore, we further extend the pre- viously developed compilation chain to compile hybrid protocols. To achieve this goal and to also cope with the development of increas- ing circuit sizes, we present the first scalable compiler that allows to compile, optimize, and partition hybrid MPC protocols consisting of the combination of both Boolean and Arithmetic circuits. Finally, we study the compilation and optimization for ORAM as- sisted MPC protocols, also known as RAM-SC [Liu+14], which are necessary for data intensive applications. For this purpose, we present a cost analyses of relevant ORAM schemes, optimize these, and intro- duce the first optimizing compilation chain for RAM-SC. In summary, we make a wide range of MPC protocols accessible for non-domain experts. 1.2 thesis outline The remainder of this thesis is structured in three parts. In Part I – Preliminaries we discuss basic techniques and notations. In Chapter 2 we recap the basics of Boolean circuit design, give an overview of MPC protocols relevant for this work, and describe re- lated compilers for MPC. Moreover, we describe example applications that are used as MPC benchmarks throughout this thesis. Then, in Chapter 3 we describe a compiler, named CBMC-GC [Hol+12], which translates ANSI C into Boolean circuits. This compiler is adapted and extended in the following chapters to compile and optimize circuits for the different protocols. In Part II – Compilation and Optimization for Boolean Circuit based MPC Protocols we present compilation techniques that focus on stan- dalone MPC protocols that use Boolean circuits as application de- scriptions. In Chapter 4 we present our first contribution by study- ing multiple size-minimizing optimization techniques for their adap- tion in circuit synthesis for MPC. The created circuits lead to ef- 1.2 thesis outline 6 ficiency improvements for all constant-round MPC protocols over Boolean circuits. Next, in Chapter 5, we study the parallelization of Yao’s Garbled Circuits and present a compilation chain that cre- ates circuits, which can be efficiently evaluated in parallel. In Chap- ter 6 we present a compilation and optimization approach that com- piles depth-minimized Boolean circuits. These are required for multi- round MPC protocols that are deployed practical network environ- ments. In Part III – Compilation and Optimization for Hybrid MPC Protocols we extend our work towards more advanced MPC protocols. In Chap- ter 7 we study the compilation of hybrid MPC applications that con- sist of Boolean and Arithmetic circuits, which allows to mix constant and multi-round MPC protocols efficiently. Furthermore, we present a compilation chain for RAM-SC that suits the requirements of data intensive applications in Chapter 8. Finally, in Chapter 9 we conclude our work and give an outlook on future research directions in compilation for MPC. Part I P R E L I M I N A R I E S 2 B A S I C T E C H N I Q U E S Summary: In this chapter, we describe basic techniques required to fol- low the ideas presented in this thesis. First, in Section 2.1 we describe the circuit computation model, and recap the properties of Boolean circuits, which are predominately used in this thesis. Then, in Section 2.2, we de- scribe three of the most studied secure computation protocols, namely Yao’s Garbled Circuits protocol and the Goldreich-Micali-Wigderson (GMW) pro- tocol, which are prototypic for constant-round and multi-round MPC proto- cols over Boolean circuits, as well as a multi-round additive secret sharing based protocol over Arithmetic circuits. Furthermore, we introduce the con- cept of Oblivious RAM (ORAM) in Section 2.3, and show how ORAM can be combined with secure computation protocols to build more efficient ap- plications. Afterwards, in Section 2.4 we present a comparison of related compilers for MPC. Finally, we discuss multiple applications that are used for benchmarking purposes in Section 2.5. 2.1 digital logic and boolean circuits Boolean circuits are a common representation of functions in com- puter science and a mathematical model for digital logic circuits. A Boolean circuit is a Directed Acyclic Graph (DAG) with l inputs, m outputs, and s gates. Technically, the graph consists of |V | = l+m+ s nodes and |E| edges, where each node can either be of type input, out- put or gate. Due to the relation of digital circuit synthesis, the edges are called wires. We note that a Boolean circuit, as defined above, is also referred to as a combinatorial circuit. Combinatorial circuits are different to sequential circuits, which can be cyclic and carry a state. As all MPC protocols described in this thesis evaluate combinatorial circuits, their computational model is the combinatorial circuit model. boolean gates . Each gate in a Boolean circuit has one or mul- tiple input wires and one output wire, which can be input to many subsequent gates. Each gate in a circuit represents a Boolean function fg, which maps k input bits to one output bit, i.e., g(w1,w2, . . . ,wk) : {0, 1}k → {0, 1}. In this thesis, we will only consider gates with at most two input wires, namely unary and binary gates. The relevant unary gate is the NOT gate (¬), while typical binary gates are AND (∧), OR (∨), and XOR (⊕). Moreover, we distinguish between linear1 (e.g., XOR) and non-linear (e.g., AND, OR) gates, as they have different 1 The Boolean function represented by linear gates can be expressed by a linear com- bination of its inputs over Z2, e.g., XOR(X, Y) = X+ Y mod 2. 8 2.1 digital logic and boolean circuits 9 evaluation costs in secure computation [KS08]. In some works on MPC, non-linear gates are also referred to as non-XOR gates. notation. For a function f, we refer to its circuit representation as Cf. We use s to denote the total number of gates in a circuit, also referred to as size, i.e., s = size(Cf). Moreover, we use snX to count the number of non-linear (non-XOR) gates per circuit. We denote the depth of a circuit by d and use dnX to denote its non-linear depth, which is the maximum depth of all gates connected to an output node, defined recursively for a gate g as: dnX(g) =  0 if g is an input node, dnX(w1) if g is an unary gate with input w1, max(dnX(w1),dnX(w2)) if g with inputs w1,w2 is linear, max(dnX(w1),dnX(w2)) + 1 if g with inputs w1,w2 is non-linear. Furthermore, we denote bit strings in lower-case letters, e.g., x. We refer to individual bits, which can be part of a bit string, by capital letters X and denote their negation with X. We refer to a single bit at position i within a bit string as Xi. The Least-Significant Bit (LSB) is denoted with X0. When writing Boolean equations, we denote AND gates with ·, OR gates with + and XOR gates with ⊕. Moreover, when useful, we abbreviate the AND gate A ·B with AB. integer representation. Integers are represented in binary form, as typical in computer science and logic synthesis. Hence, the decimal value of a bit string x is ∑n−1 i=0 2 i · Xi. This common repre- sentation has the advantage that arithmetic operations for unsigned binary numbers such as addition, subtraction, and multiplication can be reused for signed numbers. In the two’s complement, negative numbers are represented by flipping all bits and adding one: −x = x+ 1. In the following chapters, we assume a two’s complement rep- resentation when referring to negative numbers. fixed and floating point representation. To represent real numbers, we use two representations. The fixed point represen- tation has a fixed position of the radix point (the decimal point). Hence, a binary bit string is split into an integer part and a frac- tional part. The decimal value of a bit string in fixed point representa- tion is computed as ∑n−1 i=0 2 i−r · Xi, where r is the position (counted from the LSB) of the radix point. The IEEE-754 floating point rep- resentation [Soc85] is the standard representation of floating point numbers. It divides a bit string in three components, namely sign s, 2.2 secure multi-party computation 10 significant m and exponent e. Its real value is determined by com- puting (−1)s ·m · 2e. In the IEEE-754 standard, the bit-width of each component, as well as their range is determined. Moreover, various error handling behavior is specified, e.g., overflow or division by zero [Gol91]. 2.2 secure multi-party computation Secure Multi-Party Computation (MPC), also referred to as secure computation, has been proposed in the 1980s as a more theoretical construct [Yao82]; it only became practical in the last fifteen years. MPC protocols are cryptographic protocols run between two or more parties that allow to perform a joint computation of a functionality f(x1, x2, . . . ) over the private inputs x1, x2, . . . of the participating parties P1,P2, . . . with guaranteed correctness and privacy. Privacy in MPC is understood as the guarantee that participating parties do not learn more about the other party’s inputs than they could already de- rive from the observed output of the joint computation. Thus, infor- mally speaking, MPC allows collaboration among mutually distrust- ing parties by realizing a virtual trusted party that receives the input of all parties, computes the functionality, and returns the output to the parties. In MPC different adverserial models are distinguished. Most rel- evant is the semi-honest (passive) model, where the adversary tries to learn as much from the protocol execution as possible, yet does not deviate from the protocol itself. This is opposed to the malicious model, where the adversary is allowed to actively violate the protocol. Examples of further adversarial models are the covert model by Au- mann and Lindell [AL07] or the dual execution approach that leaks at most one bit by Huang et al. [HKE12]. Examples for further security properties of MPC protocols that are currently mostly of theoretic in- terest are adaptive security (i.e., the adversary is allowed to choose its input x while running the protocol), fairness (i.e., either all parties receive the output of the computation or no party learns a single bit), guaranteed output delivery, and robustness [CDN15]. As we focus on compilation in this thesis, we only give a high-level description of three MPC protocols, where each is a representative for a different class (in their application description and cost model) of MPC protocols. For simplicity and explanatory reasons, we focus on two-party protocols secure in the semi-honest adversary model, as these are already sufficient to follow the ideas presented in this the- sis. Typically, MPC protocols in the semi-honest model are building blocks for protocols secure against malicious adversaries and are also used for many privacy-preserving applications on their own. We de- scribe the two most prominent MPC protocols over Boolean circuits, namely Yao’s Garbled Circuits and the GMW protocol. Moreover, in 2.2 secure multi-party computation 11 Chapter 7, we study the compilation of hybrid protocols that consist of multiple protocols, which involve Yao’s protocol, GMW, as well as an additive secret sharing based protocol. Therefore, we also give an introduction in MPC based on additive secret sharing. We begin with an introduction into Oblivious Transfer, which is a building block used for many MPC protocols and also used by the protocols described in the remainder of this section. 2.2.1 Oblivious Transfer Oblivious Transfer (OT) [Rab81; Kil88] is one of the most important building blocks for MPC protocols. An Oblivious Transfer protocol is a protocol in which a sender transfers one of multiple messages to a receiver, but it remains oblivious to the sender which message has been selected and retrieved by the receiver. At the same time, the receiver is only able to learn the content of the selected message, yet not anything about the other messages offered by the sender. In this thesis, we mostly use 1-out-of-2 OTs, where the sender in- puts two l-bit strings m0,m1 and the receiver inputs a bit C ∈ {0, 1}. At the end of the protocol, the receiver obliviously receives mC such that neither the sender learns the choice C, nor the receiver learns anything about the other message m1−C. In a classic result, Impagliazzo and Rudich [IR89] showed that OT cannot be constructed from one-way functions in a black-box man- ner. Moreover, as non-black-box constructions have also not been pre- sented, it was assumed that OT can only be constructed from asym- metric cryptography and thus was assumed to be very costly. How- ever, in 2003 Ishai et al. [Ish+03] presented the idea of OT Extension, which significantly reduces the computational costs of OTs for most interesting applications of MPC. In OT Extension, a constant number κ, i.e., the security parameter (e.g., κ = 128 bit), of OTs are realized using a traditional OT protocol based asymmetric encryption and re- ferred to as Base OTs. These Base OTs establish a symmetric key of length κ that can subsequently be used to execute a larger number of OTs with comparably cheap symmetric cryptography. Since Base OTs only need to be established once between two parties, similar to a key exchange protocol, their costs becomes negligible through amor- tization in many applications. Finally, we remark that variants of OT have been proposed that are beneficial for MPC, such as correlated or random OT [Ash+13]. As a consequence, recent implementations of OT Extension are capable of performing multiple millions of OTs per second on a single core [Ash+13; KOS15]. 2.2 secure multi-party computation 12 2.2.2 Yao’s Garbled Circuits Protocol Yao’s garbled circuits protocol, proposed in the 1980s [Yao86], is the most studied secure two-party computation protocol, secure in the semi-honest model. The protocol is run between two parties P0, P1 and operates on functionality descriptions in form of Boolean cir- cuits over binary Boolean gates. To securely evaluate a functionality f(x,y) over their private inputs x and y, both parties agree on a circuit Cf(x,y), which can be seen as the machine code for the protocol. During protocol execution, one party becomes the circuit genera- tor (the garbling party), the other the circuit evaluator. The generator initializes the protocol by assigning each wire wi in the circuit two random labels w0i and w1i of length κ (the security parameter), rep- resenting the respective Boolean values 0 and 1. For each gate the generator computes a garbled truth table. Each table consists of four encrypted entries of the output wire labels wγo . These are encrypted according to the gate’s Boolean functionality γ = g(α,β) using the input wire labels wαl and w β r as keys. Thus, an entry gttαβo in the table is encrypted as gttαβo = Ewαl (Ewβr (w g(α,β) o )). After their creation, the garbled tables are randomly permuted and sent to the evaluator, who, so far, is unable to decrypt a single row of any garbled table due to the random choice of wire labels. To initiate the circuit evaluation, the generator sends her input bits x in form of her input wire labels to the evaluator. Moreover, the wire labels corresponding to the evaluator’s input y are transferred via an OT protocol, with the generator being the OT sender, who inputs the two wire labels, and evaluator being the OT receiver, who inputs its input bits of the computation. After the OT step, the evaluator is in possession of the garbled circuit and one input label per input wire. With this information the evaluator is able to iteratively evaluate the circuit by decrypting a single entry of each garbled table, starting from input wires and ending at the output wires. Due to the use of a double-encryption scheme, the evaluator is unable to decrypt more than one entry in each garbled table. Once all gates are evaluated, all output wire labels are known to the evaluator. In the last step of the protocol, the generator sends an output description table to the evaluator, containing a mapping between output label and actual bit value. The decrypted output is then shared with the generator. Note that this procedure can be adapted to provide different outputs to the two parties. Alternatively, the generator can encrypt the cleartext output bits in the garbled tables of all gates connected to output wires, which allows to reveal the output of the computation to the evaluator without further communication. Selective security of Yao’s Garbled Circuits in the semi-honest model has been proven by Lindell and Pinkas [LP09] and recently also adap- 2.2 secure multi-party computation 13 tive security has been proven by Jafargholi and Wichs [JW16]. Yao’s protocol has also been extended to be secure in the malicious model in multiple works, e.g., [Hua+14; HKE13; LP15; Lin16]. implementations and optimizations . Yao’s original proto- col has seen multiple optimizations in the recent past. Most important are point-and-permute [BMR90; Yao86], which allows an efficient per- mutation of the garbled table such that only one entry in the table needs to be decrypted by the evaluator, garbled-row-reduction [NPS99], which reduces the number of ciphertexts that are needed to be trans- ferred per gate, free-XOR [KS08], which allows to evaluate linear gates (XOR/XNOR) essentially for ‘free’ without any encryption or com- munication costs, and finally the communication optimal half-gate scheme [ZRE15], which requires only two ciphertexts per non-linear gate while being compatible with free-XOR. Important for practical implementations is the idea of pipelining [Hua+11b], which allows a parallel circuit generation and evaluation and which is necessary for the evaluation of larger circuits and a faster online execution of Yao’s protocol, as well as the idea of fixed-key garbling [Bel+13], which al- lows to achieve garbling speeds of more than 10 million gates per second on a single CPU core using the AES instructions of modern CPUs. cost model . Considering all optimizations mentioned above, then for a given Boolean function f(x,y) and its circuit representation Cf(x,y), the evaluation costs of a circuit in Yao’s protocol are domi- nated by the number of the input bits, and by the number of non-linear gates in the circuit. In an amortized setting, the input gates are trans- ferred via OT Extension. Current implementations achieve a speed of more than 10 million OTs per second and require a bandwidth of two ciphertexts per OT. The garbling and evaluation costs of lin- ear gates are negligible for most applications, as communication is the current practical bottleneck of Yao’s protocol. To garble a non- linear gate, two entries from the garbled table of length κ each have to be transferred. Assuming a standard key length of κ = 128 bit, then a garbling throughput of 10 million non-linear gates per second produces ≈ 2.8Gbit of data per second. Considering that the com- munication requirement of two ciphertexts per gate is a proven lower bound for known garbling schemes [ZRE15], minimizing the num- ber of non-linear gates in a circuit is an important optimization goal when compiling applications into circuits for Yao’s protocol. We re- mark that this optimization goal holds for almost all MPC protocols over Boolean circuits, as these have a mechanism similar to free-XOR, which allow the evaluation of linear gates without communication. 2.2 secure multi-party computation 14 2.2.3 Goldreich-Micali-Wigderson (GMW) Protocol The Goldreich-Micali-Wigderson (GMW) protocol [GMW87] also orig- inates from the 1980s and allows two parties to securely compute a functionality described in form of a Boolean circuit. In contrast to Yao’s protocol, which uses the idea of garbled tables, GMW is built on top of Boolean secret sharing. Each input and intermediate wire value is shared among the two parties using an XOR based sharing scheme over single bits, i.e., for every value V ∈ {0, 1} each party P ∈ {0, 1} holds a share VP, indistinguishable from random, with V = V0 ⊕ V1. Values can be revealed at the end of the computation by exchanging and XORing the shares. Given shared values, XOR gates can be evaluated locally without any communication between the parties by XORing the respective wire shares, e.g., both parties compute WP = UP ⊕ VP over their shares of U and V to compute shares of W = U⊕ V . An AND gate Z = X · Y requires the parties to run an interactive protocol. One approach is to perform an 1-out-of-4 OT protocol [SZ13], where the chooser inputs its share X0 and Y0, and the sender prepares the out- put accordingly. Thus, the sender samples a random Z1 and computes its input, i.e., different Z0’s, into the OT protocol according to Z0 = Z1 ⊕ ((X0 ⊕X1) · (Y0 ⊕ Y1)), such that Z0 ⊕Z1 = 1 iff X0 ⊕X1 = 1 and Y0 ⊕ Y1 = 1. The same functionality can be realized in a pre-processing phase, which is independent of the functionality to be computed and which enables a very fast online phase that only requires the computation of a few bit-wise operations. For this purpose Boolean multiplication triples [Bea92] are used. A multiplication triple consists of three bits A,B, and C, where C = A · B holds, that are shared between the parties. Given such a triple, the two parties compute an AND gate Z = X · Y, by opening (reconstructing) two bits E = A⊕ X and F = B⊕ Y and computing their shares of Z as ZP = P · E · F⊕ F ·AP ⊕ E ·BP ⊕CP. The multiplication triples can be generated in a preprocessing phase using the 1-out-of-4 OT protocol described above. implementations and optimizations . A first implementa- tion was given by Choi et al. [Cho+12] that was subsequently im- proved by Schneider and Zohner [SZ13]. Furthermore, Asharov et al. [Ash+13] showed that the triples can be precomputed using only two random OTs with a total communication of 2κ bits, where κ de- notes the key length in bits. A protocol secure against malicious ad- versaries, named TinyOT, was proposed by Nielsen et al. [Nie+12]. 2.2 secure multi-party computation 15 cost model . Since the multiplication triples, which are the most expensive part of the computation, can be precomputed, only four bits (two per party) for every AND gate need to be communicated in the online phase. Moreover, we observe that the computation ef- fort of AND and XOR gates is negligible in practice when compared to the communication costs. Thus, when considering a typical band- width (6 1Gbit) constrained deployment scenario, GMW outper- forms Yao’s Garbled Circuits in gate throughput significantly, as only a fraction bits per gate have to be transmitted. GMW is a multi-round protocol, where the number of rounds is depending on the non-linear circuit depth. All AND gates on the same layer in the circuit can be computed in parallel and in the same communication round. Input sharing is significantly cheaper than in Yao’s Garbled Circuits, as only one bit per input wire has to be transmitted. We note that the compu- tation and communication efforts in the preprocessing phase are com- parable to the garbling cost of a gate in Yao’s Garbled Circuits [DSZ15; Des+17]. In summary, the performance of the GMW protocol for a given circuit Cf is dominated by the total number of non-linear gates snX as well as the non-linear depth dnX of the circuit (number of layers of AND gates), whereas the number of inputs and outputs marginally influences the performance of the GMW protocol. This cost model is prototypic for multi-round MPC protocols over Boolean circuits. 2.2.4 MPC Based on Additive Secret Sharing In the later part in this thesis we make use of an MPC protocol based on additive linear secret sharing that evaluates Arithmetic cir- cuits consisting of only addition and multiplication gates. Many other MPC protocols with different secret sharing schemes have been pro- posed, and we refer the reader to the book of Cramer et al. [CDN15] for a detailed introduction into the topic of MPC and secret sharing. In this thesis, we make use of the two-party MPC protocol de- scribed in [Ata+04; DSZ15], where an l-bit value is shared additively in the ring Z2l . Thus, for every value v ∈ Z2l in the protocol, each party P ∈ {0, 1} holds a share vP with v = v0 + v1 (this and all follow- ing computations are performed modulo 2l). To enter a new value x into the protocol, i.e., to share a value, party P samples a random xP ∈ Z2l , computes x1−P = x − xP, and sends x1−P to the other party. To output a shared value x, the parties send their own share xP to the other party and locally compute x = x0 + x1. Additions can be performed locally by adding the respective shares. Thus, to compute z = x+ y over the shares of x and y, both parties compute zP = xP + yP. Multiplications are performed with an interactive pro- tocol, which can be precomputed using multiplication triples [Bea92; Gil99]. To compute z = x · y over shared values x and y, a multiplica- 2.2 secure multi-party computation 16 tion triple c = a · b is used, such that both parties compute and then reveal eP = xP − aP and fP = yP − cP. Given eP and fP, both parties compute their share zP of z as follows: zP = P · e · f+ f · aP + e · bP + cP. Multiplication triples can be generated in the preprocessing phase, independently of the computed functionality with the help of ho- momorphic encryption [Ata+04] and additive blinding. Alternatively, multiplication triples can be generated by performing a bit-wise OT based protocol [Gil99]. implementations . An implementation of the protocol described above is given by Demmler et al. [DSZ15]. Alternative implementa- tions of MPC protocols based on additive linear secret sharing are Sharemind [BLW08], which provides the richest set of functionalities, VIFF [Dam+09], and the SPDZ [Dam+12] implementation [KSS13] as well as FRESCO [Dam+16], which both provide security against ma- licious adversaries. cost model . The communication and computation costs in addi- tive secret sharing based MPC is dominated by the number of multi- plication gates and the multiplicative depth of the circuit. Similar to the Boolean circuit based protocols, linear operations, i.e., additions, are for free in the number of cryptographic operations and in communi- cation. Multiplications have a small online cost, as only the respective shares have to be transmitted and only cheap local operations are performed. Moreover, similar to GMW, every multiplicative layer in the circuit increases the number of communication rounds. The pre- computation phase is the most expensive part of the protocol execu- tion. For the generation of the multiplication triples, either multiple (packed) encryptions in a homomorphic encryption systems or O(l) OTs of length l have to be performed. 2.2.5 Hybrid Protocols The efficiency of applications generated by compilers for all afore- mentioned protocols is bound by the efficiency to represent elemen- tary operations, also referred to as building blocks, in their respec- tive cost model. Because of this, the protocols can have substantial difference in their runtime for the same application. For example, ex- pressing multiplications over nbit integers in Yao’s protocol or GMW requires Boolean circuits of size O(n2). Due to this reason, multi- ple applications based on hybrid protocols, i.e., a mix of multiple standalone MPC protocols, have been proposed to achieve better ef- ficiency by exploiting multiple function representations, e.g., [BG11; Hua+11a; Nik+13a; SK11]. An application independent combination 2.3 oblivious ram and ram-sc 17 of protocols has been studied in [BLR13; Hen+10; SKM11; KSS14]. Most of these works combine Yao’s Garbled Circuits with homomor- phic encryption, i.e., an asymmetric encryption scheme, which allows to perform operations on the ciphertexts that resemble arithmetic op- erations on the respective cleartexts, e.g., addition. In this thesis, we evaluate hybrid protocols consisting of all three aforementioned protocols, which has been described by Demmler et al. [DSZ15]. To combine multiple sequentially aligned MPC proto- cols into a hybrid protocol, an intermediate state has to be converted securely between the different protocols. Such an intermediate state consists of multiple values that are secretly shared between between the computing parties. In Yao’s protocol, the wire label computed by the evaluator and the mapping of wire label to cleartext value form a sharing of one bit. In the two other protocols, the sharing is more straight-forward, namely either an XOR (Boolean) sharing is used or an additive (Arithmetic) sharing. The authors proposed conversion protocols that securely transform one sharing into the other. For ex- ample, a Yao sharing can be converted into a Boolean sharing at prac- tically no cost by using the point-and-permute bits (cf. Section 2.2.2). Unfortunately, other conversions require communication and com- putation. For example, the conversion from an nbit Arithmetic shar- ing into n Yao sharings or n Boolean sharings can be realized by eval- uating a Boolean addition circuit. Thus, both parties have to enter the bit decomposition of their arithmetic shares as input into either GMW or Yao’s protocol and then evaluate an addition circuit. A more detailed explanation of these conversion protocols and their costs is given in [DSZ15]. 2.3 oblivious ram and ram-sc In Chapter 8, we will present a compilation approach for RAM based MPC, which is often referred to as RAM-SC. In this section we give an overview of relevant ORAMs used in RAM-SC, before describing RAM-SC itself. 2.3.1 Oblivious RAM (ORAM) Oblivious RAM (ORAM), first introduced by Goldreich and Ostro- vsky in the context of software protection and simulation of RAM programs [GO96], is a cryptographic primitive that allows to obfus- cate the access pattern to an outsourced storage to achieve memory trace obliviousness. Therefore, each logical access on some virtual ad- dress space is translated into a sequence of physical accesses on the memory, which appears to be random to observers, resulting in the security guarantee that two sequences of virtual accesses of the same length produce indistinguishable physical access patterns. 2.3 oblivious ram and ram-sc 18 ORAMs are commonly modeled as protocols between an ORAM client, who is the data owner, and an untrusted ORAM server, who provides the physical storage. Typically, an ORAM construction is comprised of two distinct algorithms, the initialization and the ac- cess algorithm. Thereby, the initialization algorithm introduces a new oblivious structure for a given number of elements, while the access algorithm performs an access to one of the elements in the virtual memory space using a sequence of accesses on the physical memory, hiding the accessed virtual index and also the type of operation, i.e, read or write. An ORAM has a capacity m, which describes the num- ber of data elements it can store. Moreover, most ORAMs require to store metadata for each data element, which in combination with the element itself is referred to as block. The design goals of standalone ORAM constructions are manifold, e.g., minimizing client side storage, communication or computation costs. Therefore, many optimized ORAMs have been proposed, e.g., [GO96; KLO12; LO13; Ste+13]. For their combination with MPC (de- scribed next), a different cost model applies, because the ORAM client has to be evaluated as a circuit. In this thesis, we study the most ef- ficient known ORAMs for MPC, namely Circuit-ORAM (C-ORAM) [WCS15], optimized Square-Root ORAM (SQ-ORAM) [Zah+16], FLO- RAM [Ds17a], and the FLORAM variant CPRG (FCPRG) [Ds17a]. circuit oram (c-oram). C-ORAM by Wang et al. [WCS15] is an MPC-optimized derivative of Path ORAM [Ste+13], which is the most practical known standalone ORAM to date. C-ORAM is a tree-based ORAM that stores m data elements in a binary tree structure of at most log2 (m) stages and an additional root level called stash, which can also be imagined as a cache. Each node in the tree is a smaller ORAM itself, called bucket ORAM, that stores multiple blocks. Each data element is randomly mapped to one of the leaf nodes, maintain- ing the invariant that an element with leaf identifier l is contained in one of the buckets on the path from the root node to leaf l or in the stash. To read or write an element in tree-based ORAMs, the ac- cording path from root to the leaf is read, a new leaf identifier for the read element is chosen, and the accessed path is moved to the stash. Moreover, after each access an eviction procedure is initiated, which writes blocks from stash into the tree while moving blocks as close as possible to their designated leaves. C-ORAM requires a recursive position map in RAM-SC, which as- sociates the virtual index of each element with the position in the tree. Henceforth, a tree-based ORAM has several construction parameters, including the size of the bucket ORAMs B, the number of recursive steps r, and a packing factor c describing the number of mappings contained in one block of the recursive ORAMs. 2.3 oblivious ram and ram-sc 19 square-root oram (sq-oram). SQ-ORAM was one of the two ORAMs introduced in the seminal paper by Goldreich and Ostro- vsky [GO96] and later optimized for MPC by Zahur et al. [Zah+16]. SQ-ORAM uses a fundamentally different strategy than Path ORAM. Its core idea is to randomly permute the memory and to periodically refresh this permutation. For m elements, the so-called permuted mem- ory has size m+ √ m, Furthermore, a shelter/stash for √ m elements is used. The simulation of a RAM program takes place in so called epochs of √ m steps, consisting of three phases: In a first step the memory is obliviously permuted using a permutation π that assigns each element a position in the permuted memory by using random tags assigned to each element. Afterwards, √ m virtual accesses can take place, during which the updated values are written to the shel- ter. To access an element at index v, first the entire shelter is scanned. If the element cannot be found, the permuted memory is accessed to retrieve the element at position π(v), otherwise, the element has previously been visited and can thus be found in the shelter, and a dummy access to the permuted memory is performed. As last step of an epoch, the permuted memory is updated according the shelter. Zahur et al. [Zah+16], removed the use of Pseudo Random Func- tions (PRFs) (needed for the permutation), dummy elements, expen- sive oblivious sorting algorithms, and identified public metadata. Fur- thermore, they introduced the usage of recursive maps to compute the mapping between virtual and physical addresses. floram . FLORAM, recently introduced by Doerner and she- lat [Ds17a], differs from the other ORAM constructions as it is built from Function Secret Sharing (FSS), introduced by Boyle et al. [BGI15]. FLORAM is a distributed ORAM [Ds17b; LO13], where the data is stored in a secret shared manner (XOR) between two servers. Ele- ments are accessed using Private Information Retrieval (PIR) tech- niques, i.e., a short query is evaluated on all elements on the server to extracted the desired element. A point function fα,β(x) evaluates to β if x = α and 0 otherwise. Informally, FSS allows to share a Distributed Point Function (DPF) in such a way, that the parties can evaluate the point function on arbitrary input, yet neither learn α nor β. This feature allows the client to send the servers specially crafted queries q0(i) or q1(i) using FSS to retrieve or to write an element at index i. FLORAM distinguishes a read-only memory (OROM) and write- only memory (OWOM). Data in both is stored using an XOR sharing, yet in OROM, each share is additionally masked using a PRF with a key known only to the storing party. After a number of write accesses, the OWOM memory is converted into the OROM. This process is re- ferred to as refresh. To read an index i, the client shares a DPF that evaluates to 1 on input i and to 0 otherwise. The servers evaluate 2.3 oblivious ram and ram-sc 20 the DPF on all indices, multiply the result with each associated ele- ment share and return their aggregated result to the client. Writing is performed using a similar approach. A conversion between OWOM and OROM is too expensive to be performed after every write access, therefore a stash is used that functions as a cache between refreshes. The stash is scanned when performing read accesses to identify up- dated elements. floram cprg (fcprg). FCPRG [Ds17a] is an extension to FLO- RAM. The most expensive computation on the client side of FLO- RAM is the creation of the FSS scheme, which requires to compute 2 · log2(m) PRFs for every access. In MPC the evaluation of the PRFs can render the scheme expensive and therefore, with FCPRG the au- thors propose to compute the PRFs locally, i.e., outside of MPC, with the trade-off thatO(log2(m)) interactions between the computing par- ties are required. 2.3.2 RAM based MPC (RAM-SC) MPC protocols evaluate functionalities represented as circuits. Cir- cuits allow to express arbitrary computations, yet random memory accesses have to be expressed as a chain of multiplexers of the com- plete memory, referred to as linear scan (LS). This limits MPC for applications that rely on dynamic memory accesses. Therefore, Gor- don et al. [Gor+12] proposed to combine MPC protocols with ORAM to enable dynamic memory accesses with sublinear overhead. The authors describe a RAM machine, where the circuit computes an oblivious machine that evaluates instructions and memory accesses. A complete RAM machine is often not necessary, and thus the so- called RAM-SC model was later refined by Liu et al. [Liu+14] for practical efficiency. Its major concepts are described in the following paragraphs. First, the parties performing the MPC protocol also act as distributed ORAM server, and the ORAM client is implemented as circuit evalu- ated by the MPC protocol itself. Thus, both roles are shared between the computing parties. Intuitively, privacy is preserved because the MPC protocol acts as a virtual third party emulating the ORAM client. Second, a program is evaluated by interweaving the MPC protocol with ORAM accesses. Consequently, a RAM-SC program consists of many small protocols that either perform a computation or an ORAM access. This behavior is exemplary illustrated in Figure 1. The construction of RAM-SC, as described, is very generic because it allows to combine different MPC protocols and ORAMs. When as- suming composable ORAMs, we observe that in one RAM-SC pro- gram multiple ORAMs of possibly different type can be used, e.g., one ORAM for each array in the input program. Moreover, as in stan- 2.3 oblivious ram and ram-sc 21 P1 P2 computa�ons x = 1023 oram_read(5) decrypt(b1, b6, b7) computa�ons x = array[5] MPC protocol (ORAM client) ORAM1 (server) ORAM2 (server) Figure 1: Exemplary and simplified illustration of RAM-SC. A program flow is illustrated that is computed within an MPC protocol, run be- tween two parties P1 and P2. At some point, a value is read from an array with virtual index 5. Therefore, a circuit representing the ORAM client functionality is executed that translates the virtual in- dex into multiple physical addresses. These addresses are revealed to both parties, who enter the blocks as input to the MPC protocol. Not shown here is, that often also an intermediate state has to be transferred between two MPC protocol. dalone ORAMs, the blocks stored on the ORAM server have to be encrypted. This can be realized by performing an encryption and de- cryption within a circuit, which requires (even highly optimized) a substantial amount of gates, e.g., 5000 non-linear gates to encrypt a single block of 128 bits AES [BP12], using a secret sharing scheme, e.g., XOR sharing [Ds17a], or by soldering the existing garbled labels based on the publicly revealed index [Zah+16]. In the XOR sharing approach a physical block is read by entering the shares as input to the MPC protocol, which are then recombined within the protocol. Similarly, to write to one or multiple blocks, the MPC protocol out- puts one share for each block to every party. When using the solder- ing approach, the circuit garbler re-uses the existing wire labels but remaps them according the accessed indices reveled to both parties, similar to a multiplexer (array) access with public index. We also re- mark that in RAM-SC, the ORAM access type, i.e., read or write, can be revealed to both parties, as the algorithm description is seen as public knowledge. This access type is also referred to as semi-private access [Ds17a]. security. RAM-SC provides the same privacy and correctness properties against semi-honest adversaries as traditional MPC pro- tocols [Gor+12]. RAM-SC with security against malicious adversaries has been studied in [Afs+15]. complexity. The computation and communication complexity of a RAM-SC protocol depends on the circuit complexity of the com- putation, the circuit complexity of the ORAM client, the number of protocol rounds, as well as additional ORAM protocol costs that are 2.4 compilers for mpc 22 performed outside of the MPC protocol. For ORAMs with less than O(m) computations or less thanO(m) bandwidth RAM-SC is (asymp- totically) more efficient than any circuit based MPC protocol. 2.4 compilers for mpc Due to the complexity of circuit design, multiple compilers for MPC have been proposed. In this section we give an overview and classifi- cation of these, which partially form the related work of this thesis. compiler classification. In general, compilers for MPC share many similarities with high-level synthesis tools [Gaj+12], known from digital circuit design. Nevertheless there are substantial differ- ences between these tools and compilers for MPC, which will be studied in detail in the following chapters. Therefore, we restrict our comparison and classification to compilers that have explicitly been proposed for their use in MPC. Compilers for MPC can be categorized based on whether they com- pile a (minimalistic) Domain Specific Language (DSL) or a widely used common programming language. Moreover, compilers can be inde- pendent or integrated into an MPC framework. Integrated compilers produce an intermediate representation, which is interpreted (instan- tiated by a circuit) only during the execution of an MPC protocol. These interpreted circuit descriptions commonly allow a more com- pact circuit representation. Independent compilers create circuits in- dependent from the executing framework, and thus have the advan- tage that produced circuits can be optimized to the full extent during compile time and are more versatile in their use in MPC frameworks. Especially for Arithmetic circuits it is often hard to make a strict separation. When compiling these circuits, compilers commonly tar- get the so called Arithmetic Black Box (ABB) model [DN03], which is an abstraction between compiler and protocol implementation. The ABB model provides representations for values and operations on these, e.g., addition and multiplication, without providing details about their protocol implementation to the compiler. In principle, an ABB is Turing complete with only addition and multiplication. Yet, practical efficiency is only achieved if further functionalities, such as comparisons, are also provided. We consider an Arithmetic circuit compiler to be independent if it compiles towards an ABB model only consisting of elementary functions, involving comparisons and possibly floating point operations, yet not high-level functionalities such as sorting or statistical tests, as for example provided by Share- mind [BLW08]. Some integrated compilers support the compilation of mixed-mode secure computation. Mixed-mode computation allows to write code that distinguishes between oblivious (private) and public computa- 2.4 compilers for mpc 23 tion. Thus, the public computation is performed locally in the clear, interweaved by private computations insides an MPC protocol. This leads to an even tighter coupling between compiler and execution framework, but allows to express a mixed-mode program in a single language. Moreover, some integrated compilers support the compila- tion for hybrid secure computation protocols or RAM-SC. Next, we give an overview of Boolean circuit and Arithmetic circuit compilers. boolean circuit compilers . We begin by studying compilers targeting Boolean circuit based MPC protocols that use a DSL as input language. The Fairplay framework by Malkhi et al. [Mal+04] started research on practical MPC. Fairplay compiles a domain spe- cific hardware description language called SFDL into a gate list for use in Yao’s protocol. Following Fairplay, Henecka et el. [Hen+10] presented the TASTY compiler, which compiles its own DSL, called TASTYL, into an interpreted hybrid protocol. The PAL compiler by Mood et el. [MLB12] aims at low-memory devices as the compilation target. PAL also compiles Fairplay’s hardware description language. The KSS compiler by Kreuter et al. [KSS12] is the first compiler that shows scalability up to a billion gates and uses gate level optimiza- tion methods, such as constant propagation and dead-gate elimina- tion. KSS compiles a DSL into a flat circuit format. TinyGarble by Songhori et al. [Son+15] uses (commercial) hardware synthesis tools to compile circuits from hardware description languages such as Ver- ilog or VHDL. On the one hand, this approach allows the use of a broad range of existing functionality in hardware synthesis, but also shows the least degree of abstraction by requiring the developer to be experienced in hardware design. We remark that high-level synthesis from C is possible, yet, as the authors note, this leads to significantly less efficient circuits. Recently, Mood et al. [Moo+16] presented the Frigate compiler, which aims at very scalable and extensively tested compilation of another DSL. Frigate and TinyGarble produce com- pact circuit descriptions that compress sequential circuit structures. Examples for mixed-mode and hybrid compilers involving Boolean circuits from a DSL are L1, Wysteria, ObliVM, and Obliv-C. The L1 compiler by Schröpfer et al. [SKM11] compiles a DSL into a hybrid protocol involving homomorphic encryption. Wysteria by Rastori et al. [RHH14] creates Boolean circuits applied to GMW from a mixed- mode DSL that is supported by a strong formal calculus. ObliVM by Liu et al. [Liu+15b] extends SCVM [Liu+14], which both compile a DSL that support the combination of oblivious data structures with MPC. This approach allows the efficient development of oblivious algorithms. Yet, both compilers provide only very limited gate and source code optimization methods. The Obliv-C compiler by Zahur and Evans [ZE15] also supports oblivious data structures, but fol- lows a different, yet elegant source-to-source translation approach by 2.4 compilers for mpc 24 C om pi le r La ng ua ge Ta rg et Pr ot oc ol In te rp re te d M ix ed -m od e M at ur ity [M oo +1 6] C od e Le ve lO pt im iz at io n G at e Le ve lO pt im iz at io n C BM C -G C ’1 2 A N SI C Ya o no no ye s gl ob al de ad ga te el im in at io n, co ns ta nt pr op ag at io n, th eo re m re w ri ti ng ,S A T sw ee pi ng K SS ’1 2 D SL Ya o no no no no co ns ta nt pr op ag at io n, de ad ga te el im in at io n PC F’ 1 3 A N SI C Ya o ye s ye s no no - W ys te ri a’ 1 4 D SL G M W ye s ye s n/ a lim it ed lo ca lc on st an t pr op ag at io n, de ad ga te el im in at io n O bl iv C ’1 5 D SL A N SI C Ya o ye s ye s no lo ca l lo ca la nd lim it ed [M oo +1 6 ] co ns ta nt pr op ag at io n, de ad ga te el im in at io n O bl iV M ’1 5 D SL R A M -S C ye s ye s no no - Ti ny G ar bl e’ 1 5 Ve ri lo g V H D L Ya o & G M W no no ye s n/ a de ad ga te el im in at io n, co ns ta nt pr op ag at io n, re w ri ti ng ,S A T sw ee pi ng Fr ig at e’ 1 6 D SL Ya o no no ye s lo ca l lo ca lc on st an t pr op ag at io n, de ad ga te el im in at io n C ir cG en ’1 7 A N SI C Ya o no no ve ri fie d gl ob al de ad ga te el im in at io n, co ns ta nt pr op ag at io n Ta bl e 1 :C om pa ri so n of re ce nt Bo ol ea n ci rc ui t co m pi le rs fo r M PC .C om pa re d is th e so ur ce la ng ua ge ,w he th er th ey co m pi le to co m pl et e ci rc ui t or an in te rm ed ia te re pr es en ta ti on th at is in te rp re te d du ri ng ru nt im e, w he th er th ey su pp or tm ix ed -m od e M PC ,t he ir m at ur ity (c om pi la ti on co rr ec tn es s) ba se d on th e st ud y by M oo d et al . [M oo +1 6 ], an d th e ap pl ie d op ti m iz at io n te ch ni qu es on th e so ur ce -c od e le ve l (c on st an t pr op ag at io n an d co ns ta nt fo ld in g) an d on th e ga te le ve l( lo gi c m in im iz at io n) . 2.5 benchmarking applications for mpc compilers 25 compiling a modified variant of C into efficient mixed-mode applica- tions. Very recently, EzPC [Cha+17] has been presented that compiles mixed-mode hybrid protocols involving Boolean and Arithmetic cir- cuits. The CBMC-GC compiler [Hol+12] is the first compiler that creates circuits for MPC from a common programming language (ANSI C). CBMC-GC follows the independent compilation approach and pro- duces a single circuit description. Moreover, CBMC-GC applies source code optimization and provides a powerful symbolic execution. This compiler is extended throughout this thesis. The PCF compiler by Kreuter et al. [Kre+13] also compiles C, using the portable LCC com- piler as a frontend. PCF compiles an intermediate bytecode repre- sentation given in LCC into a interpreted circuit format. PCF shows greater scalability than CBMC-GC, yet only supports comparably lim- ited optimization methods that are only applied locally for every func- tion. A (partially) formally verified compiler, named CircGen, has been proposed by Almeida et al. [Alm+17] that extends the Com- pcert compiler [Ler06] for ANSI C by a new backend that compiles Boolean circuits. A detailed overview of Boolean circuits compilers for MPC is given in Table 1. arithmetic circuit compilers . The Viff compiler [Dam+09] jointly with the proposition of a DSL for MPC [NS07] are two of the first works in the direction of compilation for Arithmetic circuit based MPC. The most prominent compiler is the Sharemind compiler by Bogdanov et al. [BLW08], which provides a production ready mixed- mode compiler suite (and protocol framework) for a DSL called Se- creC. The PICCO compiler by Zhang et al. [ZSB13] implements a mixed-mode programming environment for C that supports paral- lelization. The Armadillo compiler by Carpov et al. [CDS15] and the ALCHEMY compiler by Crockett et al. [CPS18] target the creation of Arithmetic circuits for homomorphic encryption. Recently, also an integrated compiler for SPDZ has been proposed [Spd]. 2.5 benchmarking applications for mpc compilers MPC has many applications, e.g., privacy-preserving biometric au- thentication [Erk+09], private set intersection [FNP04], or secure auc- tions [Bog+09]. For research purposes, some (parts of) these appli- cations have become popular for benchmarking purposes. The most prominent example is the AES block cipher, which is the de facto stan- dard benchmark for MPC protocols. In compiler research on MPC, multiple different functionalities are commonly compiled into cir- cuits, which are then compared regarding their properties, i.e., size, depth and fraction of non-linear gates. 2.5 benchmarking applications for mpc compilers 26 In this section, we give an overview of these commonly bench- marked functionalities, ranging from very small snippets, e.g., inte- ger addition, to larger functionalities, e.g., IEEE-754 compliant float- ing point operations or biometric authentication. In later chapters, we use these functionalities to evaluate the techniques presented in this book. Here, we discuss the functionalities together with their com- plexity and motivate why these functionalities are relevant in the con- text of MPC and the development of compilers. Common benchmark functionalities are: • Integer arithmetic. Due to their (heavy) use in almost every com- putational problem, arithmetic building blocks are of high im- portance in high-level circuit synthesis and are often bench- marked individually. Most common building blocks are addi- tion, subtraction, multiplication, and division. When studying these building blocks, one should distinguish the results for dif- ferent input and output bit widths. Namely, arithmetic opera- tions can be differentiated in overflow-free operations, e.g., allo- cating 2n bits for the result of a n×n bit multiplication, and op- erations with overflow, e.g., when only allocating n bits for the same multiplication. In later chapters we explicitly state which input and output bit-widths are studied. To evaluate the scala- bility of compilers, multiple sequentially (in-)dependent arith- metic operations can be compiled as a single functionality. • Matrix multiplication. Algebraic operations, such as matrix or vector multiplications, are building blocks for many privacy- preserving applications, e.g., for feature extractors in biometric matching or the privacy friendly evaluation of neural networks, and have therefore repeatedly been used to benchmark compil- ers for MPC [Dem+15; Hol+12; Kre+13]. From an optimizing compilers perspective, matrix multiplication is a comparably simple task, as it only involves the instantiation of arithmetic building blocks. However, it is very well suited to show the scal- ability of compilers or the capability to fuse multiple arithmetic operations into single statements, as required for depth mini- mization (see Chapter 6). Matrix multiplication can be parametri- zed according the matrix dimensions, the used number repre- sentation (integer, fixed or floating point) and its bit-width. • Modular exponentiation. Modular exponentiation, i.e., xymodp, is relevant in secure computation, as it is a building block for various cryptographic schemes. For example, blind signatures can be realized with RSA using modular exponentiation. Blind signatures allow a signing process, where neither the message is revealed to the signing party, nor the signing key is revealed to the party holding the message. Therefore, modular exponen- 2.5 benchmarking applications for mpc compilers 27 tiation has been used multiple times to study the performance of MPC [Bel+13; DSZ15; KSS12]. • Distances. Various distances need to be computed in many pri- vacy preserving protocols and are therefore relevant for MPC, e.g., in biometric matching applications or location privacy ap- plications. The Hamming distance is a measure between two bit strings. Measured is the number of pairwise differences in every bit position. Due to its application in biometrics, the Hamming distance has often been used for benchmarking MPC compilers, e.g., [Hol+12; KSS12; Moo+16]. The Manhattan distance distMH = |x1 − x2|+ |y1 − y2| between two points a = (x1,y1) and b = (x2,y2) is the distance along a two dimensional space, i.e., along the Manhattan grid, when only allowing horizontal or vertical moves. The Euclidean distance is defined as distED = √ (x1 − x2)2 + (y1 − y2)2. Due to the complexity of the square root function, it is com- mon in MPC to benchmark the squared Euclidean distance sep- arately, as its computation is usually sufficient when comparing multiple distances. All distances can be parameterized accord- ing to the used input bit-widths. • Biometric matching. In biometric matching a party matches one biometric sample (a vector of features) against the other’s party database of biometric templates. Example scenarios are face- recognition or fingerprint-matching [Erk+09]. One of the main concepts is the computation of a distance (see above) between a sample and all database entries. Once all distances are com- puted, the minimal distance determines the best match. The biometric matching application is very interesting for bench- marking, as it involves many parallel arithmetic operations, as well as a large number of sequential comparisons. The biomet- ric matching application can be parameterized by the database size, the sample dimension (number of features per sample), the bit-width of each feature, and the used distance. • Database analytics. Performing data analytics on sensitive data has numerous applications and therefore many privacy-preserv- ing protocols have been studied, e.g., [Bog+15; DHC04]. Using generic MPC techniques is of special interest in analytics, as it allows to perform arbitrary computations, e.g., hypothesis test- ing, or allows to add data perturbation techniques, e.g., differen- tial privacy, before releasing the result with minimal effort. This task can be parametrized according the database size(s) and the applied analyses. 2.5 benchmarking applications for mpc compilers 28 • Fixed and floating point arithmetics. Fixed and floating point num- ber representations allow the computation on real numbers us- ing integer arithmetic. Therefore, these representations are nec- essary for all applications where numerical precision is required, e.g., in privacy preserving statistics. The floating point represen- tation is the more versatile representation, as it allows to repre- sent a larger range of values, whereas fixed point arithmetic can be realized with significant less costs and is therefore of interest in MPC. Floating point operations are also very suited to evalu- ate the gate level optimization methods of compilers since they require many bit operations. When implementing floating point arithmetic we follow the IEEE-754 standard. • Gaussian elimination. Solving linear equations is required in many applications with Gaussian elimination being the most studied solving algorithm. Due to its wide range of use, it is also of relevance in privacy-preserving applications. In this thesis we evaluate our compiler using a textbook Gauss solver with par- tial pivoting. The task can be parameterized by the number of equations. • Location-aware scheduling. Privacy-preserving availability schedul- ing is another application that can be realized with MPC [Bil+11]. The functionality matches the availability of two parties over a number of time slots, without revealing the individual schedule to the other party. Location-aware scheduling also considers the location and maximum travel distance of the two parties for a given time slot. Therefore, the functionality outputs a matching time slot where both parties are available and in close proximity to each other (if existent). The functionality can be parameter- ized by the number of time slots used per party, the representa- tion of locations, and the distance measure, e.g., Manhattan or Euclidean distance. • Machine learning. Machine learning (ML) with its distinction in super- and unsupervised learning techniques has many applica- tions and is a very active field of research. Therefore, protecting the privacy of training data or ML inputs is also an active re- search area. Supervised machine learning – Neural networks. One of the most powerful ML techniques are Convolutional Neural Networks (CNNs). Because of this, many dedicated protocols for private data classification using CNNs have been proposed, e.g., [Gil+16; Liu+17; Ria+18]. Computationally, CNNs consists of multiple matrix multiplications concatenated with convolutions and non- linear activation functions. 2.5 benchmarking applications for mpc compilers 29 Unsupervised machine learning – k-means. Clustering is another ML task, frequently used to identify centroids in unstructured data. One of the most well known clustering algorithms is k- means, and multiple works proposed dedicated privacy-preserv- ing k-means protocols, e.g., [JW05; VC03]. The k-means algo- rithm can be parametrized according to the number of data points, the number of clusters, and the number of iterations to perform the algorithm. • Median computation. The secure computation of the median is required in privacy preserving statistics. It is an interesting task for compilation, as it can be implemented using a sorting al- gorithm, because a sorted array allows a direct access to the median element. The compiled circuit depends on the used sort- ing algorithm. In this book, we evaluate Bubble and Merge sort. Moreover, the task can be parameterized according to the num- ber of elements and their bit-width. implementation differences . We remark that for a fair com- parison between multiple compilers that compile the same function- ality, it has to be made sure that the same algorithm as well as the same abstraction level of the source code is used. To illustrate this thought we give three example implementations for computing the Hamming distance, which produce circuits of largely varying sizes (cf., Section 4.5). The distance can be computed by XOR-ing the input bit strings and then counting the number of bits. An exemplary im- plementation is given in Listing 1 that computes the distance between two bit-strings of length 160 bit, which are split over five unsigned in- tegers. In Line 7 the number of ones in a string of 32 bit is computed. This task is also known as population count. In the following para- graphs we describe three different implementations. 1 #define N 5 2 void hamming () { 3 unsigned INPUT_A_x[N]; 4 unsigned INPUT_B_y[N]; 5 unsigned res = 0; 6 for(int i = 0; i < N; i++) { 7 res += count_naive32(INPUT_A_x[i]^ INPUT_B_y[i]); 8 } 9 unsigned OUTPUT_res = res; 10 } Listing 1: Hamming distance computation between two bit strings. The first implementation is given in Listing 2. In this naïve ap- proach, each bit is extracted using bit shifts and the logical AND 2.5 benchmarking applications for mpc compilers 30 operator before being aggregated. 1 unsigned char count_naive32(unsigned y) { 2 unsigned char m = 0; 3 4 for(unsigned i = 0; i < 32; i++) { 5 m += (y & (1 << i)) >> i; 6 } 7 8 return m; 9 } Listing 2: Counting bits using a naïve bit-by-bit comparison approach. A variant of this implementation is given in Listing 3. Here, the bit string of length 32 is first split into chunks of 8 bit (unsigned char). The ones set in each chunk are then counted as described above. 1 unsigned char count_naive8(unsigned char c) { 2 unsigned char m = 0; 3 for(int i = 0; i < 8; i++) { 4 m += (c & (1 << i)) >> i; 5 } 6 return m; 7 } 8 9 unsigned char count_tree32(unsigned y) { 10 unsigned char m0 = y & 0xFF; 11 unsigned char m1 = (y & 0xFF00) >> 8; 12 unsigned char m2 = (y & 0xFF0000) >> 16; 13 unsigned char m3 = (y & 0xFF000000) >> 24; 14 15 return count_naive8(m0) + count_naive8(m1) + \ 16 count_naive8(m2) + count_naive8(m3); 17 } Listing 3: Counting bits over unsigned chars in a tree based manner. Finally, in Listing 4 a variant optimized for a CPU with 32 bit regis- ters and slow multiplication is given. This implementation uses only 14 CPU instructions. 1 unsigned count_reg32(unsigned y) { 2 unsigned x = y - ((y >> 1) & 0x55555555); 3 x = (x & 0x33333333) + ((x >> 2) & 0x33333333); 4 x = (x + (x >> 4)) & 0x0f0f0f0f; 5 x += x >> 8; 6 x += x >> 16; 7 return x; 8 } Listing 4: Counting bits, optimized for a 32 bit register machine. 2.5 benchmarking applications for mpc compilers 31 Fundamental implementation differences will inevitably lead to dif- ferent compilation results. Therefore, all benchmarked functionalities in this book are implemented using the same algorithms, data struc- tures and data types when using them for the comparison with other compilers even when using different input languages. 3 F R O M A N S I C C O D E T O B O O L E A N C I R C U I T S Summary: The practicality of Secure Multi-party Computation (MPC) is hindered by the difficulty to implement applications on top of the under- lying cryptographic protocols. This is because the manual construction of efficient applications, which need to be represented as Boolean or arithmetic circuits, is a complex, error-prone, and time-consuming task. For the prac- tical use of MPC, and thus the development of further privacy-enhancing technologies, compilers supporting common programming languages are desirable to provide developers an accessible interface to MPC. In this chapter, we describe the translation approach that is followed by the Boolean circuit compiler CBMC-GC [Hol+12], which is based on the model checker CBMC [CKL04], and which compiles ANSI C into circuits ready for their use in MPC protocols. Throughout this thesis, we will extend the here presented tool-chain to create optimized circuits for MPC. Remarks: This chapter is based in parts on Chapter 3 of our book – “Compi- lation for Secure Multi-Party Computation”, Niklas Büscher and Stefan Katzen- beisser, Springer Briefs in Computer Science 2017, ISBN 978-3-319-67521-3. 3.1 motivation and overview When visualizing an MPC protocol as a hardware architecture that can execute programs (functionalities), then the program descriptions are Boolean circuits (e.g., for Yao’s Garbled Circuits [Yao86]) or Arith- metic circuits (e.g., for the SPDZ protocol [Dam+12]). Consequently, a compiler for MPC protocols compiles a functionality written in a high-level language into an (optimized) circuit representation. The core focus of this thesis is the creation of optimized Boolean circuits that also allow developers without a professional background in com- puter security and hardware design to create efficient applications for MPC. All techniques presented in