Razavi, Kamran (2024)
Resource Efficient Inference Serving With SLO Guarantee.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00028615
Ph.D. Thesis, Primary publication, Publisher's Version
Full text: KamranRazavi_Diss.pdf (PDF, 7 MB). Copyright Information: In Copyright.
Item Type: Ph.D. Thesis
Type of entry: Primary publication
Title: Resource Efficient Inference Serving With SLO Guarantee
Language: English
Referees: Wang, Prof. Dr. Lin ; Mühlhäuser, Prof. Dr. Max ; Hollick, Prof. Dr. Matthias ; Thies, Prof. Dr. Justus ; Binnig, Prof. Dr. Carsten
Date: 6 November 2024
Place of Publication: Darmstadt
Collation: xxii, 242 pages
Date of oral examination: 23 September 2024
DOI: 10.26083/tuprints-00028615
Abstract: Deep Learning (DL) has gained popularity in various online applications, including intelligent personal assistants, image and speech recognition, and question interpretation. Serving applications composed of DL models in a resource-efficient manner is challenging because of the computational intensity of DL models with multiple layers and large numbers of parameters, stringent Service Level Objectives (SLOs) such as end-to-end serving latency requirements, and the data dependencies between the DL models. Efficiently managing computing resources becomes even more challenging in the presence of dynamic workloads. The primary focus of this thesis is the following overarching question: how can we design, implement, and deploy resource-efficient DL-based inference serving systems while ensuring a guaranteed SLO under both predictable and unpredictable workloads? In response to this question, the thesis presents four contributions aimed at enhancing the resource efficiency of DL-based applications and enabling the serving of DL models on non-traditional computing resources, such as programmable switches.

First, we improve the efficiency of multi-model serving systems by increasing system utilization in a work named FA2: Fast, Accurate Autoscaling for Serving Deep Learning Inference with SLA Guarantees. FA2 introduces a graph-based model that captures the joint configuration of resource allocation and batch size along with the data dependencies in DL inference serving systems, and presents a horizontal-scaling-based resource allocation algorithm that leverages graph transformation and dynamic programming.

Second, to guarantee the SLO while accounting for dynamic conditions on the communication network between the user device and the serving system, we exploit the responsiveness of vertical scaling and propose a work titled Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling. Sponge employs in-place vertical scaling to instantaneously adjust computing resources to demand, strategically reorders incoming requests to prioritize those with the most constrained remaining latency budgets, and uses dynamic batching to increase system utilization. Furthermore, to enhance user satisfaction, we introduce model variants (different DL models with varying cost, accuracy, and latency properties for the same request) by using vertical scaling and changing both the DL model and the computing resources in a work named IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency. IPA defines a per-model accuracy metric and multiplies it across the models of a pipeline to approximate the end-to-end accuracy of the application.
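A minimal illustration of this multiplicative approximation (the notation and the numbers are ours, not taken from this record): if a pipeline chains n models and the variant selected for model i has per-model accuracy a_i, the end-to-end accuracy is approximated as

    A_{\mathrm{e2e}} \approx \prod_{i=1}^{n} a_i

so, for example, a two-model pipeline with per-model accuracies 0.92 and 0.88 yields an approximate end-to-end accuracy of 0.92 × 0.88 ≈ 0.81, and swapping one stage for a cheaper, less accurate variant lowers the product accordingly.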
Third, we explore how to achieve both responsiveness and resource efficiency. When the workload becomes unpredictable, SLO violations can increase in a DL inference serving system. In-place vertical scaling can respond quickly to dramatic workload changes, but it is not as resource-efficient as horizontal scaling. We therefore explore using horizontal and vertical scaling simultaneously in a paper called Biscale: Integrating Horizontal and Vertical Scaling for Inference Serving Systems. In Biscale, we analyze the effect of both scaling mechanisms on serving speedup using Amdahl's law (see the sketch following this abstract) and address three key questions: why, how, and when to switch between horizontal and vertical scaling to guarantee the SLO and optimize the use of computing resources.

Finally, we investigate the feasibility of serving DL models on programmable network devices for network security tasks such as intrusion detection. Since programmable network devices are already responsible for moving the data, including inference requests, serving inference requests directly on them accelerates the computation and eliminates the need for external computing resources. To achieve this, we design a novel method that divides a DL model into multiple parts and assigns each part, along with its specific computing requirements, to a set of programmable network devices. We further train an intrusion detection DL model and apply this approach in a paper titled NetNN: Neural Intrusion Detection System in Programmable Networks.

In conclusion, this thesis comprehensively addresses the challenges raised by the main research question, providing solutions that enhance the resource efficiency of DL inference serving systems while ensuring stringent SLO guarantees under various workloads.
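For reference, the standard form of Amdahl's law underlying the Biscale speedup analysis mentioned above (the mapping of the symbols to inference serving is our reading and is not spelled out in this abstract): if a fraction p of a request's processing benefits from k times more resources while the remaining 1 - p does not, the achievable serving speedup is

    S(k) = \frac{1}{(1 - p) + p/k}

which is bounded above by 1/(1 - p) no matter how far a single instance is scaled up; a bound of this kind is one reason to combine vertical scaling with horizontal scaling.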
Status: Publisher's Version
URN: urn:nbn:de:tuda-tuprints-286159
Classification DDC: 000 Generalities, computers, information > 004 Computer science
Divisions: 20 Department of Computer Science > Telecooperation
TU-Projects: DFG|SFB1053|SFB1053 TPA01 Mühlhä ; DFG|SFB1053|SFB1053 TPB02 Mühlhä
Date Deposited: 06 Nov 2024 14:00
Last Modified: 08 Nov 2024 07:29
URI: https://tuprints.ulb.tu-darmstadt.de/id/eprint/28615
PPN: 523280300