Weber, Nicolas (2017)
GPU Array Access Auto-Tuning.
Technische Universität Darmstadt
Ph.D. Thesis, Primary publication
|
Text
main.pdf - Accepted Version Copyright Information: CC BY-NC-ND 4.0 International - Creative Commons, Attribution NonCommercial, NoDerivs. Download (10MB) | Preview |
Item Type: | Ph.D. Thesis | ||||
---|---|---|---|---|---|
Type of entry: | Primary publication | ||||
Title: | GPU Array Access Auto-Tuning | ||||
Language: | English | ||||
Referees: | Goesele, Prof. Dr. Michael ; Gerndt, Prof. Dr. Michael | ||||
Date: | 2017 | ||||
Place of Publication: | Darmstadt | ||||
Date of oral examination: | 19 June 2017 | ||||
Abstract: | GPUs have been used for years in compute intensive applications. Their massive parallel processing capabilities can speedup calculations significantly. However, to leverage this speedup it is necessary to rethink and develop new algorithms that allow parallel processing. These algorithms are only one piece to achieve high performance. Nearly as important as suitable algorithms is the actual implementation and the usage of special hardware features such as intra-warp communication, shared memory, caches, and memory access patterns. Optimizing these factors is usually a time consuming task that requires deep understanding of the algorithms and the underlying hardware. Unlike CPUs, the internal structure of GPUs has changed significantly and will likely change even more over the years. Therefore it does not suffice to optimize the code once during the development, but it has to be optimized for each new GPU generation that is released. To efficiently (re-)optimize code towards the underlying hardware, auto-tuning tools have been developed that perform these optimizations automatically, taking this burden from the programmer. In particular, NVIDIA -- the leading manufacturer for GPUs today -- applied significant changes to the memory hierarchy over the last four hardware generations. This makes the memory hierarchy an attractive objective for an auto-tuner. In this thesis we introduce the MATOG auto-tuner that automatically optimizes array access for NVIDIA CUDA applications. In order to achieve these optimizations, MATOG has to analyze the application to determine optimal parameter values. The analysis relies on empirical profiling combined with a prediction method and a data post-processing step. This allows to find nearly optimal parameter values in a minimal amount of time. Further, MATOG is able to automatically detect varying application workloads and can apply different optimization parameter settings at runtime. To show MATOG's capabilities, we evaluated it on a variety of different applications, ranging from simple algorithms up to complex applications on the last four hardware generations, with a total of 14 GPUs. MATOG is able to achieve equal or even better performance than hand-optimized code. Further, it is able to provide performance portability across different GPU types (low-, mid-, high-end and HPC) and generations. In some cases it is able to exceed the performance of hand-crafted code that has been specifically optimized for the tested GPU by dynamically changing data layouts throughout the execution. |
||||
Alternative Abstract: |
|
||||
URN: | urn:nbn:de:tuda-tuprints-65072 | ||||
Classification DDC: | 000 Generalities, computers, information > 004 Computer science | ||||
Divisions: | 20 Department of Computer Science > Graphics, Capture and Massively Parallel Computing Exzellenzinitiative > Graduate Schools > Graduate School of Computational Engineering (CE) |
||||
Date Deposited: | 01 Aug 2017 08:42 | ||||
Last Modified: | 09 Jul 2020 01:44 | ||||
URI: | https://tuprints.ulb.tu-darmstadt.de/id/eprint/6507 | ||||
PPN: | 415280397 | ||||
Export: |
View Item |