Norouzi, Mohammad (2022)
Enhancing the Speed and Automation of Assisted Parallelization.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00022884
Ph.D. Thesis, Primary publication, Publisher's Version
File: Mohammad_Norouzi_PhD_Dissertation.pdf (3 MB)
License: CC BY-SA 4.0 International (Creative Commons Attribution-ShareAlike)
| Item Type: | Ph.D. Thesis |
|---|---|
| Type of entry: | Primary publication |
| Title: | Enhancing the Speed and Automation of Assisted Parallelization |
| Language: | English |
| Referees: | Wolf, Prof. Dr. Felix ; Haehnle, Prof. Dr. Reiner ; Jannesari, Prof. Dr. Ali |
| Date: | 2022 |
| Place of Publication: | Darmstadt |
| Collation: | 145 pages in various paginations |
| Date of oral examination: | 2 December 2021 |
| DOI: | 10.26083/tuprints-00022884 |

Abstract:

Parallelization is a technique that boosts the performance of a program beyond optimizations of the sequential algorithm. Applying it requires deep program knowledge and is usually complex and time-consuming. Software tools have been proposed to discover parallelism opportunities. Tools relying on static analysis follow a conservative path and tend to miss many opportunities, whereas dynamic analysis suffers from vast runtime overhead, often resulting in a slowdown of 100x. In this dissertation, we present two methods that help programmers parallelize programs. We abandon the idea of fully automated parallelization and instead point programmers to potential parallelism opportunities in the source code. Our first method detects data dependences using a hybrid approach, mitigating the limitations of both static and dynamic analysis. Our second method exploits the identified dependences to provide practical hints for parallelizing a sequential program.

Data dependence analysis can be performed statically or dynamically. Static analysis is fast but overestimates the number of data dependences. Dynamic analysis records the dependences that actually occur during program execution but suffers from high runtime overhead and is also input sensitive. We have proposed a hybrid approach that considerably reduces the overhead and does not overestimate data dependences. Our approach first detects memory-access instructions that create statically identifiable data dependences. Then, it excludes these instructions from instrumentation, avoiding their associated overhead at runtime. We have implemented our approach in DiscoPoP, a parallelism discovery tool, and evaluated it with 49 benchmarks from three benchmark suites (Polybench, NPB, and BOTS) and two simulation applications (EOS-MBPT and LULESH). The median reduction of the profiling time across all programs was 76%.

Additionally, we proposed a method that uses the identified dependences to recommend how to parallelize a program with OpenMP. OpenMP allows programmers to annotate code sections with parallelization constructs, but programming with OpenMP is not easy: programmers need to determine which construct to insert where in the source code to maximize performance and preserve correctness. Another task is classifying the variables inside the constructs according to their data-sharing semantics. Performing these tasks manually is complex and error-prone. We have proposed an approach that automates them. It receives as input parallel design patterns derived from the extracted data dependences and maps them to appropriate OpenMP constructs and source-code lines. Further, it classifies the variables within those constructs. After integrating our parallelization approach into DiscoPoP, we used it to parallelize our test programs and compared their execution times with those of the sequential versions. We observed a speedup of up to 1.35x for EOS-MBPT and 8x for LULESH. For the benchmarks, we further compared our parallelizations with those generated by three state-of-the-art parallelization tools: we produced faster code in most cases, with an average speedup relative to any of the three ranging from 1.8 to 2.7. Also, we automatically reclassified the variables of OpenMP programs parallelized manually or with the help of these tools, reducing their execution time by up to 29%.
Moreover, we found that, provided the target program is run with a range of representative inputs, the inherent input sensitivity of the dynamic dependence analysis does not make the resulting parallel programs harder to validate than those parallelized manually. Finally, our approach has been extended to suggest offloading suitable kernels onto the GPU using OpenMP.
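To make the hybrid dependence analysis concrete, the C sketch below (our own minimal illustration, not code from the thesis; the function names are hypothetical) contrasts a loop whose dependence static analysis can prove on its own with one that only dynamic analysis can resolve:

```c
/* Illustrative only: two loops that a hybrid dependence analysis
   would treat differently. */

/* The read of a[i - 1] and the write to a[i] form a loop-carried
   dependence that static analysis can prove from the affine array
   subscripts alone, so a hybrid profiler can exclude these memory
   accesses from instrumentation and save their runtime overhead. */
void statically_identifiable(double *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1.0;
}

/* Whether iterations conflict here depends on the runtime contents
   of idx, which static analysis cannot see; these accesses must
   remain instrumented and be resolved dynamically. */
void needs_profiling(double *a, const double *b, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        a[idx[i]] += b[i];
}
```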
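The second contribution maps parallel design patterns to OpenMP constructs and classifies variables by their data-sharing semantics. The sketch below, a hand-written example rather than actual DiscoPoP output, shows the kind of annotation such a tool would suggest for a reduction loop:

```c
/* A dot product: a loop for which a recommendation tool would insert
   a parallel-for construct and classify the variables. */
double dot(const double *x, const double *y, int n) {
    double sum = 0.0;
    /* x, y, and n are only read, so they are shared; the loop counter
       i is private by construction; sum matches a reduction pattern,
       so it is classified with a reduction clause. */
    #pragma omp parallel for default(none) shared(x, y, n) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Compiled with `-fopenmp` (GCC/Clang) the pragma takes effect; without it, the code still compiles and runs sequentially.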
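For the GPU extension mentioned at the end of the abstract, a kernel deemed suitable for offloading would be annotated with OpenMP target constructs. A minimal sketch under that assumption (again our example, not code from the thesis):

```c
/* SAXPY offloaded to the GPU: target teams distribute parallel for
   spreads the iterations across the device, and the map clauses copy
   x to the device and y both to and from it. */
void saxpy(float *y, const float *x, float a, int n) {
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```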
| Status: | Publisher's Version |
|---|---|
| URN: | urn:nbn:de:tuda-tuprints-228849 |
| Classification DDC: | 000 Generalities, computers, information > 004 Computer science |
| Divisions: | 20 Department of Computer Science > Parallel Programming |
| Date Deposited: | 24 Nov 2022 13:10 |
| Last Modified: | 25 Nov 2022 07:11 |
| URI: | https://tuprints.ulb.tu-darmstadt.de/id/eprint/22884 |
| PPN: | 502052678 |