Publication

Extending, improving and optimizing Marrow

Bibliographic Details
Summary:	Most computers nowadays are heterogeneous, composed of a Central Processing Unit (CPU) and one or more Graphics Processing Units (GPUs). To harness the power of each of these devices, developers need experience with low-level toolchains such as CUDA and expert knowledge of the underlying architecture. However, these low-level approaches add several layers of complexity to the task at hand. High-level programming models such as the Marrow framework ease the arduous task of offloading computation to accelerator devices. Usually, they do so by abstracting memory management and implicitly parallelizing workloads through high-level constructs exposed to the programmer. However, these frameworks come with several limitations, and it isn’t always possible to maximize performance, as this might require writing specific code to map computation to a device. In this thesis we ported several programs implemented in other frameworks and platforms to the Marrow framework, which allowed us to better understand its limitations and to further extend and optimize the framework. We used an iterative process: first we analyzed how a given program was implemented in a given framework; then we investigated whether the program could be implemented in Marrow’s current state. If not, we extended Marrow by improving its features to make the implementation possible. We then implemented and benchmarked the program, and used the performance comparisons as a tool to further optimize the framework. Through this work we implemented several applications with the Marrow framework, which led us to add several new features, such as the inclusive scan, a matrix multiplication operation, and the zip and unzip functions, and to significantly improve the flexibility of Marrow’s constructs, such as its exclusive scan. Furthermore, we gained a better understanding of Marrow’s performance bottlenecks through the Marrow profiler, and optimized asynchronous memory transfers.
Main Authors:	Cardoso, Francisco José Sampaio de Freitas
Subject:	Heterogeneous Computing; Marrow; CUDA; GPU
Year:	2022
Country:	Portugal
Document type:	master thesis
Access type:	open access
Associated institution:	Universidade Nova de Lisboa
Language:	English
Origin:	Repositório Institucional da UNL
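
The summary in this record names the inclusive scan, the exclusive scan, and the zip and unzip functions among the constructs added or improved. As a rough illustration of what those operations compute, here is a minimal sequential sketch in plain Python. It is not Marrow's API and says nothing about how Marrow runs these operations on a GPU; the function names (`inclusive_scan`, `exclusive_scan`, `zip_pairs`, `unzip`) are hypothetical and chosen only for this example.

```python
from itertools import accumulate
from operator import add


def inclusive_scan(values, op=add):
    """Element i of the result combines values[0] through values[i]."""
    return list(accumulate(values, op))


def exclusive_scan(values, identity=0, op=add):
    """Element i of the result combines values[0] through values[i-1].

    The first element is the identity of the operator (0 for addition).
    """
    values = list(values)
    if not values:
        return []
    # Start from the identity and leave out the last input element,
    # so the output has the same length as the input.
    return list(accumulate(values[:-1], op, initial=identity))


def zip_pairs(xs, ys):
    """Combine two sequences element-wise into a list of pairs."""
    return list(zip(xs, ys))


def unzip(pairs):
    """Split a sequence of pairs back into two lists."""
    pairs = list(pairs)
    if not pairs:
        return [], []
    firsts, seconds = zip(*pairs)
    return list(firsts), list(seconds)
```

For the input `[1, 2, 3, 4]` with addition, the inclusive scan is `[1, 3, 6, 10]` and the exclusive scan is `[0, 1, 3, 6]`; `unzip(zip_pairs(xs, ys))` returns the original two lists when they have equal length.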
