PhD: Device-wide Barrier Synchronisation on Graphics Processing Units

Abstract

GPUs are parallel devices that are able to run thousands of independent threads concurrently. Traditional GPU programs are data-parallel, requiring little to no communication (i.e. synchronisation) between threads. However, classical concurrency in the context of CPUs often exploits synchronisation idioms that are not supported on GPUs. By studying such idioms on GPUs, with an aim to facilitate them in a portable way, we can enable a wider space of GPU applications. While the breadth of this thesis extends to many aspects of GPU systems, the common thread throughout is the global barrier: an execution barrier that synchronises all threads executing a GPU application. The idea of such a barrier might seem straightforward; however, this investigation reveals many challenges and insights. In particular, this thesis includes the following studies:

Execution models: while a general global barrier can deadlock on GPUs due to starvation, we show that the scheduling guarantees of current GPUs can be used to dynamically create an execution environment that allows a safe and portable global barrier across a subset of the threads.

Application optimisations: we examine a set of GPU optimisations targeted at graph applications, including one optimisation enabled by our global barrier. We show that these optimisations can provide substantial performance improvements; for example, the barrier optimisation achieves up to a 14x and 16x speedup on AMD and Intel GPUs, respectively. We also investigate the performance portability of these optimisations, as their utility varies across input, application, and architecture.

Multitasking: because many GPUs do not support preemption, long-running GPU compute applications (e.g. applications that use the global barrier) may block other GPU functions, including graphics. We propose a simple cooperative multitasking scheme that allows graphics tasks to meet their deadlines with reasonable overheads.
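
To make the idea of a global (inter-workgroup) barrier concrete, the following is a minimal CUDA sketch of a sense-reversing barrier across thread blocks. It is illustrative only and is not the protocol developed in the thesis: the names (global_barrier, g_count, g_sense, example_kernel) are hypothetical, the memory fences are best-effort rather than formally argued, and the code is only safe when every participating thread block is co-resident on the GPU, which is precisely the scheduling concern the execution-model study above addresses.

// Minimal, illustrative sketch of a sense-reversing inter-workgroup barrier.
// Only safe if all participating thread blocks are simultaneously resident
// on the GPU; otherwise a spinning block can starve the blocks it waits for.
#include <cuda_runtime.h>

__device__ unsigned int g_count = 0;            // blocks arrived at the barrier
__device__ volatile unsigned int g_sense = 0;   // phase flag, flipped each barrier

__device__ void global_barrier(unsigned int num_blocks, unsigned int *block_sense)
{
    __syncthreads();                            // gather all threads of this block
    if (threadIdx.x == 0) {
        unsigned int my_sense = 1u - *block_sense;
        *block_sense = my_sense;                // flip this block's local phase
        __threadfence();                        // best-effort: publish prior writes
        if (atomicAdd(&g_count, 1u) == num_blocks - 1u) {
            g_count = 0;                        // last block resets the counter...
            __threadfence();
            g_sense = my_sense;                 // ...then releases the other blocks
        } else {
            while (g_sense != my_sense) { }     // spin until released
            __threadfence();
        }
    }
    __syncthreads();                            // release the rest of this block
}

__global__ void example_kernel(void)
{
    __shared__ unsigned int block_sense;        // per-block phase, reused across calls
    if (threadIdx.x == 0) block_sense = 0;
    __syncthreads();

    // ... phase 1 work ...
    global_barrier(gridDim.x, &block_sense);
    // ... phase 2 work, which may now read phase-1 results of other blocks ...
}

The usage pattern in example_kernel is the typical one: the kernel is launched with no more thread blocks than can be co-resident, and the barrier separates phases that communicate through global memory.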