## **GPU Concurrency**

# Weak Behaviours and Programming Assumptions

Jade Alglave<sup>1,2</sup>, Mark Batty<sup>3</sup>, Alastair F. Donaldson<sup>4</sup>, Ganesh Gopalakrishnan<sup>5</sup>, Jeroen Ketema<sup>4</sup>, Daniel Poetzl<sup>6</sup>, Tyler Sorensen<sup>2,5</sup> and John Wickerson<sup>4</sup>

<sup>1</sup>Microsoft Research Cambridge, <sup>2</sup>University College London, <sup>3</sup>University of Cambridge, <sup>4</sup>Imperial College London, <sup>5</sup>University of Utah, <sup>6</sup>University of Oxford

diy [1]

### Motivation

- Multicore systems (e.g. Nvidia GPUs) implement weak *memory models* [1]; i.e. executions that do not correspond to an interleaving of concurrent instructions are observable.
- Documentation for such behaviours is often sparse and written in prose, which is prone to misinterpretations and can lead to bugs in applications.

## Methodology

generates systematic families of litmus tests

#### **GPU** additions

concurrency and memory hierarchies (e.g. scopes)

We explore which weak behaviours are experimentally observable on GPU chips; we compare our results to GPU applications containing synchronisation idioms. Finally, we give a formal GPU model which is sound w.r.t. our experimental data.

| litmus [1] | generates and executes code of a litmus test    | stresses the system to<br>increase the likelihood<br>of weak behaviours |
|------------|-------------------------------------------------|-------------------------------------------------------------------------|
| herd [1]   | simulates a formal model<br>given as a cat file | cat file for Nvidia PTX                                                 |
| targets    | ARM, IBM PowerPC, and<br>Intel x86 CPUs         | AMD and Nvidia GPUs                                                     |

## Examples of observed behaviours









(d)

Pictures computed using an octree given in **GPU Computing** Gems: Jade Edition [2] on an Nvidia Tesla C2075. Errors in picture (a) are due to weak memory behaviours. Picture (b) is from code that has been experimentally fixed by us.

### Formal model

We developed a formal model given as a cat file [1] for GPUs which is sound for over **10,000** litmus tests run on **5** Nvidia chips over **3** architectures (Fermi, Kepler, Maxwell)

> 'global x=0 **final:** r0=1 ∧ r2=0 threads: intra-CTA

Pictures computed using a hash table in **CUDA by Example** [3] on an Nvidia Tesla C2075. Errors in picture (c) are due to weak memory behaviours. Picture (d) is from code that has been experimentally fixed by us. *Led to an official Nvidia erratum* [4].

## Supplementary material

**Paper:** GPU Concurrency: Weak Behaviours and Programming Assumptions Jade Alglave, Mark Batty, Alastair F. Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl, Tyler Sorensen and John Wickerson. ASPLOS '15.

### http://virginia.cs.ucl.ac.uk/sunflowers/asplos15

| a | st.cg [x],1 | c ld.cg r0,[y] |
|---|-------------|----------------|
|   | membar.cta  | membar.gl      |
| b | st.cg [y],1 | d ld.cg r2,[x] |

init:



#### http://youtu.be/3-Y8xLsqywY Video:

#### **Simulator:** <u>http://virginia.cs.ucl.ac.uk/herd-web/?book=ptx</u>

#### **References**:

Data:

[1] J. Alglave, L. Maranget, and M. Tautschnig. Herding cats: Modelling, simulation, testing, and data mining for weak memory. TOPLAS '14 [2] Wen-mei W. Hwu. GPU Computing Gems Jade Edition. Morgan Kaufmann Publishers Inc., 2011

[3] J. Sanders and E. Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison Wesley Professional, 2010

[4] <u>https://developer.nvidia.com/cuda-example-errata-page</u>



Support: EPSRC grants EP/{H005633, H008373, K008528, K011499, K039431}, EU FP7 project CARP (287767), NSF CCF Awards 1346756 and 1302449, and SRC project 2269.002