An Introduction to Using GPUs for Computation
==================================================================
Chris Paciorek, Statistical Computing Facility, Department of Statistics, UC Berkeley
Presented: April 25, 2014
Last Revised: April 30, 2014
```{r setup, include=FALSE}
opts_chunk$set(cache = TRUE) # because the compilation takes time, let's cache it
```
# 0) This Tutorial
Materials for this tutorial, including the R Markdown file that was used to create this document, are available on github at `https://github.com/berkeley-scf/gpu-workshop-2014`. You can download the files by doing a git clone:
```{clone, eval=FALSE, engine='bash'}
git clone https://github.com/berkeley-scf/gpu-workshop-2014
```
To create this HTML document, simply compile the corresponding R Markdown file in R:
```{rmd-compile, eval=FALSE}
library(knitr)
knit2html('gpu.Rmd')
```
# 1) Introduction
### 1.1) Overview
GPUs (Graphics Processing Units) are processing units originally designed for rendering graphics on a computer quickly. This is done by having a large number of simple processing units for massively parallel calculation. The idea of general purpose GPU (GPGPU) computing is to exploit this capability for general computation.
We'll see some high-level and somewhat lower-level ways to program calculations for implementation on the GPU. The basic context of GPU programming is "data parallelism", in which the same calculation is done on many pieces of data. This could be a mathematical calculation on millions of entries in a vector or a simulation study with many independent replications. Some examples of data parallelism include matrix multiplication (doing the multiplication task on many separate matrix elements) and numerical integration (computing a numerical estimate of the piecewise integral on many intervals/regions), as well as standard statistical calculations such as simulation studies, bootstrapping, random forests, etc. This kind of computation also goes by the name `SIMD` (single instruction, multiple data).
### 1.2) Hardware
Two of the main suppliers of GPUs are NVIDIA and AMD. `CUDA` is a platform for programming NVIDIA GPUs specifically; it allows you to write C/C++/Fortran code for execution on the GPU. `OpenCL` is an alternative that works with a broader variety of GPUs. However, CUDA is quite popular, and since Amazon EC2 provides NVIDIA GPUs, we'll use CUDA here.
GPUs have many processing units but limited memory. Also, they can only use data in their own memory, not in the CPU's memory, so one must transfer data back and forth between the CPU (the `host`) and the GPU (the `device`). This copying can, in some computations, constitute a very large fraction of the overall computation. So it is best to create the data and/or leave the data (for subsequent calculations) on the GPU when possible and to limit transfers.
The `g2.2xlarge` Amazon EC2 instance type has 1536 cores and 4 GB of GPU memory. It is of the `Kepler` architecture (3rd generation). The 2nd generation was `Fermi` and the 1st was `Tesla`. (Note, however, that `Tesla` is also used by NVIDIA to refer to a product line of chips; for example, the `cg1.4xlarge` Amazon EC2 instances have `NVIDIA Tesla M2050` GPUs ("Fermi" GF100), which are of the `Fermi` architecture.) Originally GPUs supported only single precision (i.e., `float` calculations), but fortunately they now support double precision operations, and all of the examples here will use doubles to avoid potential numerical issues, in particular with linear algebra calculations.
#### Demonstration Using Amazon's EC2
Since the SCF does not have any machines with a GPU, we'll need to use a cloud-based machine. Amazon's EC2 provides two types of GPU instances: `g2.2xlarge` and `cg1.4xlarge`. The first is more recent, though in some of my tests cg1.4xlarge was actually faster. However, given that g2.2xlarge costs 65 cents per hour and cg1.4xlarge more than $2 per hour, we'll use g2.2xlarge.
I've created an Amazon machine image (an AMI) that is the binary representation of the Linux Ubuntu operating system for a machine with support for GPU calculations. The AMI contains the following software and packages: R and RCUDA, Python and PyCUDA, CUDA, and MAGMA. In other respects the AMI is similar to the SCF and EML Linux machines but with a reduced set of software.
Based on this AMI I've started a virtual machine (VM) that we can login to (see below for instructions) via SSH, just like any SCF/EML Linux server.
If you are willing to pay Amazon and have an account, you can start a VM (in the Oregon [us-west-2] region) using the SCF AMI by searching "Public Images" at the [EC2 console](https://console.aws.amazon.com/ec2/v2/home?region=us-west-2#Images:) for `scf-gpu_0.3`. Then just launch a VM, selecting `g2.2xlarge` under the `GPU instances` tab. Alternatively, if you are using StarCluster (e.g., [this tutorial](http://statistics.berkeley.edu/computing/cloud-computation) provides some info on using StarCluster with EC2 to start up VMs or clusters of VMs), you can start a VM using the SCF AMI by setting the following in the StarCluster `config` file:
```{starcluster-config, eval=FALSE, engine='bash'}
AWS_REGION_NAME = us-west-2
AWS_REGION_HOST = ec2.us-west-2.amazonaws.com
NODE_IMAGE_ID = ami-b0374280
NODE_INSTANCE_TYPE = g2.2xlarge
```
Note that the EML (Economics) has a GPU on one of the EML Linux servers that EML users can access. If this is of interest to you, email consult@econ.berkeley.edu, and I will work to get it set up analogously to the Amazon VM and to help you get started.
And note that Biostatistics has a GPU on one of its servers. Talk to Burke for more information.
#### The Amazon VM
I'll start up the Amazon VM, calling it `gpuvm`, and ssh to it, using my own StarCluster config file:
```{start-VM, engine='bash'}
starcluster start -c gpu gpuvm
starcluster sshmaster -u paciorek gpuvm
```
I also need to make sure that CUDA-related executables are in my path (they should already be set up for the `ubuntu` default user):
```{path, engine='bash'}
export PATH=${PATH}:/usr/local/cuda/bin
echo "" >> ~/.bashrc
echo "export PATH=${PATH}:/usr/local/cuda/bin" >> ~/.bashrc
echo "" >> ~/.bashrc
echo "alias gtop=\"nvidia-smi -q -g 0 -d UTILIZATION -l 1\"" >> ~/.bashrc
```
For the moment, you can connect to the Amazon VM I am using yourself. Here's what you need to do.
* copy the ssh key file `gpu_rsa` that SCF provided access to (via email) to your computer; on a UNIX-like machine (including Macs), put it in `~/.ssh`
* open a terminal window on a UNIX-like machine (you may also be able to ssh via PuTTY or the like if you can point it to the key file you just copied to your machine) and ssh to the VM as follows, using the IP info provided by SCF (via email):
```{ssh, engine='bash'}
export ip=VALUE_OBTAINED_FROM_SCF
ssh -i ~/.ssh/gpu_rsa ubuntu@${ip}.us-west-2.compute.amazonaws.com
```
* since multiple people are sharing this VM and are all logging in as the `ubuntu` user, please make a directory `~/YourUserName` (i.e., `/home/ubuntu/YourUserName`) and only work within that directory
#### Observing Performance on the GPU
The following command will allow you to see some information analogous to `top` on the CPU.
```{gtop, engine='bash'}
gtop
```
Here's some example output when the GPU is idle:
```{gtop output, engine='bash', eval=FALSE}
==============NVSMI LOG==============
Timestamp : Mon Apr 7 21:15:39 2014
Driver Version : 319.37
Attached GPUs : 1
GPU 0000:00:03.0
Utilization
Gpu : 0 %
Memory : 0 %
```
### 1.3) Software Tools
Here are some of the useful software tools for doing computations on the GPU.
* CUDA - platform for programming on an NVIDIA GPU using C/C++/Fortran code
* CUBLAS - a BLAS implementation for matrix-vector calculations on an NVIDIA GPU
* CURAND - random number generation on an NVIDIA GPU
* MAGMA - a package for combined CPU-GPU linear algebra, intended to be analogous to LAPACK + BLAS
* RCUDA - an R package providing a front-end for CUDA
* R's magma package - a front-end for MAGMA
* PyCUDA - a Python package providing a front-end for CUDA
Note that RCUDA is still in development and is on GitHub, but it should be high quality as it is developed by Duncan Temple Lang at UC Davis.
We'll see all of these in action.
There are also:
* OpenCL - an alternative to CUDA that can also be used with non-NVIDIA GPUs
* PyOpenCL
* R's OpenCL package
#### A Note on Synchronization
Note that in the various examples when I want to assess computational time, I make sure to synchronize the GPU via an appropriate function call. This ensures that all of the kernels have finished their calculations before I mark the end of the time interval. In general a function call to do a calculation on the GPU will simply start the calculation and then return, with the calculation continuing on the GPU.
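To make this concrete, here's a minimal timing sketch in CUDA C (my own illustration, not code from the workshop examples; `myKernel`, `gridDims`, `blockDims`, `d_x`, and `n` are hypothetical placeholders). Without the call to `cudaDeviceSynchronize()`, the elapsed time would mostly reflect the asynchronous kernel launch rather than the computation itself.
```c
#include <sys/time.h>

// sketch: time a kernel launch correctly by synchronizing before stopping the clock
struct timeval t0, t1;
gettimeofday(&t0, NULL);
myKernel<<<gridDims, blockDims>>>(d_x, n);   // returns immediately; the kernel runs asynchronously
cudaDeviceSynchronize();                     // block until all work on the GPU has finished
gettimeofday(&t1, NULL);
double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
```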
# 2) Using Kernels for Parallel Computation
Kernels are functions that encode the core computational operations done on individual pieces of data. The basic mode of operation in this Section will be to write a kernel and then call the kernel on all the elements of a data object via C, R, or Python code. We'll need to pass the data from the CPU to the GPU and do the same in reverse to get the result. We'll also need to allocate memory on the GPU. However in some cases the transfer and allocation will be done automatically behind the scenes.
A note on the speed comparisons in this section: these compare a fully serial CPU calculation on a single core to calculation on the GPU. On a multicore machine, we could speed up the CPU calculation by writing code to parallelize it (e.g., via threading in C/OpenMP or various parallelization tools in R or Python).
See my comments in the last Section regarding some tips and references that may enable you to get more impressive speedups than I show in the demos here.
### 2.1) Background
#### Threads and Grids
Each individual computation or series of computations on the GPU is done in a thread. Threads are organized into blocks and blocks of threads are organized in a grid. The blocks and grids can be 1-, 2-, or 3-dimensional. E.g., you might have a 1-d block of 500 threads, with a grid of 3 x 3 such blocks, for a total of $500 \times 9 = 4500$ threads. The choice of the grid/block arrangement can affect efficiency. I can't provide much guidance on that so you'd need to experiment or do some additional research. For our purposes, we'll often use a 2-d grid of 1-d blocks. In general you'd want each independent calculation done in a separate thread, though as we'll see in Section 3 on simulation, one might want to do a sequence of calculations on each thread. In general, you'll want to pipeline together multiple operations within a computation to avoid copying from CPU to GPU and back. Alternatively, this can be done by keeping the data on the GPU and calling a second kernel.
Threads are quick to start, and to get efficiency you want to have thousands of threads to exploit the parallelism of the GPU hardware. In general your calculations will have more threads than GPU cores.
This can all get quite complicated, with the possibility of communication amongst threads. We won't go into this, but threads within a block share memory (distinct from the main GPU memory) and can synchronize with each other, while threads in different blocks cannot cooperate. The Suchard et al. paper referenced in the last Section discusses how to get more efficiency by having threads within a block cooperate and access shared memory, which is much faster than accessing the main GPU (device) memory.
Executing the following code as root will create an executable that will show you details on the GPU, including the possible block and grid dimensions.
```{deviceQuery, engine='bash', eval=FALSE}
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
nvcc deviceQuery.cpp -I/usr/local/cuda/include \
-I/usr/local/cuda-5.5/samples/common/inc -o /usr/local/cuda/bin/deviceQuery
cd -
```
Now running `deviceQuery` will show output like the following (on the SCF VM):
```{deviceQuery output, engine='bash', eval=FALSE}
paciorek@master:~$ deviceQuery
deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GRID K520"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4096 MBytes (4294770688 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Clock rate: 797 MHz (0.80 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 0 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GRID K520
Result = PASS
```
In particular, note the number of CUDA cores, the GPU's memory, and the maximum number of threads per block as well as the maximum dimensions of thread blocks and grids.
#### GPU Calculations and Kernels
The basic series of operations is:
* allocate memory on the GPU
* transfer data from CPU to GPU
* launch the kernel to operate on the threads, with a given block/grid arrangement
* [optionally] launch another kernel, which can access data stored on the GPU, including results from the previous kernel
* transfer results back to CPU
Some of these steps are obscured because CUDA, RCUDA, and PyCUDA do some of the work for you (they are also obscured if you use pinned memory).
When we write a kernel, we will need some initial code that determines a unique ID for the thread, which allows the thread to access the appropriate part(s) of the data object(s) on the GPU. This is done using variables that CUDA provides containing the thread and block indices and the block and grid dimensions.
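Here is a minimal, self-contained sketch of the whole pattern in CUDA C (my own illustration, not one of the workshop's example files; the kernel `square` and all variable names are just for exposition). It shows the thread-ID computation inside the kernel and the allocate/transfer/launch/transfer-back sequence on the host.
```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// kernel: each thread squares one element of the vector
__global__ void square(double *x, int n) {
    // unique thread ID from the block/grid bookkeeping variables CUDA provides
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)   // guard: the grid may contain a few more threads than data elements
        x[idx] = x[idx] * x[idx];
}

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(double);
    double *h_x = (double *) malloc(bytes);
    for (int i = 0; i < n; i++) h_x[i] = (double) i;

    double *d_x;
    cudaMalloc((void **) &d_x, bytes);                      // allocate memory on the GPU
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);    // transfer from CPU to GPU

    int threadsPerBlock = 1024;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    square<<<blocks, threadsPerBlock>>>(d_x, n);            // launch the kernel
    cudaDeviceSynchronize();

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);    // transfer results back to CPU
    printf("x[3] = %f\n", h_x[3]);
    cudaFree(d_x); free(h_x);
    return 0;
}
```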
### 2.2) Using CUDA Directly
#### Hello, world
First let's see a 'Hello, World' example that illustrates blocks of threads and grids of blocks.
The idea is to have at least as many threads as the number of computations you are doing. Our kernel function contains the core calculation we want to do (in this case printing 'Hello world!') and code that figures out the unique ID of each thread, as this is often used within a calculation.
Here's the [example code (helloWorld.cu on the github repo)](https://github.com/berkeley-scf/gpu-workshop-2014/blob/master/helloWorld.cu).
In this case, compilation is as follows. Given the CUDA functionality used in the code (in particular the call to `printf` within the kernel), we need to specify compilation for a `compute capability` >= 2.0 (corresponding to the Fermi generation of NVIDIA GPUs). Our query above indicated that the GPU we are using has capability 3.0, so this is fine:
```{helloWorld-compile, engine='bash', eval=FALSE}
nvcc helloWorld.cu -arch=compute_20 -code=sm_20,compute_20 -o helloWorld
```
The result of this looks like:
```{helloWorld-output, eval=FALSE, engine='bash'}
Launching 20480 threads (N=20000)
Hello world! My block index is (3,0) [Grid dims=(20,2)], 3D-thread index within block=(448,0,0) => thread index=1984
Hello world! My block index is (3,0) [Grid dims=(20,2)], 3D-thread index within block=(449,0,0) => thread index=1985
Hello world! My block index is (3,0) [Grid dims=(20,2)], 3D-thread index within block=(450,0,0) => thread index=1986
....
Hello world! My block index is (19,1) [Grid dims=(20,2)], 3D-thread index within block=(220,0,0) => thread index=20188
[### this thread would not be used for N=20000 ###]
Hello world! My block index is (19,1) [Grid dims=(20,2)], 3D-thread index within block=(221,0,0) => thread index=20189
[### this thread would not be used for N=20000 ###]
Hello world! My block index is (19,1) [Grid dims=(20,2)], 3D-thread index within block=(222,0,0) => thread index=20190
[### this thread would not be used for N=20000 ###]
Hello world! My block index is (19,1) [Grid dims=(20,2)], 3D-thread index within block=(223,0,0) => thread index=20191
[### this thread would not be used for N=20000 ###]
kernel launch success!
That's all!
```
Note that because the device-side `printf` buffer is limited, with this many threads we can't see the output for all of them; hence the `if` statement in the kernel code. It is possible to retrieve and change the limit using `cudaDeviceGetLimit()` and `cudaDeviceSetLimit()`.
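If you want to experiment with that limit, here's a sketch of the relevant calls (the 100 MB figure is just an arbitrary example):
```c
size_t limit;
cudaDeviceGetLimit(&limit, cudaLimitPrintfFifoSize);              // current device-side printf buffer size, in bytes
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 100 * 1024 * 1024);   // e.g., enlarge it to ~100 MB
```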
#### Example of a 'Real' Computation
Now let's see an example of a distributed calculation using CUDA code, including memory allocation on the GPU and transfer between the GPU and CPU. Our example will be computing the terms of an IID log-likelihood calculation. In this case we'll just use the normal density, but real applications would of course involve more complicated calculations.
Note that here, I'll use 1024 (the maximum based on `deviceQuery`) threads per block and then a grid (2-d for simplicity) sufficiently large so that we have at least as many threads as computational chunks.
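The grid-size arithmetic is roughly the following sketch (my reconstruction of the idea, not a verbatim excerpt from kernelExample.cu); with N = 2097152 it gives the 46 x 46 grid of 1024-thread blocks seen in the output below.
```c
int threadsPerBlock = 1024;                                      // maximum per deviceQuery
int totalBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // enough blocks to cover N values
int gridSide = (int) ceil(sqrt((double) totalBlocks));           // roughly square 2-d grid
dim3 gridDims(gridSide, gridSide);
dim3 blockDims(threadsPerBlock, 1, 1);
// total threads launched = gridSide * gridSide * threadsPerBlock >= N
```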
Here's the [code (kernelExample.cu on the github repo)](https://github.com/berkeley-scf/gpu-workshop-2014/blob/master/kernelExample.cu).
Compilation is as follows. We again need to specify a compute capability >= 2.0, in this case in order to do calculations with doubles rather than floats.
```{kernelExample-compile, engine='bash', eval=FALSE}
nvcc kernelExample.cu -arch=compute_20 -code=sm_20,compute_20 -o kernelExample
```
Here are some results:
```{kernelExample-output, eval=FALSE, engine='bash'}
====================================================
Grid dimension is 46 x 46
Launching 2166784 threads (N=2097152)
Input values: -0.658344 0.499804 -0.807257...
Memory Copy from Host to Device successful.
Memory Copy from Device to Host successful.
Output values: 0.321214 0.352100 0.288007...
Output values (CPU): 0.321214 0.352100 0.288007...
Timing results for n = 2097152
Transfer to GPU time: 0.008920
Calculation time (GPU): 0.001766
Calculation time (CPU): 0.070951
Transfer from GPU time: 0.001337
Freeing memory...
====================================================
...
====================================================
Grid dimension is 363 x 363
Launching 134931456 threads (N=134217728)
Input values: -0.658344 0.499804 -0.807257...
Memory Copy from Host to Device successful.
Memory Copy from Device to Host successful.
Output values: 0.321214 0.352100 0.288007...
Output values (CPU): 0.321214 0.352100 0.288007...
Timing results for n = 134217728
Transfer to GPU time: 0.556857
Calculation time (GPU): 0.110254
Calculation time (CPU): 4.605744
Transfer from GPU time: 0.068865
Freeing memory...
====================================================
```
We see that the time for transferring to and from (particularly to) the GPU exceeds the calculation time, reinforcing the idea of keeping data on the GPU when possible.
#### Using Pinned Memory
Here's some code where we use pinned memory that is 'mapped' to the GPU, such that the GPU directly accesses CPU memory. This can be advantageous if one exceeds the GPU's memory and, according to some sources, is best when you load the data only once. Another approach, using pinned but not mapped memory, allows for more efficient transfer but without the direct access from the GPU, with a hidden transfer done behind the scenes. This may be better if the data are loaded multiple times on the GPU.
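Here is a rough sketch of the mapped ('zero-copy') pinned-memory idea (my own illustration, not an excerpt from kernelExample-pinned.cu; `myKernel`, `gridDims`, `blockDims`, and `n` are placeholders):
```c
cudaSetDeviceFlags(cudaDeviceMapHost);                     // enable mapped memory; call before other CUDA work
double *h_x, *d_x;
cudaHostAlloc((void **) &h_x, n * sizeof(double), cudaHostAllocMapped);  // pinned host memory visible to the GPU
// ... fill h_x ...
cudaHostGetDevicePointer((void **) &d_x, h_x, 0);          // device-side pointer aliasing the same memory
myKernel<<<gridDims, blockDims>>>(d_x, n);                 // kernel reads/writes host memory directly
cudaDeviceSynchronize();                                   // results now visible in h_x; no cudaMemcpy needed
cudaFreeHost(h_x);
```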
Here's the [code (kernelExample-pinned.cu on the github repo)](https://github.com/berkeley-scf/gpu-workshop-2014/blob/master/kernelExample-pinned.cu).
Here are some results:
```{kernelExample-pinned-output, eval=FALSE, engine='bash'}
====================================================
Grid dimension is 46 x 46
Launching 2166784 threads (N=2097152)
Input values: -0.658344 0.499804 -0.807257...
Output values: 0.321214 0.352100 0.288007...
Output values (CPU): 0.321214 0.352100 0.288007...
Timing results for n = 2097152
Calculation time (GPU): 0.002080
Calculation time (CPU): 0.071038
Freeing memory...
====================================================
...
====================================================
Grid dimension is 363 x 363
Launching 134931456 threads (N=134217728)
Input values: -0.658344 0.499804 -0.807257...
Output values: 0.321214 0.352100 0.288007...
Output values (CPU): 0.321214 0.352100 0.288007...
Timing results for n = 134217728
Calculation time (GPU): 0.255367
Calculation time (CPU): 4.635453
Freeing memory...
====================================================
```
So using pinned mapped memory seems to help quite a bit in this case, as the total time with pinned memory is less than the time used for transfer plus calculation in the previous examples.
### 2.3) Calling CUDA Kernels from R (RCUDA)
When we want to use CUDA from R, the kernel function remains the same, but the pre- and post-processing is done in R rather than in C. Here's an example, with the same log-likelihood kernel. The CUDA kernel code is saved in a [separate file (calc_loglik.cu on the github repo)](https://github.com/berkeley-scf/gpu-workshop-2014/blob/master/calc_loglik.cu) but is identical to that in the full CUDA+C example above (with the exception that we need to wrap the kernel function in `extern "C"`).
Here's the [code (RCUDAexample.R on the github repo)](https://github.com/berkeley-scf/gpu-workshop-2014/blob/master/RCUDAexample.R)
In this example we see that we can either transfer data between CPU and GPU manually or have RCUDA do it for us. If we didn't want to overwrite the input, but rather to allocate separate space for the output on the GPU, we could use `cudaAlloc()`. See `help(.cuda)` for some example code.
We need to compile the kernel into a ptx object file, either outside of R:
```{RCUDAexample-compile, engine='bash', eval=FALSE}
nvcc --ptx -arch=compute_20 -code=sm_20,compute_20 -o calc_loglik.ptx calc_loglik.cu
```
or inside of R:
```{RCUDAexample-compile-inR, engine='R', eval=FALSE}
ptx = nvcc(file = 'calc_loglik.cu', out = 'calc_loglik.ptx', target = 'ptx', '-arch=compute_20', '-code=sm_20,compute_20')
```
Here are some results:
```{RCUDAexample-output, eval=FALSE, engine='bash'}
Setting cuGetContext(TRUE)...
Grid size:
[1] 363 363 1
Total number of threads to launch = 134931456
Running CUDA kernel...
Input values: 0.8966972 0.2655087 0.3721239
Output values: 0.2457292 0.2658912 0.2656543
Output values (implicit transfer): 0.2457292 0.2658912 0.2656543
Output values (CPU with R): 0.2457292 0.2658912 0.2656543
Transfer to GPU time: 0.374
Calculation time (GPU): 0.078
Transfer from GPU time: 0.689
Calculation time (CPU): 9.981
Combined calculation+transfer via .cuda time (GPU): 4.303
```
So the transfer time is again large in relative terms; without it, the speedup would be substantial. Strangely, the streamlined call in which RCUDA handles the transfer is quite a bit slower, for reasons that are not clear to me, but the RCUDA developer (Duncan Temple Lang at UC Davis) is looking into this.
We can avoid explicitly specifying block and grid dimensions by using the `gridBy` argument to `.cuda`, which we'll see in a later example.
WARNING #1: be very careful that the types of the R objects passed to the kernel match what the kernel is expecting. Otherwise the code can hang without an informative error message.
WARNING #2: Note the use of the `strict=TRUE` argument when passing values to the GPU. This ensures that numeric values are kept as doubles and not coerced to floats.
### 2.4) Calling CUDA Kernels from Python plus GPU-vectorized Calculations (PyCUDA)
With PyCUDA the kernel code can be directly embedded in the Python script. Otherwise it's fairly similar to the use of RCUDA. Here's the [code (PyCUDAexample.py on the github repo)](https://github.com/berkeley-scf/gpu-workshop-2014/blob/master/PyCUDAexample.py)
Here are some results:
```{PyCUDAexample-output, eval=FALSE, engine='bash'}
Generating random normals...
Running GPU code...
Time for calculation (GPU): 1.512139s
Running Scipy CPU code...
Time for calculation (CPU): 21.398803s
Output from GPU: 0.168458 0.174912 0.252148
Output from CPU: 0.168458 0.174912 0.252148
```
WARNING: As was the case with R, be careful that the types of the Python objects passed to the kernel match what the kernel is expecting.
PyCUDA also provides high-level functionality for vectorized calculations on the GPU. Basically you create a vector stored in GPU memory and then operate on it with a variety of mathematical functions. The modules that do this are `gpuarray` and `cumath`.
Here's the [code (gpuArrayExample.py on the github repo)](https://github.com/berkeley-scf/gpu-workshop-2014/blob/master/gpuArrayExample.py)
Here are the timing results.
```{gpuArrayExample-output, eval=FALSE, engine='bash'}
Transfer to GPU time: 0.314641s
Timing vectorized exponentiation:
GPU array calc time: 0.226006s
CPU calc time: 3.155150s
Timing vectorized dot product/sum of squares:
GPU array calc time: 0.254579s
CPU calc time: 0.088157s
```
So the fully vectorized calculation sees a pretty good speed-up, but the dot product, which involves a reduction (the summation), does not. Also note that some compilation gets done when the code is run the first time, which makes the GPU calculation slow the first time but not on subsequent runs.
# 3) Random Number Generation (RNG) on the GPU
RNG is done via the CURAND (CUDA Random Number Generation) library. CURAND provides several different generators including the Mersenne Twister (the default in R).
### 3.1) Seeds and Sequences
From the CUDA documentation:
> For the highest quality parallel pseudorandom number generation, each experiment should be assigned a unique seed. Within an experiment, each thread of computation should be assigned a unique sequence number. If an experiment spans multiple kernel launches, it is recommended that threads between kernel launches be given the same seed, and sequence numbers be assigned in a monotonically increasing way. If the same configuration of threads is launched, random state can be preserved in global memory between launches to avoid state setup time.
A lot of important info... we'll interpret/implement much of it in the demo below.
Recall that RNG on a computer involves generation of pseudo-random numbers from a deterministic, periodic sequence. The seed determines where one starts generating from within that sequence. The idea of the sequence numbers is to generate from non-overlapping blocks within the sequence, with each thread getting a different block.
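Here's a sketch of how that translates into CURAND's device API (my own illustration of the seed/sequence idea, not the workshop's random.cu): every thread gets the same seed, uses its thread index as the sequence number, and saves its state between kernel launches.
```c
#include <curand_kernel.h>

// one RNG state per thread; same seed for all, distinct sequence numbers
__global__ void setup_rng(curandState *states, unsigned long long seed, int nthreads) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < nthreads)
        curand_init(seed, id, 0, &states[id]);   // arguments: seed, sequence, offset, state
}

__global__ void draw_normals(curandState *states, double *out, int nthreads, int perThread) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < nthreads) {
        curandState local = states[id];          // work with a local copy of the state
        for (int k = 0; k < perThread; k++)
            out[id * perThread + k] = curand_normal_double(&local);
        states[id] = local;                      // save state in global memory for later launches
    }
}
```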
### 3.2) Calling CURAND via RCUDA
For RNG, we need a kernel to initialize the RNG on each thread and one to do the sampling (though they could be combined in a single kernel). Note that the time involved in initializing the RNG for each thread is substantial. This shouldn't be a problem if one is doing a lot of calculations over time. To amortize this one-time expense, I generate multiple random numbers per thread. Here's the [kernel code (random.cu on the github repo)](https://github.com/berkeley-scf/gpu-workshop-2014/blob/master/random.cu).
And here's the [R code (RNGexample.R on the github repo)](https://github.com/berkeley-scf/gpu-workshop-2014/blob/master/RNGexample.R) to call the kernel, which looks very similar to the RCUDA code we've already seen.
We get a pretty good speed-up, which would be even more impressive if we could set up the calculations such that we don't need to transfer the whole large vector back to the CPU.
```{RNGexample-output, eval=FALSE, engine='bash'}
RNG initiation time: 0.115
GPU memory allocation time: 0.003
Calculation time (GPU): 0.256
Transfer from GPU time: 0.501
--------------------------------------
Total time (GPU): 0.875
--------------------------------------
Calculation time (CPU): 9.963
```
Also note the memory cost of the RNG states for the threads, 48 bytes per thread, which could easily exceed GPU memory if one starts up many threads (e.g., 100 million threads would need about 4.8 GB, more than this GPU's 4 GB).
One more note on RCUDA: we can have RCUDA decide on the gridding. Here's a modification of the RNG example to do this:
```{gridBy, eval=FALSE}
.cuda(rnorm, rng_states, dX, N, mu, sigma, N_per_thread, gridBy = nthreads, .numericAsDouble = getOption("CUDA.useDouble", TRUE))
```
At the moment, I'm not sure how to choose the RNG generator from within R.
### 3.3) Calling CURAND from C and from Python
I may flesh this out at some point, but by looking at the RNG example via RCUDA and the examples of calling kernels from C and Python in the previous section, it should be straightforward to do RNG on the GPU controlled by C or Python.
To choose the generator in C, something like this should work (in this case choosing the GPU-oriented variant of the Mersenne Twister): `curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MTGP32)`.
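Here's a sketch of the CURAND host API along these lines (my own illustration; it assumes `d_x` is an n-vector already allocated in GPU memory, and you should check the CURAND documentation for the exact requirements of each generator):
```c
#include <curand.h>

curandGenerator_t gen;
curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MTGP32);   // GPU-oriented Mersenne Twister variant
curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
curandGenerateNormalDouble(gen, d_x, n, 0.0, 1.0);       // fill d_x with N(0,1) draws, on the GPU
curandDestroyGenerator(gen);
```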
# 4) Linear Algebra on the GPU
We'll start with very high-level use of the GPU by simply calling linear algebra routines that use the GPU. The simplest approach for this is to use R's `magma` package.
Note that in the timing results, I am comparing to timing with the CPU on the VM. The VM reports 8 virtual CPUs but in some of the calculations does not seem to exploit all of the CPUs, so be wary of the head to head comparisons.
### 4.1) Using MAGMA via R
The MAGMA library provides a drop-in replacement for the functionality of BLAS and LAPACK, carrying out linear algebra on both the CPU and GPU and choosing smartly where to do the various pieces of the calculation.
R's magma package provides a front-end to MAGMA, with functionality for arithmetic operations, backsolve, matrix multiplication, Cholesky, inverse, crossproduct, LU, QR, and solve. See `help("magma-class")` for a list, as `library(help = magma)` only lists a few of the functions in the package.
[Note for Demo: As we run the calculations on the GPU, let's look at the computation with our gtop utility.]
Here's the [example code (RmagmaExample.R on the github repository)](https://github.com/berkeley-scf/gpu-workshop-2014/blob/master/RmagmaExample.R).
Note that by default we are using MAGMA's GPU interface. For more about this, see the next section of this document.
Here are some timing results:
```{magma-R-output, eval=FALSE, engine='bash'}
Timing for n= 4096
GPU time: 3.27
CPU time: 5.92
Timing for n= 8192
GPU time: 20.19
CPU time: 47.04
Check for use of double precision empirically
[1] 0.000000000000000e+00 8.185452315956354e-11
[1] 8433.16034596550344 -20.63245489979067 13.58046013130892
[1] 8433.16034596551981 -20.63245489979058 13.58046013130881
```
Remember to be careful of memory use, as GPU memory may be limited (on the EC2 instance, it's 4 GB).
Testing the same computation on an 8-core SCF physical machine gives 2.6 seconds for n=4096 and 20.8 seconds for n=8192 using threaded OpenBLAS. On the VM, the CPU-based BLAS/LAPACK calculations seem to be using only 2 cores, for some reason that is not clear to me, even though the VM has 8 virtual cores.
### 4.2) Using C to Call CUDA, CUBLAS, and MAGMA
Next let's use CUDA and MAGMA calls directly in C code. Both CUDA (through CUBLAS) and MAGMA provide access to BLAS functionality, but only MAGMA provides LAPACK-like functionality (i.e., matrix factorizations/decompositions). Note that we'll now need to directly manage memory allocation on the GPU and transferring data back and forth from CPU to GPU.
#### CUDA and CUBLAS
The code doesn't look too different than C code or calls to BLAS/LAPACK, but we use some CUDA functions and CUDA types. Here's the [example code (cudaBlasExample.c on the github repo)](https://github.com/berkeley-scf/gpu-workshop-2014/blob/master/cudaBlasExample.c).
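The core of such code is a call like the following sketch, using the CUBLAS v2 API (not a verbatim excerpt from cudaBlasExample.c; `dA`, `dB`, and `dC` are assumed to be n x n column-major matrices already in GPU memory):
```c
#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);
double alpha = 1.0, beta = 0.0;
// C = alpha * A * B + beta * C, with all operands resident in GPU memory
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
            &alpha, dA, n, dB, n, &beta, dC, n);
cublasDestroy(handle);
```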
Compilation goes as follows, using `nvcc`, the analog of `gcc` when compiling for the GPU. As when compiling standard C code, we need to be careful about compiler flags, header files, and linking. Note that in this case nvcc does not want the file to have a .C or .cu extension.
```{cuda-compile, eval=FALSE, engine='bash'}
nvcc cudaBlasExample.c -I/usr/local/cuda/include -lcublas -o cudaBlasExample
```
And here are (some of) the results:
```{cudaBlas-example-output, eval=FALSE, engine='bash'}
Starting
====================================================
Timing results for n = 512
GPU memory allocation time: 0.001930
Transfer to GPU time: 0.000777
Matrix multiply time: 0.003121
Transfer from GPU time: 0.001484
====================================================
Timing results for n = 4096
GPU memory allocation time: 0.002925
Transfer to GPU time: 0.040283
Matrix multiply time: 1.476518
Transfer from GPU time: 0.144702
====================================================
Timing results for n = 8192
GPU memory allocation time: 0.002807
Transfer to GPU time: 0.159034
Matrix multiply time: 11.807786
Transfer from GPU time: 0.535246
```
For (rough) comparison, the $n=8192$ multiplication on one of the SCF cluster nodes in R (using ACML as the BLAS) takes 74 seconds with one core and 11 seconds with 8 cores.
#### MAGMA
Now let's see the use of MAGMA. MAGMA provides calls analogous to those of CUDA/CUBLAS for allocating memory, transferring data, and BLAS operations, as well as LAPACK-type calls. Unfortunately, the MAGMA documentation online appears to be seriously out-of-date, documenting version 0.2 whereas the current version of the software is 1.4.0.
Note that the LAPACK type calls have a CPU interface and a GPU interface. The GPU interface calls have function names ending in '_gpu' and operate on data objects in GPU memory. The CPU interface calls operate on data objects in CPU memory, handling the transfer to GPU memory as part of the calculation.
Also, one can use 'pinned' memory on the CPU, which can reduce the transfer time for data to and from the GPU. However, it can increase the time for the original memory allocation on the CPU. In the example, a command-line argument (`use_pinned`) controls which is used.
Here we'll compare timing for the GPU vs. standard BLAS/LAPACK as well as the CPU and GPU interfaces for the Cholesky.
Here's the [example code (magmaExample.c on the github repo)](https://github.com/berkeley-scf/gpu-workshop-2014/blob/master/magmaExample.c).
Compilation and execution (with and without pinned memory) go as follows. Note we can use gcc and that we need to link in the CPU BLAS and LAPACK since MAGMA uses both CPU and GPU for calculations (plus in this example I directly call BLAS and LAPACK functions).
```{magma-compile, eval=FALSE, engine='bash'}
gcc magmaExample.c -O3 -DADD_ -fopenmp -DHAVE_CUBLAS -I/usr/local/cuda/include \
-I/usr/local/magma/include -L/usr/local/cuda/lib64 -L/usr/local/magma/lib -lmagma \
-llapack -lblas -lcublas -o magmaExample
./magmaExample 1
./magmaExample 0
```
And here are (some of) the results:
```{magma-example-output, eval=FALSE, engine='bash'}
Starting
Setting use_pinned to 1
====================================================
Timing results for n = 512
GPU memory allocation time: 0.001107
Transfer to GPU time: 0.000235
Matrix multiply time (GPU): 0.002965
Matrix multiply time (BLAS): 0.025640
Cholesky factorization time (GPU w/ GPU interface): 0.006872
Cholesky factorization time (GPU w/ CPU interface): 0.006856
Cholesky factorization time (LAPACK): 0.004908
Transfer from GPU time: 0.000252
====================================================
Timing results for n = 4096
GPU memory allocation time: 0.001535
Transfer to GPU time: 0.014882
Matrix multiply time (GPU): 1.471109
Matrix multiply time (BLAS): 9.641088
Cholesky factorization time (GPU w/ GPU interface): 0.303083
Cholesky factorization time (GPU w/ CPU interface): 0.316703
Cholesky factorization time (LAPACK): 1.537566
Transfer from GPU time: 0.016509
====================================================
Timing results for n = 8192
GPU memory allocation time: 0.004967
Transfer to GPU time: 0.063750
Matrix multiply time (GPU): 11.766860
Matrix multiply time (BLAS): 77.529439
Cholesky factorization time (GPU w/ GPU interface): 2.126884
Cholesky factorization time (GPU w/ CPU interface): 2.161343
Cholesky factorization time (LAPACK): 12.017636
Transfer from GPU time: 0.072997
Setting use_pinned to 0
====================================================
Timing results for n = 512
GPU memory allocation time: 0.002136
Transfer to GPU time: 0.001055
Matrix multiply time (GPU): 0.002969
Matrix multiply time (BLAS): 0.029986
Cholesky factorization time (GPU w/ GPU interface): 0.009177
Cholesky factorization time (GPU w/ CPU interface): 0.011693
Cholesky factorization time (LAPACK): 0.004929
Transfer from GPU time: 0.002238
====================================================
Timing results for n = 4096
GPU memory allocation time: 0.002929
Transfer to GPU time: 0.056978
Matrix multiply time (GPU): 1.471277
Matrix multiply time (BLAS): 9.951325
Cholesky factorization time (GPU w/ GPU interface): 0.308102
Cholesky factorization time (GPU w/ CPU interface): 0.356540
Cholesky factorization time (LAPACK): 1.551262
Transfer from GPU time: 0.136033
====================================================
Timing results for n = 8192
GPU memory allocation time: 0.004951
Transfer to GPU time: 0.226058
Matrix multiply time (GPU): 11.767153
Matrix multiply time (BLAS): 78.473712
Cholesky factorization time (GPU w/ GPU interface): 2.125327
Cholesky factorization time (GPU w/ CPU interface): 2.286922
Cholesky factorization time (LAPACK): 12.059586
Transfer from GPU time: 0.545454
```
So we see decent speed-ups both for the matrix multiplication and the Cholesky factorization; however this is in comparison to a CPU calculation that only seems to use 2 of the 8 cores available.
Using the CPU interface seems to provide a modest speedup (compared to the manual transfer + calculation time) as does using pinned memory. Note that the transfer time is non-negligible in certain ways (but not all ways) in this example.
# 5) Some Final Comments
### 5.1) Some Thoughts on Improving Computational Speed
[Suchard et al. (2010; Journal of Computational and Graphical Statistics 19:419)](http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.10016#.U2GTuBUgoWk) and [Lee et al. (2010; Journal of Computational and Graphical Statistics 19:769)](http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.10039#.U2GT9BUgoWk) discuss the use of GPUs for statistics. The speedups they see reach as high as 120 times the speed of a single CPU core and 500 times a single CPU core, respectively. Some possible reasons for these more impressive speedups relative to those seen here include:
* Use of single precision floating point calculations. If single precision doesn't affect your calculation substantively, this is worth trying. Particularly on older GPUs (though perhaps still true now), single precision was much faster than double precision. However, in tests in which I switched the kernelExample.cu and RNGexample.R/random.cu examples to single precision, there was very little change in speed.
* Careful use of shared memory (shared amongst the threads in a block) in place of the main GPU memory (see the Suchard et al. paper)
* Computational tasks that are very arithmetically intensive but with limited memory access (see the Lee et al. paper)
So for some tasks and possibly with some additional coding effort, you may see speedups of 100-200 fold compared to a single CPU core, rather than the 40 fold speedup that is about the best seen in the demos here.
Finally, rather than bringing a large chunk of data back to the CPU, you might do a reduction/aggregation operation (e.g., summing over values) in GPU memory. Here's a [presentation](http://will-landau.com/gpu/lectures/cudac-atomics/cudac-atomics.pdf) with some useful information on how to do this.
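To give a flavor of what such a reduction looks like, here is a standard per-block summation sketch using shared memory (my own illustration, assuming the number of threads per block is a power of two); the per-block sums can then be added up on the CPU or by a second, much smaller kernel launch.
```c
__global__ void block_sum(const double *x, double *blockSums, int n) {
    extern __shared__ double sdata[];                  // shared memory, sized at launch time
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? x[i] : 0.0;
    __syncthreads();
    // tree reduction within the block (blockDim.x must be a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];    // one partial sum per block goes back to global memory
}
// launch: block_sum<<<nBlocks, nThreads, nThreads * sizeof(double)>>>(d_x, d_blockSums, n);
```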
### 5.2) A Comment on Compilation
* If you compile CUDA code into an object file, you can link that with other object files (e.g., from C or C++ code) into an executable that can operate on CPU and GPU. This also means you could compile a shared object (i.e., a library) that you could call from R with .C, .Call, or Rcpp.