cs179/GPU cuBLAS and Graphics

The use of Graphics Processing Units for rendering is well known, but their power for general parallel computation has only recently been explored. Parallel algorithms running on GPUs can often achieve up to 100x speedup over similar CPU algorithms, with many existing applications for physics simulations, signal processing, financial modeling, neural networks, and countless other fields.
This course will cover programming techniques for the GPU. The course will introduce NVIDIA’s parallel computing language, CUDA. Beyond covering the CUDA programming model and syntax, the course will also discuss GPU architecture, high performance computing on GPUs, parallel algorithms, CUDA libraries, and applications of GPU computing.

Problem sets will cover performance optimization and specific GPU applications such as numerical mathematics, medical imaging, finance, and other fields.


Goals for this week:

  • How do we use cuBLAS to accelerate linear algebra computations with already-optimized implementations of the Basic Linear Algebra Subprograms (BLAS)?
  • How can we use cuBLAS to perform multiple computations in parallel?
  • Learn about the cuBLAS API and why it sucks to read.
  • Learn to use cuBLAS to write optimized CUDA kernels for graphics, which we will also use later for machine learning.

What is BLAS?

https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms

  • BLAS defines a set of common functions we would want to apply to scalars, vectors, and matrices.
  • Libraries that implement it exist in almost all major languages.
  • The names of these functions are opaque and hard to follow, so keeping an index nearby is useful. There are different functions for different number types.

Some cuBLAS functions

  • cublasIsamax() : I - s - amax. Finds the smallest (first) index of the element with the largest absolute value in a single-precision vector (a usage sketch follows this list).
  • cublasSgemm() : generalized matrix-matrix multiplication with single-precision floats. With the right dimensions, this is also how you do vector-vector multiplication.
  • cublasDtrmm() : triangular matrix-matrix multiplication with double-precision floats. See what I mean?
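As a taste of the API, here is a minimal sketch of calling cublasIsamax(); the vector contents are made up for illustration. Note that the returned index follows the 1-based FORTRAN indexing convention.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int n = 4;
    const float h_x[4] = {1.0f, -7.5f, 3.0f, 2.0f};
    float *d_x;
    int result;          /* receives a 1-based index (FORTRAN convention) */
    cublasHandle_t handle;

    cublasCreate(&handle);
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);

    /* Finds the first index whose element has the largest |value| */
    cublasIsamax(handle, n, d_x, 1, &result);
    printf("amax index = %d\n", result);   /* prints 2: |-7.5| is largest */

    cudaFree(d_x);
    cublasDestroy(handle);
    return 0;
}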

The symbols used throughout these slides follow these conventions:

  • Scalars: α, β
  • Vectors: x, y
  • Matrices: A, B, C

BLAS (Basic Linear Algebra Subprograms) was originally written for FORTRAN, and cuBLAS follows its conventions: matrices are indexed column-major.
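A minimal sketch of what column-major indexing looks like in C, using the same IDX2C macro that appears in the example code later in these notes:

#define IDX2C(i,j,ld) (((j)*(ld))+(i))

void demo(void)
{
    float a[6] = {0};           /* a 2 x 3 matrix stored column-major, ld = 2 */

    /* Column-major: element (i, j) with leading dimension ld lives at
     * offset j*ld + i (a row-major layout would use i*ld + j instead). */
    a[IDX2C(0, 1, 2)] = 5.0f;   /* row 0, column 1, i.e. a[2] */
}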

There are 3 “levels” of functionality (one representative call per level is sketched after this list):

  • Level 1: Scalar and Vector, Vector and Vector operations, y ← αx + y
  • Level 2: Vector and Matrix operations, y ← αAx + βy
  • Level 3: Matrix and Matrix operations, C ← αAB + βC
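Below is a minimal sketch of one call per level, assuming a cuBLAS handle and device allocations have already been set up (the d_* names are hypothetical):

#include <cublas_v2.h>

/* One representative call per BLAS level; d_x, d_y are device vectors of
 * length n, and d_A, d_B, d_C are n x n column-major device matrices,
 * all assumed to be allocated and filled elsewhere. */
void levels_demo(cublasHandle_t handle, int n,
                 float *d_x, float *d_y,
                 float *d_A, float *d_B, float *d_C)
{
    const float alpha = 2.0f, beta = 0.5f;

    /* Level 1: y <- alpha*x + y */
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    /* Level 2: y <- alpha*A*x + beta*y */
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, d_A, n, d_x, 1, &beta, d_y, 1);

    /* Level 3: C <- alpha*A*B + beta*C */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A, n, d_B, n, &beta, d_C, n);
}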

What is cuBLAS good for?

Anything that relies on heavy linear algebra computations (on dense matrices) can benefit from GPU acceleration:

  • Graphics
  • Machine learning (this will be covered next week)
  • Computer vision
  • Physical simulations
  • Finance
  • etc.

The various cuBLAS types

The functions defined in cuBLAS come in several versions, one for each type of number in CUDA C:

  • S, s : single precision (32 bit) real float
  • D, d : double precision (64 bit) real float
  • C, c : single precision (32 bit) complex float (implemented as a float2)
  • Z, z : double precision (64 bit) complex float
  • H, h : half precision (16 bit) real float

cuBLAS function types

cublasSgemm → cublas S gemm

  • cublas : the prefix
  • S : single precision real float
  • gemm : general matrix-matrix multiplication

cublasHgemm

  • Same as before except half precision

cublasDgemv → cublas D gemv

  • D : double precision real float
  • gemv : general matrix vector multiplication

Numpy vs math vs cuBLAS

"1"

"2"

Error Checking

  • Like CUDA and cuFFT, cuBLAS has its own, similar but slightly different, status return type: cublasStatus_t.
  • We will use a macro similar to gpuErrchk and its cuFFT counterpart to check for cuBLAS errors; a sketch follows.
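A minimal sketch of such a macro, assuming the name cublasErrchk (the name is our choice, not part of the library):

#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>

/* cuBLAS analog of the gpuErrchk macro; cublasErrchk is a hypothetical
 * name, not something the library provides. */
#define cublasErrchk(ans) { cublasAssert((ans), __FILE__, __LINE__); }

static void cublasAssert(cublasStatus_t status, const char *file, int line)
{
    if (status != CUBLAS_STATUS_SUCCESS)
    {
        /* Older toolkits have no strerror-style helper for cuBLAS, so we
         * just print the numeric status code. */
        fprintf(stderr, "cublasAssert: error %d at %s:%d\n",
                (int)status, file, line);
        exit(EXIT_FAILURE);
    }
}

/* Usage: cublasErrchk(cublasCreate(&handle)); */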

Streaming Parallelism

  • Use cudaStreamCreate() to create a stream for computation.
  • Attach the stream to subsequent cuBLAS library calls using cublasSetStream().
  • When you call a routine, it will operate asynchronously, in parallel with cuBLAS calls running on other streams, to maximize parallelism.
  • You should pass your constant scalars by reference to help maximize this benefit. A sketch of this pattern follows the list.
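A minimal sketch of running two independent Sgemm calls on separate streams; the device pointers are hypothetical and assumed to be allocated and filled elsewhere:

#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Overlap two independent n x n gemms by attaching each to its own stream. */
void two_stream_gemm(cublasHandle_t handle, int n,
                     const float *d_A0, const float *d_B0, float *d_C0,
                     const float *d_A1, const float *d_B1, float *d_C1)
{
    const float alpha = 1.0f, beta = 0.0f;
    cudaStream_t s0, s1;

    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    /* Each call inherits whichever stream is currently set on the handle. */
    cublasSetStream(handle, s0);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A0, n, d_B0, n, &beta, d_C0, n);

    cublasSetStream(handle, s1);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A1, n, d_B1, n, &beta, d_C1, n);

    /* Both gemms can now run concurrently; wait for both to finish. */
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}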

cuBLAS Example

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Includes, cuda */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <helper_cuda.h>

#define IDX2C(i,j,ld) (((j)*(ld))+(i))

/* Matrix size */
#define N (275)

/* e.g. a[IDX2C(0, 1, 50)] is element (row 0, column 1) of a column-major
   matrix with leading dimension 50 */

/* Host implementation of a simple version of sgemm */
static void simple_sgemm(int n, float alpha, const float *A, const float *B,
                         float beta, float *C)
{
    int i;
    int j;
    int k;

    for (i = 0; i < n; ++i)
    {
        for (j = 0; j < n; ++j)
        {
            float prod = 0;

            for (k = 0; k < n; ++k)
            {
                prod += A[k * n + i] * B[j * n + k];
            }

            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}

/* Main */
int main(int argc, char **argv)
{
    cublasStatus_t status;
    float *h_A;
    float *h_B;
    float *h_C;
    float *h_C_ref;
    float *d_A = 0;
    float *d_B = 0;
    float *d_C = 0;
    float alpha = 1.0f;
    float beta = 0.0f;
    int n2 = N * N;
    int i;
    float error_norm;
    float ref_norm;
    float diff;
    cublasHandle_t handle;

    status = cublasCreate(&handle);

    /* Allocate host memory for the matrices */
    h_A = (float *)malloc(n2 * sizeof(h_A[0]));
    h_B = (float *)malloc(n2 * sizeof(h_B[0]));
    h_C = (float *)malloc(n2 * sizeof(h_C[0]));

    /* Fill the matrices with test data */
    for (i = 0; i < n2; i++)
    {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
        h_C[i] = rand() / (float)RAND_MAX;
    }

    /* Allocate device memory for the matrices */
    cudaMalloc((void **)&d_A, n2 * sizeof(d_A[0]));
    cudaMalloc((void **)&d_B, n2 * sizeof(d_B[0]));
    cudaMalloc((void **)&d_C, n2 * sizeof(d_C[0]));

    /* Initialize the device matrices with the host matrices */
    status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
    status = cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);
    status = cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);

    /* Perform the operation using plain C code; h_C now holds the CPU
       reference result, so keep a pointer to it */
    simple_sgemm(N, alpha, h_A, h_B, beta, h_C);
    h_C_ref = h_C;

    /* Perform the operation using cuBLAS */
    status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                         &alpha, d_A, N, d_B, N, &beta, d_C, N);

    /* Allocate host memory for reading back the result from device memory */
    h_C = (float *)malloc(n2 * sizeof(h_C[0]));

    /* Read the result back */
    status = cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);

    /* Check result against reference */
    error_norm = 0;
    ref_norm = 0;

    for (i = 0; i < n2; ++i)
    {
        diff = h_C_ref[i] - h_C[i];
        error_norm += diff * diff;
        ref_norm += h_C_ref[i] * h_C_ref[i];
    }

    error_norm = (float)sqrt((double)error_norm);
    ref_norm = (float)sqrt((double)ref_norm);

    /* Report whether the GPU result matches the CPU reference */
    if (error_norm / ref_norm < 1e-6f)
        printf("simpleCUBLAS test passed.\n");
    else
        printf("simpleCUBLAS test failed.\n");

    /* Memory clean up */
    free(h_A);
    free(h_B);
    free(h_C);
    free(h_C_ref);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    /* Shutdown */
    status = cublasDestroy(handle);

    return 0;
}
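To build a standalone sample like this one, link against the cuBLAS library, e.g. nvcc simple_cublas.cu -lcublas (the file name here is hypothetical). The helper_cuda.h header comes from the CUDA samples, so either add the samples' common include directory to the include path or drop that include.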