cs179/GPU cuBLAS and Graphics

The use of Graphics Processing Units for rendering is well known, but their power for general parallel computation has only recently been explored. Parallel algorithms running on GPUs can often achieve up to 100x speedup over similar CPU algorithms, with many existing applications for physics simulations, signal processing, financial modeling, neural networks, and countless other fields.
This course will cover programming techniques for the GPU. The course will introduce NVIDIA’s parallel computing language, CUDA. Beyond covering the CUDA programming model and syntax, the course will also discuss GPU architecture, high performance computing on GPUs, parallel algorithms, CUDA libraries, and applications of GPU computing.

Problem sets will cover performance optimization and specific GPU applications such as numerical mathematics, medical imaging, finance, and other fields.


Goals for this week:

  • How do we use cuBLAS to accelerate linear algebra computations with already-optimized implementations of the Basic Linear Algebra Subprograms (BLAS)?
  • How can we use cuBLAS to perform multiple computations in parallel?
  • Learn about the cuBLAS API and why it sucks to read.
  • Learn to use cuBLAS to write optimized CUDA kernels for graphics, which we will also use later for machine learning.

What is BLAS?

https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms

  • BLAS defines a set of common functions we would want to apply to scalars, vectors, and matrices.
  • Libraries that implement it exist in almost all major languages.
  • The names of these functions are opaque and hard to follow, so keeping an index nearby is useful. There are different functions for different number types.

Some cuBLAS functions

  • cublasIsamax() : I - s - amax. Finds the smallest (first) index of the element with the largest absolute value in a single-precision vector (a usage sketch follows this list).
  • cublasSgemm() : generalized matrix-matrix multiplication with single-precision floats. With the right dimensions, this is also how you do vector-vector multiplication.
  • cublasDtrmm() : triangular matrix-matrix multiplication with double-precision floats. See what I mean?
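As a taste of the API, here is a minimal sketch of calling cublasIsamax(); the vector contents are made up for illustration. Note that the returned index follows the 1-based FORTRAN indexing convention.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int n = 4;
    const float h_x[4] = {1.0f, -7.5f, 3.0f, 2.0f};
    float *d_x;
    int result;          /* receives a 1-based index (FORTRAN convention) */
    cublasHandle_t handle;

    cublasCreate(&handle);
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);

    /* Finds the first index whose element has the largest |value| */
    cublasIsamax(handle, n, d_x, 1, &result);
    printf("amax index = %d\n", result);   /* prints 2: |-7.5| is largest */

    cudaFree(d_x);
    cublasDestroy(handle);
    return 0;
}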

The symbols used throughout these slides follow these conventions:

  • Scalars: α, β
  • Vectors: x, y
  • Matrices: A, B, C

BLAS (Basic Linear Algebra Subprograms) was originally written for FORTRAN, and cuBLAS follows its conventions: matrices are indexed column-major.
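A minimal sketch of what column-major indexing looks like in C, using the same IDX2C macro that appears in the example code later in these notes:

#define IDX2C(i,j,ld) (((j)*(ld))+(i))

void demo(void)
{
    float a[6] = {0};           /* a 2 x 3 matrix stored column-major, ld = 2 */

    /* Column-major: element (i, j) with leading dimension ld lives at
     * offset j*ld + i (a row-major layout would use i*ld + j instead). */
    a[IDX2C(0, 1, 2)] = 5.0f;   /* row 0, column 1, i.e. a[2] */
}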

There are 3 “levels” of functionality (one representative call per level is sketched after this list):

  • Level 1: Scalar and Vector, Vector and Vector operations, y ← αx + y
  • Level 2: Vector and Matrix operations, y ← αAx + βy
  • Level 3: Matrix and Matrix operations, C ← αAB + βC
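Below is a minimal sketch of one call per level, assuming a cuBLAS handle and device allocations have already been set up (the d_* names are hypothetical):

#include <cublas_v2.h>

/* One representative call per BLAS level; d_x, d_y are device vectors of
 * length n, and d_A, d_B, d_C are n x n column-major device matrices,
 * all assumed to be allocated and filled elsewhere. */
void levels_demo(cublasHandle_t handle, int n,
                 float *d_x, float *d_y,
                 float *d_A, float *d_B, float *d_C)
{
    const float alpha = 2.0f, beta = 0.5f;

    /* Level 1: y <- alpha*x + y */
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    /* Level 2: y <- alpha*A*x + beta*y */
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, d_A, n, d_x, 1, &beta, d_y, 1);

    /* Level 3: C <- alpha*A*B + beta*C */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A, n, d_B, n, &beta, d_C, n);
}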

What is cuBLAS good for?

Anything that relies on heavy linear algebra computations (on dense matrices) can benefit from GPU acceleration:

  • Graphics
  • Machine learning (this will be covered next week)
  • Computer vision
  • Physical simulations
  • Finance
  • etc.

The various cuBLAS types

The functions defined in cuBLAS come in several versions, one for each type of number in CUDA C:

  • S, s : single precision (32 bit) real float
  • D, d : double precision (64 bit) real float
  • C, c : single precision (32 bit) complex float (implemented as a float2)
  • Z, z : double precision (64 bit) complex float
  • H, h : half precision (16 bit) real float

cuBLAS function types

cublasSgemm → cublas S gemm

  • cublas : the prefix
  • S : single precision real float
  • gemm : general matrix-matrix multiplication

cublasHgemm

  • Same as before except half precision

cublasDgemv → cublas D gemv

  • D : double precision real float
  • gemv : general matrix vector multiplication

Numpy vs math vs cuBLAS

"1"

"2"

Error Checking

  • Like CUDA and cuFFT, cuBLAS has its own, similar but slightly different, status return type: cublasStatus_t.
  • We will use a macro similar to gpuErrchk and its cuFFT counterpart to check for cuBLAS errors; a sketch follows.
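A minimal sketch of such a macro, assuming the name cublasErrchk (the name is our choice, not part of the library):

#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>

/* cuBLAS analog of the gpuErrchk macro; cublasErrchk is a hypothetical
 * name, not something the library provides. */
#define cublasErrchk(ans) { cublasAssert((ans), __FILE__, __LINE__); }

static void cublasAssert(cublasStatus_t status, const char *file, int line)
{
    if (status != CUBLAS_STATUS_SUCCESS)
    {
        /* Older toolkits have no strerror-style helper for cuBLAS, so we
         * just print the numeric status code. */
        fprintf(stderr, "cublasAssert: error %d at %s:%d\n",
                (int)status, file, line);
        exit(EXIT_FAILURE);
    }
}

/* Usage: cublasErrchk(cublasCreate(&handle)); */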

Streaming Parallelism

  • Use cudaStreamCreate() to create a stream for computation.
  • Attach the stream to subsequent cuBLAS library calls using cublasSetStream().
  • When you call a routine, it will operate asynchronously, in parallel with cuBLAS calls running on other streams, to maximize parallelism.
  • You should pass your constant scalars by reference to help maximize this benefit. A sketch of this pattern follows the list.
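A minimal sketch of running two independent Sgemm calls on separate streams; the device pointers are hypothetical and assumed to be allocated and filled elsewhere:

#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Overlap two independent n x n gemms by attaching each to its own stream. */
void two_stream_gemm(cublasHandle_t handle, int n,
                     const float *d_A0, const float *d_B0, float *d_C0,
                     const float *d_A1, const float *d_B1, float *d_C1)
{
    const float alpha = 1.0f, beta = 0.0f;
    cudaStream_t s0, s1;

    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    /* Each call inherits whichever stream is currently set on the handle. */
    cublasSetStream(handle, s0);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A0, n, d_B0, n, &beta, d_C0, n);

    cublasSetStream(handle, s1);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A1, n, d_B1, n, &beta, d_C1, n);

    /* Both gemms can now run concurrently; wait for both to finish. */
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}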

cuBLAS Example

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Includes, cuda */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <helper_cuda.h>

#define IDX2C(i,j,ld) (((j)*(ld))+(i))

/* Matrix size */
#define N (275)

/* e.g. a[IDX2C(0, 1, 50)] is element (row 0, column 1) of a column-major
   matrix with leading dimension 50 */

/* Host implementation of a simple version of sgemm */
static void simple_sgemm(int n, float alpha, const float *A, const float *B,
                         float beta, float *C)
{
    int i;
    int j;
    int k;

    for (i = 0; i < n; ++i)
    {
        for (j = 0; j < n; ++j)
        {
            float prod = 0;

            for (k = 0; k < n; ++k)
            {
                prod += A[k * n + i] * B[j * n + k];
            }

            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}

/* Main */
int main(int argc, char **argv)
{
    cublasStatus_t status;
    float *h_A;
    float *h_B;
    float *h_C;
    float *h_C_ref;
    float *d_A = 0;
    float *d_B = 0;
    float *d_C = 0;
    float alpha = 1.0f;
    float beta = 0.0f;
    int n2 = N * N;
    int i;
    float error_norm;
    float ref_norm;
    float diff;
    cublasHandle_t handle;

    status = cublasCreate(&handle);

    /* Allocate host memory for the matrices */
    h_A = (float *)malloc(n2 * sizeof(h_A[0]));
    h_B = (float *)malloc(n2 * sizeof(h_B[0]));
    h_C = (float *)malloc(n2 * sizeof(h_C[0]));

    /* Fill the matrices with test data */
    for (i = 0; i < n2; i++)
    {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
        h_C[i] = rand() / (float)RAND_MAX;
    }

    /* Allocate device memory for the matrices */
    cudaMalloc((void **)&d_A, n2 * sizeof(d_A[0]));
    cudaMalloc((void **)&d_B, n2 * sizeof(d_B[0]));
    cudaMalloc((void **)&d_C, n2 * sizeof(d_C[0]));

    /* Initialize the device matrices with the host matrices */
    status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
    status = cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);
    status = cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);

    /* Perform the operation using plain C code; h_C now holds the CPU
       reference result, so keep a pointer to it */
    simple_sgemm(N, alpha, h_A, h_B, beta, h_C);
    h_C_ref = h_C;

    /* Perform the operation using cuBLAS */
    status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                         &alpha, d_A, N, d_B, N, &beta, d_C, N);

    /* Allocate host memory for reading back the result from device memory */
    h_C = (float *)malloc(n2 * sizeof(h_C[0]));

    /* Read the result back */
    status = cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);

    /* Check result against reference */
    error_norm = 0;
    ref_norm = 0;

    for (i = 0; i < n2; ++i)
    {
        diff = h_C_ref[i] - h_C[i];
        error_norm += diff * diff;
        ref_norm += h_C_ref[i] * h_C_ref[i];
    }

    error_norm = (float)sqrt((double)error_norm);
    ref_norm = (float)sqrt((double)ref_norm);

    /* Report whether the GPU result matches the CPU reference */
    if (error_norm / ref_norm < 1e-6f)
        printf("simpleCUBLAS test passed.\n");
    else
        printf("simpleCUBLAS test failed.\n");

    /* Memory clean up */
    free(h_A);
    free(h_B);
    free(h_C);
    free(h_C_ref);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    /* Shutdown */
    status = cublasDestroy(handle);

    return 0;
}
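To build a standalone sample like this one, link against the cuBLAS library, e.g. nvcc simple_cublas.cu -lcublas (the file name here is hypothetical). The helper_cuda.h header comes from the CUDA samples, so either add the samples' common include directory to the include path or drop that include.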