Programming GPUs with SYCL

  • Introduction to GPGPU

    Why program GPUs
    CPU vs GPU architecture
    General GPU programming tips

  • SYCL for OpenCL

    Overview
    Features

  • SYCL example

    Vector add

Introduction to GPGPU

Why program GPUs

  • Need for parallelism to gain performance

    the “free lunch” provided by Moore’s law is over
    adding even more CPU cores shows diminishing returns

  • GPUs are extremely efficient for

    data-parallel tasks
    arithmetic-heavy computations

CPU vs GPU architecture

"cpugpu"

CPU:

  • task parallelism
  • small number of large cores
  • separate instructions execute on each core independently
  • higher power consumption
  • lower memory bandwidth
  • random memory access

GPU:

  • data parallelism
  • large number of small execution units
  • a single instruction executes on multiple execution units in lock-step
  • lower power consumption
  • higher memory bandwidth
  • sequential memory access

Common CPU architecture

"cpugpu"

Common GPU architecture

"gpugpu"

Common system architecture

"commarch"

GPUs execute in lock-step

"lockstep"
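
The lock-step behavior can be sketched on the host in plain C++. This is only a toy model, not GPU code: the wave width `kWaveSize`, the `Wave` and `WaveResult` types, and the function `run_branch_in_lockstep` are hypothetical names made up for illustration. The model counts how many sides of a branch a whole wave has to step through when its lanes disagree on the condition.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical wave width, chosen only for illustration.
constexpr std::size_t kWaveSize = 8;
using Wave = std::array<int, kWaveSize>;

struct WaveResult {
  Wave values;  // per-lane results
  int passes;   // number of branch sides the whole wave stepped through
};

// Models `out = (x % 2 == 0) ? x * 2 : x + 1` for one wave in lock-step.
WaveResult run_branch_in_lockstep(const Wave& input) {
  std::array<bool, kWaveSize> takes_then{};
  bool any_then = false;
  bool any_else = false;
  for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
    takes_then[lane] = (input[lane] % 2 == 0);
    any_then = any_then || takes_then[lane];
    any_else = any_else || !takes_then[lane];
  }

  WaveResult result{Wave{}, 0};
  if (any_then) {  // whole wave steps through the "then" side
    ++result.passes;
    for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
      if (takes_then[lane]) result.values[lane] = input[lane] * 2;  // other lanes masked
    }
  }
  if (any_else) {  // whole wave steps through the "else" side
    ++result.passes;
    for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
      if (!takes_then[lane]) result.values[lane] = input[lane] + 1;  // other lanes masked
    }
  }
  assert(result.passes >= 1);  // every lane takes one side, so at least one pass ran
  return result;
}
```

In this model a wave whose lanes all agree needs one pass, while a wave with mixed conditions needs two, which is why divergent branches cost more.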

GPUs access memory sequentially

"accmem"
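
A rough way to see why contiguous access matters is to count how many memory blocks one wave touches. The sketch below is a simplified host-side model in plain C++, not a description of any particular GPU: the function `blocks_touched` and its parameters (lane count, stride, elements per block) are assumptions introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct memory blocks touched when lane i reads element i * stride.
// `elems_per_block` is a hypothetical block size, measured in elements.
std::size_t blocks_touched(std::size_t lanes, std::size_t stride,
                           std::size_t elems_per_block) {
  assert(elems_per_block > 0);
  std::set<std::size_t> blocks;
  for (std::size_t lane = 0; lane < lanes; ++lane) {
    blocks.insert((lane * stride) / elems_per_block);
  }
  return blocks.size();
}
```

With 8 lanes and 8 elements per block, a stride of 1 stays within a single block, while a stride of 8 touches a different block for every lane.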

General GPU programming tips

  • ensure the task is suitable
    GPUs are most efficient for data-parallel tasks
    the performance gain from computing on the GPU must exceed the cost of moving the data
  • avoid branching
    waves of processing elements execute in lock-step
    both sides of a branch are executed, with the inactive side masked out
  • avoid non-coalesced memory access
    GPUs access memory more efficiently when it is read and written in contiguous blocks
  • avoid expensive data movement
    data movement between CPU and GPU memory is often the bottleneck in GPU programming
    it’s important to keep data as close to the processing as possible
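
The first tip, that the computing gain must exceed the cost of moving data, can be written down as a small back-of-the-envelope check. This is a deliberately crude sketch: the function names and every input (timings, byte counts, bandwidth) are hypothetical values the caller would have to measure, and real transfers also have latency and may overlap with computation.

```cpp
#include <cassert>

// Time to move `bytes_moved` over a link with the given bandwidth (bytes per second).
double transfer_seconds(double bytes_moved, double bandwidth_bytes_per_second) {
  assert(bandwidth_bytes_per_second > 0.0);
  return bytes_moved / bandwidth_bytes_per_second;
}

// True if the time saved by computing on the GPU exceeds the time spent moving data.
bool offload_pays_off(double cpu_seconds, double gpu_seconds,
                      double bytes_moved, double bandwidth_bytes_per_second) {
  const double compute_gain = cpu_seconds - gpu_seconds;
  return compute_gain > transfer_seconds(bytes_moved, bandwidth_bytes_per_second);
}
```

For example, saving 0.9 s of compute while spending 0.1 s on transfers is worthwhile, whereas saving 0.1 s while spending 0.4 s on transfers is not.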

SYCL for OpenCL

What is OpenCL

  • allows you to write kernels that execute on accelerators
  • allows you to copy data between the host CPU and accelerators
  • supports a wide range of devices
  • comes in two components
    host-side C API for enqueueing kernels and copying data
    device-side OpenCL C language for writing kernels

Motivation of SYCL

  • make heterogeneous programming more accessible
    provide a foundation for efficient and portable template algorithms
  • create a C++ for OpenCL ecosystem
    define an open portable standard
    provide the performance and portability of OpenCL
    base only on standard C++
  • provide a high-level shared source model
    provide a high-level abstraction over OpenCL boilerplate code
    allow C++ template libraries to target OpenCL
    allow type safety across host and device

"sycl"

"ecosystem"

How does shared source work? Regular C++ (single source)
"Regularc"

How does shared source work? OpenCL (separate source)
"opencl"

How does shared source work? SYCL (shared source)
"sycl-shared-source"

"separate"

"dependency"

Supported subset of C++ in device code

Supported features

  • static polymorphism
  • lambdas
  • classes
  • operator overloading
  • templates
  • placement new

Unsupported features

  • dynamic polymorphism
  • dynamic allocation
  • exception handling
  • RTTI
  • static variables
  • function pointers

"highapi"

SYCL example

Vector add

"add1"
"add2"
"add3"
"add4"
"add5"
"add6"
"add7"
"add8"
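
A vector add in SYCL might look roughly like the sketch below. It is not a transcription of the slides: the function name `vector_add` and the kernel name `vector_add_kernel` are made up here, and the code assumes the SYCL 2020 header `<sycl/sycl.hpp>` and `sycl::` namespace (SYCL 1.2.1 implementations use `<CL/sycl.hpp>` and `cl::sycl::` instead). The `__has_include` guard and the host-loop fallback are additions so that the sketch still compiles as plain C++ where no SYCL implementation is installed; they are not part of SYCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if __has_include(<sycl/sycl.hpp>)
#include <sycl/sycl.hpp>
#define SKETCH_HAVE_SYCL 1
#else
#define SKETCH_HAVE_SYCL 0
#endif

// Returns the element-wise sum of a and b.
std::vector<float> vector_add(const std::vector<float>& a,
                              const std::vector<float>& b) {
  assert(a.size() == b.size());
  std::vector<float> c(a.size());
  if (c.empty()) return c;  // avoid creating zero-sized buffers

#if SKETCH_HAVE_SYCL
  {
    sycl::queue q;  // default device selection
    const sycl::range<1> n(c.size());

    // Buffers manage the data and its movement between host and device.
    sycl::buffer<float, 1> buf_a(a.data(), n);
    sycl::buffer<float, 1> buf_b(b.data(), n);
    sycl::buffer<float, 1> buf_c(c.data(), n);

    q.submit([&](sycl::handler& cgh) {
      // Accessors declare how the kernel uses each buffer.
      auto in_a = buf_a.get_access<sycl::access::mode::read>(cgh);
      auto in_b = buf_b.get_access<sycl::access::mode::read>(cgh);
      auto out = buf_c.get_access<sycl::access::mode::write>(cgh);

      // One work-item per element; the lambda is the device kernel.
      cgh.parallel_for<class vector_add_kernel>(
          n, [=](sycl::id<1> i) { out[i] = in_a[i] + in_b[i]; });
    });
  }  // buf_c is destroyed here, which waits for the kernel and updates c
#else
  // Fallback when no SYCL implementation is available: plain host loop.
  for (std::size_t i = 0; i < c.size(); ++i) {
    c[i] = a[i] + b[i];
  }
#endif
  return c;
}
```

Host and device code live in the same source file, and the kernel is an ordinary C++ lambda, which is the shared source model described earlier. Scoping the buffers in a block keeps the data on the device only for as long as the kernel needs it.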