Introduction to GPGPU
Why program GPUs
CPU vs GPU architecture
General GPU programming tips
SYCL for OpenCL
Overview
Features
SYCL example
Vector add
Introduction to GPGPU
Why program GPUs
Need for parallelism to gain performance
“Free lunch” provided by Moore’s law is over
adding even more CPU cores is showing diminishing returns
GPUs are extremely efficient for
data parallel tasks
arithmetic heavy computations
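To make "data parallel" and "arithmetic heavy" a little more concrete, here is a minimal sketch in plain C++. The function names `saxpy_element` and `saxpy` are made up for this example, and the loop runs sequentially on the CPU; it only stands in for the many execution units a GPU would use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-element "kernel": computes exactly one output element.
// Each call touches only index i, so the calls are independent of each other.
inline void saxpy_element(std::size_t i, float a,
                          const std::vector<float>& x,
                          const std::vector<float>& y,
                          std::vector<float>& out) {
    out[i] = a * x[i] + y[i];
}

// Sequential stand-in for a parallel launch: on a GPU, each index would be
// handled by a separate processing element rather than a loop iteration.
inline std::vector<float> saxpy(float a, const std::vector<float>& x,
                                const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        saxpy_element(i, a, x, y, out);
    }
    return out;
}
```

Because no iteration reads a value written by another, the work can in principle be split across as many execution units as there are elements.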
CPU vs GPU architecture
CPU:
- task parallelism
- small number of large cores
- separate instructions on each core independently
- higher power consumption
- lower memory bandwidth
- random memory access
GPU:
- data parallelism
- large number of small execution units
- single instruction on multiple execution units in lock-step
- lower power consumption
- higher memory bandwidth
- sequential memory access
common CPU architecture
common GPU architecture
common system architecture
GPUs execute in lock-step
GPUs access memory sequentially
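Why sequential access matters can be sketched with a small counting exercise. The code below assumes a wave of 32 lanes and 128-byte memory segments; both numbers are illustrative assumptions, not properties of any particular device, and `segments_touched` is a hypothetical helper written for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Illustrative constants only; real hardware values differ between devices.
constexpr std::size_t kLanesPerWave = 32;   // assumed wave width
constexpr std::size_t kSegmentBytes = 128;  // assumed memory transaction size

static_assert(sizeof(float) == 4, "example assumes 4-byte floats");

// Counts how many distinct memory segments one wave touches when lane i
// reads a float at element index (i * stride_in_elements).
inline std::size_t segments_touched(std::size_t stride_in_elements) {
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kLanesPerWave; ++lane) {
        std::size_t byte_address = lane * stride_in_elements * sizeof(float);
        segments.insert(byte_address / kSegmentBytes);
    }
    return segments.size();
}
```

Under these assumptions, a stride of 1 (neighbouring lanes read neighbouring floats) fits in a single segment, while a stride of 32 makes every lane land in a different segment, so the same amount of useful data needs many more memory transactions.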
General GPU programming tips
- ensure the task is suitable
GPUs are most efficient for data parallel tasks
performance gain from performing the computation > cost of moving data
- avoid branching
waves of processing elements execute in lock-step
both sides of branches execute with the other masked
- avoid non-coalesced memory access
GPUs access memory more efficiently if accessed as contiguous blocks
- avoid expensive data movement
the bottleneck in GPU programming is data movement between CPU and GPU memory
it’s important to have data as close to the processing as possible
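The cost of branching can be illustrated with a small CPU-side emulation of one wave. This is only a sketch: the 8-lane wave width, the `WaveResult` struct and the `run_wave` function are assumptions made for this example, and real hardware handles masking internally.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 8;  // assumed wave width, for illustration only

struct WaveResult {
    std::array<int, kLanes> values;
    int passes;  // how many branch bodies the whole wave had to step through
};

// Emulates `out = (in % 2 == 0) ? in * 2 : in + 1` across one wave.
// The wave steps through a branch body whenever at least one lane needs it;
// lanes whose mask bit is off do not write a result during that pass.
inline WaveResult run_wave(const std::array<int, kLanes>& in) {
    std::array<bool, kLanes> mask{};
    bool any_then = false, any_else = false;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        mask[lane] = (in[lane] % 2 == 0);
        any_then = any_then || mask[lane];
        any_else = any_else || !mask[lane];
    }

    WaveResult r{in, 0};
    if (any_then) {  // "then" side, odd lanes masked off
        ++r.passes;
        for (std::size_t lane = 0; lane < kLanes; ++lane)
            if (mask[lane]) r.values[lane] = in[lane] * 2;
    }
    if (any_else) {  // "else" side, even lanes masked off
        ++r.passes;
        for (std::size_t lane = 0; lane < kLanes; ++lane)
            if (!mask[lane]) r.values[lane] = in[lane] + 1;
    }
    return r;
}
```

When every lane takes the same side, the wave makes one pass; when the lanes disagree, it makes two, which is why divergent branches inside a wave cost roughly the sum of both sides.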
SYCL for OpenCL
what is OpenCL
- allows you to write kernels that execute on accelerators
- allows you to copy data between the host CPU and accelerators
- supports a wide range of devices
- comes in two components
Host side C API for enqueueing kernels and copying data
Device side OpenCL C language for writing kernels
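The split into two components might look roughly like the sketch below. The device code is OpenCL C held as a string inside the host program; the kernel name `vector_add` and the helper `vector_add_source` are chosen for this example. The host-side calls are only named in a comment, since using them requires an OpenCL SDK.

```cpp
#include <cassert>
#include <string>

// Device side: OpenCL C source, kept as plain text in the host program.
// The kernel name `vector_add` is chosen for this example.
const std::string kVectorAddSource = R"CLC(
__kernel void vector_add(__global const float* a,
                         __global const float* b,
                         __global float* c) {
    size_t i = get_global_id(0);  // index of this work-item
    c[i] = a[i] + b[i];
}
)CLC";

// Host side (not shown, requires an OpenCL SDK): the C API would take this
// string through clCreateProgramWithSource, clBuildProgram and clCreateKernel,
// copy buffers with clEnqueueWriteBuffer / clEnqueueReadBuffer, and launch the
// kernel with clEnqueueNDRangeKernel.
inline const std::string& vector_add_source() { return kVectorAddSource; }
```

Note that the host compiler sees the kernel only as text, so mistakes in the device code surface when the program is built at run time, not when the host code is compiled.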
Motivation of SYCL
- make heterogeneous programming more accessible
provide a foundation for efficient and portable template algorithms
- create a C++ for OpenCL ecosystem
define an open portable standard
provide the performance and portability of OpenCL
base only on standard C++
- provide a high-level shared source model
provide a high-level abstraction over OpenCL boilerplate code
allow C++ template libraries to target OpenCL
allow type safety across host and device
how does shared source work? Regular C++ (single source)
how does shared source work? OpenCL (separate source)
how does shared source work? SYCL (shared source)
Supported Subset of C++ in Device Code
Supported Features
- static polymorphism
- lambdas
- classes
- operator overloading
- templates
- placement new
Unsupported Features
- dynamic polymorphism
- dynamic allocation
- exception handling
- RTTI
- static variables
- function pointers
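A short sketch of code that stays within the supported subset, written in plain C++ so it runs on the host. `Vec2`, `for_each_index` and `add_vectors` are hypothetical names for this example; `for_each_index` is only a sequential stand-in for a kernel launch and is not part of SYCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A small class with operator overloading (both on the supported list).
struct Vec2 {
    float x, y;
    Vec2 operator+(const Vec2& o) const { return {x + o.x, y + o.y}; }
};

// Hypothetical stand-in for a kernel launch. The callable's type is a
// template parameter, so the call is resolved at compile time (static
// polymorphism) with no virtual functions or function pointers involved.
template <typename Kernel>
void for_each_index(std::size_t count, Kernel kernel) {
    for (std::size_t i = 0; i < count; ++i) kernel(i);
}

inline std::vector<Vec2> add_vectors(const std::vector<Vec2>& a,
                                     const std::vector<Vec2>& b) {
    assert(a.size() == b.size());
    std::vector<Vec2> out(a.size());
    // The lambda plays the role of the device function body.
    for_each_index(a.size(), [&](std::size_t i) { out[i] = a[i] + b[i]; });
    return out;
}
```

The same behaviour written with a virtual `apply()` method or a function pointer would rely on features from the unsupported list, which is why template-based dispatch is the usual substitute in device code.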