- Introduction to GPGPU
 Why program GPUs
 CPU vs GPU architecture
 General GPU programming tips
- SYCL for OpenCL
 Overview
 Features
- SYCL example
 Vector add
Introduction to GPGPU
Why program GPUs
- need for parallelism to gain performance
 the “free lunch” provided by Moore’s law is over
 adding even more CPU cores is showing diminishing returns
- GPUs are extremely efficient for
 data-parallel tasks
 arithmetic-heavy computations
CPU vs GPU architecture

CPU:
- task parallelism
- small number of large cores
- separate instructions execute on each core independently
- higher power consumption
- lower memory bandwidth
- random memory access
GPU:
- data parallelism
- large number of small execution units
- a single instruction executes across multiple execution units in lock-step
- lower power consumption
- higher memory bandwidth
- sequential memory access
Common CPU architecture

Common GPU architecture

Common system architecture

GPUs execute in lock-step
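To make the lock-step picture concrete, here is a minimal host-side sketch in plain C++ (it is not GPU code). It models a single wave whose lanes all evaluate the same `if`/`else`. The names `WaveResult` and `run_wave` and the toy branch bodies are hypothetical, and the model assumes that a branch side is stepped through by the whole wave whenever at least one lane takes it, with the other lanes masked off.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one wave executing, per lane:
//   if (x > 0) y = x * 2; else y = -x;
struct WaveResult {
    std::vector<int> values;  // per-lane result y
    int passes;               // branch sides the whole wave stepped through
};

WaveResult run_wave(const std::vector<int>& x) {
    const std::size_t lanes = x.size();
    std::vector<bool> takes_then(lanes);
    bool any_then = false;
    bool any_else = false;
    for (std::size_t lane = 0; lane < lanes; ++lane) {
        takes_then[lane] = x[lane] > 0;
        if (takes_then[lane]) {
            any_then = true;
        } else {
            any_else = true;
        }
    }

    WaveResult result{std::vector<int>(lanes, 0), 0};
    if (any_then) {  // "then" side: lanes that chose "else" are masked off
        ++result.passes;
        for (std::size_t lane = 0; lane < lanes; ++lane) {
            if (takes_then[lane]) result.values[lane] = x[lane] * 2;
        }
    }
    if (any_else) {  // "else" side: lanes that chose "then" are masked off
        ++result.passes;
        for (std::size_t lane = 0; lane < lanes; ++lane) {
            if (!takes_then[lane]) result.values[lane] = -x[lane];
        }
    }
    return result;
}
```

In this model, `run_wave({1, 2, 3, 4})` needs one pass, while `run_wave({1, -2, 3, -4})` needs two even though each lane only wanted one side of the branch.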

GPUs access memory sequentially
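A rough way to see why contiguous access matters is to count how many memory blocks a wave touches. The sketch below is plain host-side C++ under simplifying assumptions: lane `i` reads element `i * stride`, memory is fetched in fixed-size blocks, and the function name `blocks_touched` and all parameter values are hypothetical. Real hardware uses vendor-specific block sizes and wave widths.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct memory blocks one wave touches when lane i reads
// element i * stride. Fewer blocks means fewer memory transactions in
// this simplified model.
std::size_t blocks_touched(std::size_t lanes, std::size_t stride,
                           std::size_t elems_per_block) {
    assert(elems_per_block > 0);
    std::set<std::size_t> blocks;
    for (std::size_t lane = 0; lane < lanes; ++lane) {
        blocks.insert((lane * stride) / elems_per_block);
    }
    return blocks.size();
}
```

With 32 lanes and 16 elements per block, a stride of 1 touches 2 blocks, while a stride of 16 touches 32 blocks, one per lane.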

General GPU programming tips
- ensure the task is suitable
 GPUs are most efficient for data parallel tasks
 the performance gain from computing on the GPU must exceed the cost of moving the data
- avoid branching
 waves of processing elements execute in lock-step
 both sides of branches execute with the other masked
- avoid non-coalesced memory access
 GPUs access memory more efficiently if accessed as contiguous blocks
- avoid expensive data movement
 the bottleneck in GPU programming is data movement between CPU and GPU memory
 it’s important to have data as close to the processing as possible
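The suitability rule above (compute gain must exceed the cost of moving data) can be written as a back-of-the-envelope check. This is a sketch only: the names `transfer_time` and `offload_pays_off` are hypothetical, times are in seconds, bandwidth is in bytes per second, and the model ignores transfer latency and any overlap of copies with computation.

```cpp
#include <cassert>

// Time to move the data between host and device at a given bandwidth.
double transfer_time(double bytes_moved, double bandwidth) {
    return bytes_moved / bandwidth;
}

// Offloading pays off only if GPU compute time plus data movement
// is still shorter than doing the work on the CPU.
bool offload_pays_off(double cpu_time, double gpu_compute_time,
                      double bytes_moved, double bandwidth) {
    return gpu_compute_time + transfer_time(bytes_moved, bandwidth) < cpu_time;
}
```

For example, moving 1e9 bytes at 1e10 bytes per second costs 0.1 s. A task that takes 1.0 s on the CPU and 0.2 s on the GPU is worth offloading, but one that takes only 0.25 s on the CPU is not.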
SYCL for OpenCL
What is OpenCL
- allows you to write kernels that execute on accelerators
- allows you to copy data between the host CPU and accelerators
- supports a wide range of devices
- comes in two components
 host-side C API for enqueueing kernels and copying data
 device-side OpenCL C language for writing kernels
Motivation of SYCL
- make heterogeneous programming more accessible
 provide a foundation for efficient and portable template algorithms
- create a C++ for OpenCL ecosystem
 define an open portable standard
 provide the performance and portability of OpenCL
 be based only on standard C++
- provide a high-level shared source model
 provide a high-level abstraction over OpenCL boilerplate code
 allow C++ template libraries to target OpenCL
 allow type safety across host and device


How does shared source work? Regular C++ (single source)
How does shared source work? OpenCL (separate source)
How does shared source work? SYCL (shared source)


Supported subset of C++ in device code
Supported features
- static polymorphism
- lambdas
- classes
- operator overloading
- templates
- placement new
Unsupported features
- dynamic polymorphism
- dynamic allocation
- exception handling
- RTTI
- static variables
- function pointers
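
As an illustration of how the supported features can stand in for the unsupported ones, the sketch below selects an operation at compile time through a template parameter (static polymorphism) instead of a virtual call or a function pointer. It is plain host-side C++. The names `Add`, `Multiply` and `combine` are hypothetical, and the `std::vector` scaffolding is there only to make the example runnable on the host; it allocates dynamically and would not itself belong in device code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Function objects with overloaded operator(): no virtual functions,
// no function pointers, so the call target is known at compile time.
struct Add {
    float operator()(float a, float b) const { return a + b; }
};
struct Multiply {
    float operator()(float a, float b) const { return a * b; }
};

// Op is resolved statically; a lambda works here as well.
template <typename Op>
std::vector<float> combine(const std::vector<float>& a,
                           const std::vector<float>& b, Op op) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = op(a[i], b[i]);  // stands in for the per-element kernel body
    }
    return out;
}
```

Calling `combine(a, b, Add{})` and `combine(a, b, Multiply{})` instantiates two separate functions, which is the kind of dispatch that needs neither RTTI nor dynamic polymorphism.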

SYCL example
Vector add
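The original listing for this example is not present here, so what follows is a reconstruction sketch rather than the author's code. It is written in the SYCL 1.2.1 style (`cl::sycl` namespace, buffers, accessors, `parallel_for`). The function name `vector_add`, the kernel name `vector_add_kernel` and the macro `SKETCH_HAS_SYCL` are assumptions introduced for illustration. The preprocessor fallback to a plain loop exists only so the snippet still compiles where no SYCL header is installed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__has_include)
#if __has_include(<CL/sycl.hpp>)
#include <CL/sycl.hpp>
#define SKETCH_HAS_SYCL 1  // hypothetical macro, used only by this sketch
#endif
#endif

#ifdef SKETCH_HAS_SYCL
class vector_add_kernel;  // type used to name the kernel
#endif

std::vector<float> vector_add(const std::vector<float>& a,
                              const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    if (a.empty()) return c;
#ifdef SKETCH_HAS_SYCL
    namespace sycl = cl::sycl;
    {
        sycl::queue q;  // default device selection
        sycl::range<1> n(a.size());
        // Buffers manage copies between host and device memory.
        sycl::buffer<float, 1> buf_a(a.data(), n);
        sycl::buffer<float, 1> buf_b(b.data(), n);
        sycl::buffer<float, 1> buf_c(c.data(), n);
        q.submit([&](sycl::handler& cgh) {
            // Accessors declare how the kernel uses each buffer.
            auto in_a = buf_a.get_access<sycl::access::mode::read>(cgh);
            auto in_b = buf_b.get_access<sycl::access::mode::read>(cgh);
            auto out_c = buf_c.get_access<sycl::access::mode::write>(cgh);
            // One work-item per element.
            cgh.parallel_for<vector_add_kernel>(n, [=](sycl::id<1> i) {
                out_c[i] = in_a[i] + in_b[i];
            });
        });
    }  // buffers go out of scope here, so the result is copied back into c
#else
    // Host-only fallback when no SYCL header is available.
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
#endif
    return c;
}
```

The kernel lambda and the host code live in the same C++ source file, which is the shared source model described earlier; the element type `float` is checked by the compiler on both sides.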