CUDA Programming Interface

CUDA C++ provides a simple path for users familiar with the C++ programming language to easily write programs for execution by the device.

It consists of a minimal set of extensions to the C++ language and a runtime library.

The core language extensions have been introduced in Programming Model. They allow programmers to define a kernel as a C++ function and use some new syntax to specify the grid and block dimension each time the function is called. A complete description of all extensions can be found in C++ Language Extensions. Any source file that contains some of these extensions must be compiled with nvcc as outlined in Compilation with NVCC.
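As a minimal sketch of these extensions (the kernel name and sizes here are illustrative, not from the guide), a kernel is defined as a C++ function marked `__global__` and launched with the `<<<grid, block>>>` execution configuration:

```cuda
#include <cstdio>

// Kernel definition: an ordinary C++ function marked __global__.
// Each thread adds one element of the two input arrays.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the sketch short; cudaMalloc/cudaMemcpy also work.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // The execution configuration specifies grid and block dimensions
    // each time the kernel is called.
    int block = 256;
    int grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Such a file must be compiled with nvcc, since a host C++ compiler does not understand `__global__` or the `<<<...>>>` syntax.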

Seven parts are included.

  • Compilation with NVCC
  • CUDA Runtime
  • External Resource Interoperability
  • Versioning and Compatibility
  • Compute Modes
  • Mode Switches
  • Tesla Compute Cluster Mode for Windows

CUDA Programming Model

This chapter introduces the main concepts behind the CUDA programming model by outlining how they are exposed in C++. An extensive description of CUDA C++ is given in Programming Interface.
Full code for the vector addition example used in this chapter and the next can be found in the vectorAdd CUDA sample.
Five parts are included.

  • Kernels
  • Thread Hierarchy
  • Memory Hierarchy
  • Heterogeneous Programming
  • Compute Capability

CUDA Introduction

This series follows the CUDA C Programming Guide; this post covers chapter one.
Four parts are included.

  • From Graphics Processing to General-Purpose Parallel Computing
  • CUDA - A General-Purpose Parallel Computing Platform and Programming Model
  • A Scalable Programming Model
  • Document Structure

Calling the DNNL Library on a SYCL Device in Pure C++

Previously, the unit tests (UTs) for the DNNL acceleration library, on both CPU and GPU, verified result correctness through benchdnn, with the deep learning framework calling the corresponding DNNL APIs. However, the numerical correctness of some results cannot be fully reproduced with benchdnn alone, completely detached from the framework. Therefore, taking the forward and backward passes of the batch norm op as an example, this post writes the UT in pure C++.

CUDA Streams and Events

Generally speaking, parallelism in CUDA C shows up at two levels:

  • Kernel level
  • Grid level

So far we have been discussing the kernel level, that is, a single kernel or task executed in parallel by many threads on the GPU. The concept of a stream belongs to the latter: grid-level parallelism means multiple kernels executing concurrently on one device.
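Grid-level concurrency is expressed through streams; a minimal sketch (the kernel and sizes are illustrative), where kernels issued into different non-default streams may overlap on the device, and an event orders or waits on work within a stream:

```cuda
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));

    // Two independent streams: kernels in different streams may run concurrently.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    scale<<<(n + 255) / 256, 256, 0, s1>>>(a, n, 2.0f);
    scale<<<(n + 255) / 256, 256, 0, s2>>>(b, n, 3.0f);

    // An event records a point in s1; another stream could wait on it
    // with cudaStreamWaitEvent to order work across streams.
    cudaEvent_t done;
    cudaEventCreate(&done);
    cudaEventRecord(done, s1);
    cudaEventSynchronize(done);  // host blocks until s1 reaches the event

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaEventDestroy(done);
    return 0;
}
```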

How to Use TensorIterator

This example uses the existing linear interpolation (aka lerp) operator, but the same guidelines apply to other operators, both new and existing.

Because any change can significantly impact performance, we use the simplest possible benchmark to measure operator speed before making updates and to establish a baseline.

How should we evaluate Jittor (计图), the deep learning framework developed in-house at Tsinghua University?

On March 20, 2020, Tsinghua's self-developed deep learning framework was officially open-sourced. Produced by the graphics lab of Tsinghua University's Department of Computer Science, it is named Jittor (计图 in Chinese).
Jittor is a deep learning framework built entirely on just-in-time compilation, using novel meta-operators and a unified computation graph internally. Meta-operators are as easy to use as NumPy, and go beyond NumPy to enable more complex and more efficient operations. The unified computation graph combines many advantages of static and dynamic computation graphs, providing high-performance optimization while remaining easy to use. Deep learning models built on meta-operators can be automatically optimized by Jittor in real time and run on specified hardware such as CPUs and GPUs.
Official site: https://cg.cs.tsinghua.edu.cn/jittor/
GitHub: https://github.com/Jittor/jittor

Understanding Conda and Pip

Conda and pip are often considered nearly identical. Although some of the functionality of these two tools overlaps, they were designed for, and should be used for, different purposes. Pip is the Python Packaging Authority's recommended tool for installing packages from the Python Package Index (PyPI). Pip installs Python software packaged as wheels or source distributions. The latter may require that the system have compatible compilers, and possibly libraries, installed for the pip invocation to succeed.

Application Scenarios of AI+

If you ask what is hottest right now, it is undoubtedly AI. AI has already permeated every aspect of our lives. Beyond the well-known examples such as self-driving cars, image beautification, and chatbots, AI is applied in many other areas. Today I will talk about AI's current applications across major fields.