Intro to NVIDIA CUDA GPU programming

Learning NVIDIA CUDA programming

Description

GPUs are great at processing intensive 3D graphics. With hardware capable of such intensive computation, a natural question arises: what else could it calculate? Could a GPU serve as a general-purpose computation device? To that end, we would need an extended GPU development platform.

Traditionally, GPUs accepted data one way only: data went in, and graphics came out as an image on the screen. You could not copy data back from the GPU. For strictly graphics workloads, reading results back was of little interest, so GPUs were not designed with that capability. Being able to read data back makes the GPU attractive for new, alternative purposes, and with CUDA you can do just that.

GPUs perform highly intensive computations on large data sets. They carry out work on every pixel of an image and repeat those computations at video frame rates. They complete such intensive workloads fast enough to play video games by processing in parallel.

While CPUs are often equipped with multiple cores, even dozens, GPUs are equipped with thousands. With that massive parallel processing power, the GPU has overshadowed the CPU in the microchip market. Parallel programming has been available for CPUs for many years through fork() or the pthreads library, but it is complicated and challenging. Parallel programming for CUDA-enabled GPUs is comparatively simple.

NVIDIA has brought about an age of supercomputing. You can expect major developments in the world as we know it in the near future. It is not just a branding gimmick, if you assumed as much; NVIDIA has genuinely raised the bar in the electronic computation field. NVIDIA's GPUs have ushered in a new wave of technological development, namely AI, machine learning, and big data.

Audience

You own an NVIDIA card that supports CUDA and have basic C programming skills.

Debian Linux Installation

Install the driver and the software tools. NVIDIA provides its own compiler: just as you would compile C using GCC or another compiler, you do the same using NVIDIA's compiler, NVCC. Debian offers a helpful tool for detecting which NVIDIA driver to install.

# apt install nvidia-detect

$ nvidia-detect

Checking card: NVIDIA Corporation GP108 [GeForce GT 1030] (rev a1)

Any driver will work with this card. It is recommended to install the nvidia-driver package.

There is a proprietary firmware package available for Debian in the repos, but in my case it did not work.

# apt install firmware-nvidia-graphics

A new kernel was installed, so you will need to reboot:

# reboot

It failed to boot into X.org.

There is also a driver available directly from NVIDIA's website that may be worth a try. Here, we instead installed the nvidia-driver package recommended by nvidia-detect.

# apt install nvidia-driver

A new kernel was installed, so you will need to reboot:

# reboot

With the driver installed and the system rebooted, you may be interested in viewing the settings:

$ nvidia-settings

Next we need the software development tools.

# apt install nvidia-cuda-toolkit

Getting started with coding

NVIDIA has published an intro blog article on their website: developer.nvidia.com/blog/even-easier-introduction-cuda

I got started using that article, and this work is based on it.

Intro to CUDA

So you have your big data set. For example, you have a huge array of x-axis values and a huge array of y-axis values, and you want to add them together, element by element.

#include <stdlib.h>

void addArray( int lim, float xaxis[], float yaxis[], float sumxy[] ) {
    for ( int idx = 0; idx < lim; idx++ ) {
        sumxy[idx] = xaxis[idx] + yaxis[idx];
    }
}

int main( void ) {
    int lim = 1000000; // 1 million
    // arrays this large would overflow the stack, so allocate on the heap
    float *xaxis = malloc( lim * sizeof(float) );
    float *yaxis = malloc( lim * sizeof(float) );
    float *sumxy = malloc( lim * sizeof(float) );

    // ... some code that fills in the data

    // now you want to add them up
    addArray( lim, xaxis, yaxis, sumxy );

    // you'll probably want to call this function many times

    free( xaxis );
    free( yaxis );
    free( sumxy );
    return 0;
}

To run addArray on the GPU using CUDA we just make a couple of changes:

1. Prefix the function declaration with the CUDA-specific "__global__" keyword.

2. Wrap the function invocation's launch configuration in CUDA's triple angle bracket syntax, <<<blocks, threads>>>, which takes the number of blocks and the number of threads per block.
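A minimal sketch of those two changes, applied to the addArray example above. The launch configuration of 1 block and 256 threads is an arbitrary illustration, not a recommendation:

```cuda
// CUDA kernel: the __global__ keyword marks a function that runs on the GPU
__global__ void addArray( int lim, float *xaxis, float *yaxis, float *sumxy ) {
    for ( int idx = 0; idx < lim; idx++ ) {
        sumxy[idx] = xaxis[idx] + yaxis[idx];
    }
}

// ... later, at the call site:
// launch the kernel on 1 block of 256 threads
addArray<<<1, 256>>>( lim, xaxis, yaxis, sumxy );
```

Note that as written every thread runs the same full loop; dividing the work between threads uses the threadIdx variable, which can wait until later.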

As with any GPU programming environment, we have to move data back and forth between CPU RAM and GPU RAM. CUDA provides a very simple way to do it with a single call. In C we would allocate memory using the [] array operator or the malloc() function provided by the stdlib.h header; CUDA instead provides the cudaMallocManaged() function, which allocates memory accessible from both the CPU and the GPU.
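A sketch of the allocation change, using the same array names as the example above:

```cuda
float *xaxis, *yaxis, *sumxy;

// allocate unified (managed) memory, visible to both the CPU and the GPU
cudaMallocManaged( &xaxis, lim * sizeof(float) );
cudaMallocManaged( &yaxis, lim * sizeof(float) );
cudaMallocManaged( &sumxy, lim * sizeof(float) );

// ... use the arrays ...

// release with cudaFree() rather than free()
cudaFree( xaxis );
cudaFree( yaxis );
cudaFree( sumxy );
```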

When programming the GPU, even on a single core or thread, we are still programming in a multi-processor environment, because we have at least two processors: the CPU and the GPU. As with any multi-processing programming, we face race condition challenges. The primary concern is ensuring the main() program does not read the results, or exit, before the GPU has finished. CUDA provides cudaDeviceSynchronize(): it waits for the GPU to finish processing, after which the managed memory is safe to access from the CPU again. Call it after your GPU kernel invocation.
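Putting the pieces together, a minimal main() might look like the following sketch. The fill values are arbitrary illustrations:

```cuda
int main( void ) {
    int lim = 1000000;
    float *xaxis, *yaxis, *sumxy;

    // unified memory, usable by both CPU and GPU
    cudaMallocManaged( &xaxis, lim * sizeof(float) );
    cudaMallocManaged( &yaxis, lim * sizeof(float) );
    cudaMallocManaged( &sumxy, lim * sizeof(float) );

    for ( int i = 0; i < lim; i++ ) { // fill in some data on the CPU
        xaxis[i] = 1.0f;
        yaxis[i] = 2.0f;
    }

    addArray<<<1, 256>>>( lim, xaxis, yaxis, sumxy ); // run on the GPU

    cudaDeviceSynchronize(); // wait for the GPU before touching the results

    // sumxy[] is now safe to read from the CPU

    cudaFree( xaxis );
    cudaFree( yaxis );
    cudaFree( sumxy );
    return 0;
}
```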

There are a couple of other simple differences after that:

1. Change the file name extension from ".c" to ".cu".

2. Compile it using nvcc instead of gcc.

That is all you need to get started. You can set the rest aside for now, such as the thread, block, and grid automatic variables. Run it on a single thread, then 2, 10, 100, 1000, 5000. Enjoy!

The program listing is available on my GitHub, linked at the bottom of the page.

Demo

youtu.be/z_vSwb7Bx24

Final Notes - Performance

You can see that in my case, 3000 threads was the optimal number. Also, for smaller data sets and repeated execution loop counts, the CPU is faster than the GPU; but with larger data sizes and numbers of invocations, the GPU far outperforms the CPU, demonstrating its tremendous power.

We harnessed threads on a single block and grid. There is much more power available than that: you can harness multiple blocks of threads, and multiple blocks on a grid. To take advantage of blocks and grids you will need to learn about the threadIdx, blockIdx, and blockDim kernel variables, which let each thread work out which elements of the data it should process. In my case they did not work; I suspect the driver I installed does not fully support CUDA. Plain threads worked fine. Note that without threadIdx.x, every thread in the block runs the same loop over the whole array, so the result is still correct, but the threads duplicate each other's work instead of dividing it.
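For reference, the pattern from the NVIDIA blog article uses those variables so each thread handles a distinct slice of the array. A sketch, using the same addArray names as above:

```cuda
__global__ void addArray( int lim, float *xaxis, float *yaxis, float *sumxy ) {
    // each thread computes its own starting index and stride
    int start  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    // grid-stride loop: threads divide the work instead of duplicating it
    for ( int idx = start; idx < lim; idx += stride ) {
        sumxy[idx] = xaxis[idx] + yaxis[idx];
    }
}

// launch with enough blocks of 256 threads to cover the array:
// int blocks = ( lim + 255 ) / 256;
// addArray<<<blocks, 256>>>( lim, xaxis, yaxis, sumxy );
```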