Results Over Coffee: 2011

March 4, 2011

Snake Oil 3: Decoupling Array and Thread Sizes, With Bonus Puppy

Prior to my unexplained absence from blogging, our last CUDA example demonstrated some basic decoupling of thread and array sizes. Specifically, our kernel checked to see if it had a thread index greater than the length of our input data and if so did not do any computation. That approach works fine if you have more threads than data elements in your array, but doesn't cover the more likely case of having more input data than threads. We'll fix that today.

As usual, first the code:

/*
 * snake_oil3.cu
 *
 * CUDA program to do bad encryption
 * 
 * compile with:  nvcc -o snake_oil3 snake_oil3.cu
 * 
 * resultsovercoffee@gmail.com
 */

#include <iostream>
#include <cstring>
#include <new>
#include <cstdlib>

using namespace std;

// function prototype for our CUDA kernel
__global__ void badcrypto(char *, char *, char *, int, int);


int main(int argc,char *argv[]) {
    //exit if we are not called correctly
    if (argc != 2) {
        cerr << "Usage: "<< argv[0] << " password" << endl;
        exit(EXIT_FAILURE);
    }

    //declare and allocate input and output arrays on host
    const int max = 128*1024*1024;  //128MB
    char *h_in = new (nothrow)char[max];
    if (h_in == 0) {
        cerr << "host allocation failed" << endl;
        exit(EXIT_FAILURE);
    }
    char *h_out = new (nothrow)char[max];
    if (h_out == 0) {
        cerr << "host allocation failed" << endl;
        exit(EXIT_FAILURE);
    }

    //read cleartext from stdin
    cin.read(h_in,max);
    int size = cin.gcount();
    
    //declare, allocate, and copy the cleartext to the device
    char *d_in;
    if (cudaMalloc(&d_in,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemcpy(d_in,h_in,size,cudaMemcpyHostToDevice)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    //declare, allocate, and copy the key to the device
    size_t keysize = strlen(argv[1]);
    char *d_key;
    if (cudaMalloc(&d_key,keysize) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemcpy(d_key,argv[1],keysize,cudaMemcpyHostToDevice)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }

    //declare, allocate and zero the output array on device
    char *d_out;
    if (cudaMalloc(&d_out,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemset(d_out,0,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }

        
    //launch kernel on device to generate results
    dim3 grid(1,1);
    dim3 block(256);    
    badcrypto<<<grid,block>>>(d_in,d_out,d_key,size,keysize);


    //copy the device buffer back to the host
    if (cudaMemcpy(h_out,d_out,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }

    //write ciphertext to stdout
    cout.write(h_out,size);

    //deallocate arrays
    cudaFree(d_in);
    cudaFree(d_key);
    cudaFree(d_out);
    delete[] h_in;
    delete[] h_out;

    //bye
    exit(EXIT_SUCCESS);
}


/*
 * badcrypto
 *
 * kernel that "encrypts" data using xor.  Don't do this.
 */
__global__ void badcrypto(char *in, char *out, char *key,
                          int in_length, int key_length) {
    int index = threadIdx.x;
    while (index < in_length) {
        out[index] = in[index] ^ key[index % key_length];
        index += blockDim.x;
    }
}

I've done some minor restructuring of the previous example code. The allocation of input and output arrays on the host are now done together, and we've increased the maximum size to 128MB. We've also made this a fully functional example -- the password is supplied as an argument on the command line, and the ciphertext is now written to standard output rather than printed as integers. As a result, the program can now be used with the standard shell redirection operators like this:

./snake_oil3 password < input.txt > output.txt

The key difference is how we have re-written the kernel. Rather than just testing to see if our current threadIdx.x is too large, we are using a while loop that tests to see if our index is less than the length of the input array. At the end of each loop we increment our index by blockDim.x. As a result, our threads will "walk" though the array until all the input data has been encrypted. So even though we a launching a fixed number of threads (256) our input array can be many times larger (up to 128MB in my arbitrary example here).

To prove this, I downloaded a copy of War and Peace from Project Gutenberg. The UTF-8 version weighs in at a little over 3MB:

$ wc war_and_peace.txt
65335 565450 3288707 war_and_peace.txt

Using our new version of snake_oil, I can "encrypt" this file and put the output in the file ciphertext.xor.

$./snake_oil3 secretpassword < war_and_peace.txt > ciphertext.xor

You can examine the ciphertext.xor file to verify that it is "encrypted" or at least different from the input text. When you're ready, you can then decrypt it by feeding it back into snake_oil using the same password that was used to "encrypt" it ("secretpassword" in this example):

$./snake_oil3 secretpassword < ciphertext.xor > cleartext.txt

As a final check, we can verify that the decrypted text matches the original exactly:

$ md5sum war_and_peace.txt cleartext.txt
23755b6d1871a58160c36485455fa6fd war_and_peace.txt
23755b6d1871a58160c36485455fa6fd cleartext.txt

So, we've now written a fully-functional program that does very insecure encryption using CUDA-capable GPUs. In the next few weeks we'll look at ways to make this program more efficient -- we have a long way to go.

Finally, a note about my absence: new puppy. Photo is below.

February 18, 2011

Snake Oil 2: Decoupling Threads and Array Size

Jumping off from last week's snake oil encryption example, this week we will focus on making our example CUDA C program a little more user-friendly. We will also demonstrate one method for decoupling the number of threads from the amount of work. As usual, first the code:

/*
 * snake_oil2.cu
 *
 * CUDA program to do bad encryption
 * 
 * compile with:  nvcc -o snake_oil2 snake_oil2.cu
 * 
 * resultsovercoffee@gmail.com
 */

#include <iostream>
#include <string>
#include <new>
#include <cstdlib>

using namespace std;

// function prototype for our CUDA kernel
__global__ void badcrypto(char *, char *, char *, int, int);


int main(int argc, char *argv[]) {
    string key("password");
    int keysize = key.size()+ 1;

    //read from stdin to a buffer (max of 512 bytes)
    const int max = 512;
    char *h_in = new (nothrow)char[max];
    if (h_in == 0) {
        cerr << "host allocation failed" << endl;
        exit(EXIT_FAILURE);
    }   
    cin.read(h_in,max);
    int size = cin.gcount();
    
    //declare, allocate, and copy the input to the device
    char *d_in;
    if (cudaMalloc(&d_in,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemcpy(d_in,h_in,size,cudaMemcpyHostToDevice)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    //declare, allocate, and copy the key to the device 
    char *d_key;
    if (cudaMalloc(&d_key,keysize) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemcpy(d_key,key.c_str(),keysize,cudaMemcpyHostToDevice)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    // declare, allocate and zero output array on device
    char *d_out;
    if (cudaMalloc(&d_out,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemset(d_out,0,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    //launch kernel on device to generate results
    dim3 grid(1,1);
    dim3 block(512);    
    badcrypto<<<grid,block>>>(d_in,d_out,d_key,size,keysize);

    //declare and allocate an output array on the host
    char *h_out = new (nothrow)char[size];
    if (h_out == 0) {
        cerr << "host allocation failed" << endl;
        exit(EXIT_FAILURE);
    }   

    //copy the device buffer back to the host
    if (cudaMemcpy(h_out,d_out,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }

    //print results
    for (int i=0;i<size;i++) {
        cout << (int)h_out[i] << " " ;
    }
    cout << endl;

    //deallocate arrays
    cudaFree(d_in);
    cudaFree(d_key);
    cudaFree(d_out);
    delete[] h_in;
    delete[] h_out;

    //bye
    exit(EXIT_SUCCESS);
}


/*
 * badcrypto
 *
 * kernel that "encrypts" data using xor.  Don't do this.
 */
__global__ void badcrypto(char *in, char *out, char *key,
                          int in_length, int key_length) {
    if (threadIdx.x < in_length)
        out[threadIdx.x] = in[threadIdx.x] ^ key[threadIdx.x % key_length];
}

First let's cover the minor modifications. Rather than using fixed input, we are now reading the cleartext from stdin to a host side buffer. For simplicity, we've limited the buffer to 512 bytes for the time being. After reading in the file, we set size to the number of bytes read. This modification will allow us to use snake_oil2 on the command line like this:

$ echo -n cleartxt | ./snake_oil2

But note that the output is still printed as integers for now.

The other modifications are in the CUDA C kernel. Now that we are accepting input from stdin, we can no longer assume that the cleartext and the key are the same length. To handle this, we are passing in the length of the key, and using the modulus operator (%) to find the index into the key that is appropriate for the current position in the cleartext. The result is that our cleartext and key can be different lengths -- the key is simply repeated encrypt the cleartext.

The more important change is how we have dealt with the variable length of the cleartext. Rather than launching one CUDA thread per input character, we are launching a fixed number of CUDA threads (512), but predicating the encryption operation. We pass in the length of the input cleartext, and each thread checks if it has a thread index less than that length. Threads with an index that would fall beyond the end of the cleartext fail the if statement, and thus do not perform any operation. This decouples the thread count from the array size.

The decoupling is possible because, unlike SIMD (single instruction, multiple data) vector paradigms, CUDA C is SIMT (single instruction, multiple threads). These threads have some autonomy and (within some practical limits) can support diverging code paths. In this case, we have predicated the encryption operation to make the threadblock behave like an unrolled while() loop.

Obviously, this example is still pretty unusable. We're limited to a maximum of 512 bytes of input. We also have many threads that are not contributing to the calculation, and we're only using 32 CUDA cores. Unfortunately, my 20 minutes is up, so I'll have to address those issues next week.

February 16, 2011

CUDA_VISIBLE_DEVICES

On systems with more than one GPU, it's useful to be able to select which device(s) you want to use for running CUDA apps. The CUDA APIs will select a GPU as the default, so unless you specify differently, all your CUDA applications will run on the same GPU. Setting compute-exclusive mode doesn't change this behavior -- all the programs will still target the same default GPU, but at least the additional ones will fail quickly rather than consuming resources that might be required by the first program.

One solution is to use $CUDA_VISIBLE_DEVICES. The environment variable CUDA_VISIBLE_DEVICES lists which devices are visible as a comma-separated string. For example, I've equipped my desktop with two Tesla cards and a Quadro card. I can use the deviceQuery program from the CUDA SDK and a little grep magic to list them:

$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Tesla C1060"
Device 2: "Quadro FX 3800"

By setting the envar, I can make only a subset of them visible to the runtime:

$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./deviceQuery -noprompt | egrep "^Device"
Device 0: "Tesla C2050"
Device 1: "Quadro FX 3800"

Note that I didn't change the deviceQuery options at all, just the setting of $CUDA_VISIBLE_DEVICES. Also note that the GPUs are still enumerated sequentially from zero, but only the cards listed in the visible devices envar are exposed to the CUDA app.

This is useful in a couple situations. The first is the example I used above. My development workstation often has a mix of CUDA-capable devices in it. I generally need to target a specific model of card for testing, and the easiest way to do so is $CUDA_VISIBLE_DEVICES (especially with bash where it can be set per-invocation by prefixing the command).

The other case is clusters where nodes might be time-shared (multiple jobs running on the same node at the same time) but multiple GPUs on those nodes should be space-shared (one job per GPU). By setting $CUDA_VISIBLE_DEVICES in the prologue script, the batch system can route jobs to the right GPU without requiring the user to set additional command-line flags or configuration files. Of course, that requires the batch scheduler to support treating GPUs as independent consumables, but several common batch scheduling and resource management systems have already added that capability.

CUDA_VISIBLE_DEVICES is a simple feature, but it's a good technique to have in your pocket if you work on multiple GPU systems.

February 14, 2011

More Shameless Self-Aggrandizement

Quick note that a certain journal article has been covered by both International Science Grid This Week and HPCwire.

The HPCwire article is especially interesting as it calls us out for asking the tough questions. So be it -- HPC requires a balanced environment for productive science. You can't print a hundred terabyte dataset, nor can you build a hundred trillion pixel power wall. So if you need to build a machine that generates large datasets, you also need to plan for a post-processing environment capable of extracting the science. The exact mix of hardware, software, and skinny guys will depend on the site, but it's not something that good HPC sites ignore.

And yes, I realize that those questions are substantially less tough for me at a dot com, as compared to the co-authors who are at various dot edu and dot gov sites. Of course, that mix of employers also indicates that a lot of sites are thinking about this and "doing it right".

February 11, 2011

CUDA Project: Snake Oil Encryption

Now that we have a few simple CUDA programs under our belt, I'd like to tackle an example project for a few weeks. Ideally this is something that can scale to lots of cores, though I think it will be OK if it lacks the arithmetic intensity that most HPC apps have. I also want to do something a little silly....

I've been a fan of Bruce Schneier for well over a decade. That includes following his blog (and before that the Crypto-Gram newsletter), as well as reading many of his books. One theme that he has returned to again and again is that there are a lot of bad security products. In particular, one segment of Applied Cryptography always stands out in my mind:

The simple-XOR algorithm is really an embarrassment; it's nothing more than a Vigenère polyalphabetic cipher. It's here only because of its prevalence in commercial software packages.

-- Applied Cryptography, p14

And now we have our first CUDA project! Over the next few weeks, we'll use the massively parallel single-instruction multiple-thread streaming multiprocessors in the NVIDA GPU to implement one of the worst encryption algorithms since ROT13. With luck, this will provide the world with a fresh supply of snake oil products for Bruce to induct into the doghouse. I'm excited.

As usual, let's start with the code:

/*
 * snake_oil1.cu
 *
 * CUDA program to do bad encryption
 * 
 * compile with:  nvcc -o snake_oil1 snake_oil1.cu
 * 
 * resultsovercoffee@gmail.com
 */

#include <iostream>
#include <string>
#include <new>
#include <cstdlib>

using namespace std;

// function prototype for our CUDA kernel
__global__ void badcrypto(char *, char *, char *);


int main(int argc, char *argv[]) {
    string in("cleartxt");
    string key("password");
    int size = key.size();
    dim3 grid(1,1);
    dim3 block(size);

    //declare, allocate, and populate input & key on the device
    char *d_in;
    if (cudaMalloc(&d_in,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemcpy(d_in,in.c_str(),size,cudaMemcpyHostToDevice)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    char *d_key;
    if (cudaMalloc(&d_key,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemcpy(d_key,key.c_str(),size,cudaMemcpyHostToDevice)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    // declare, allocate and zero output array on device
    char *d_out;
    if (cudaMalloc(&d_out,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    if (cudaMemset(d_out, 0 ,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }
    
    //launch kernel on device to generate results
    badcrypto<<<grid,block>>>(d_in, d_out, d_key);

    //declare and allocate an output array on the host
    char *h_out = new (nothrow)char[size];
    if (h_out == 0) {
        cerr << "host allocation failed" << endl;
        exit(EXIT_FAILURE);
    }   

    //copy the device buffer back to the host
    if (cudaMemcpy(h_out,d_out,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(EXIT_FAILURE);
    }

    //print results
    for (int i=0;i<size;i++) {
        cout << (int)h_out[i] << " " ;
    }
    cout << endl;

    //deallocate arrays
    cudaFree(d_in);
    cudaFree(d_key);
    cudaFree(d_out);
    delete[] h_out;

    //bye
    exit(EXIT_SUCCESS);
}


/*
 * badcrypto
 *
 * kernel that "encrypts" data using xor.  Don't do this.
 */
__global__ void badcrypto(char *in, char *out, char *key) {
    out[threadIdx.x] = in[threadIdx.x] ^ key[threadIdx.x];
}

Wow, so much fail there. This should look vaguely like last week's exercise. We're defining two strings (the cleartex input in and the key) copying them to the device, doing a bitwise XOR, then copying the results back. Currently, this isn't going to be very useful to purveyors of bad software -- the strings are statically defined and the kernel is launched as a single thread block (limiting us to 512 threads on some GPUs). Additionally, the cleartext and key need to be the same length -- that's clearly not acceptable since without the ability to encrypt large amounts of data with a short key an ISV might accidentally implement a one-time-pad.

That's enough of a start. My plan for next few weeks is to first turn this into a usable app, then do some light optimization to better map it to GPU hardware. Given the lightweight nature of bitwise XOR, we probably won't see cosmic gains in performance, but it should be fun anyway.

February 9, 2011

Errata and What's in a Name

A little behind the scenes at ROC....

I really do write these entries over coffee. My self-imposed time limit for blogging is about 20 minutes, so I can usually queue up a couple short articles or part of a code example before I need to go back to earning a living. The short time limit insures that I can write throughout the week, but also means splitting the longer posts and code examples across writing sessions. I think this works OK for the topics I'm posting on -- with programming there's not any urgency to be current like a political blog would face (poor Andrew Sullivan).

As a result of the 20-and-out policy, my (lack of) proofreading may allow errors to slip through. For minor typos and thinkos, I will just make corrections to the published articles silently. If I need to make a major change to a code example, I'll make a note of it on the post for those who have already cut-and-pasted the broken version. If I somehow manage to post a non-working code example or tip, I will self-flagellate with a length of cat5 cable.

While I'm on this topic, the name Results Over Coffee was inspired by a giveaway. The 2010 GPU Technology Conference gave away coffee mugs that featured the phrase "Results over coffee, not overnite". Unlike most conference chotchkies, the GTC mugs are pretty nice (insulated ceramic), and I often use mine when working in my home office. I happened to be drinking from it on the same day I was figuring out what to call the blog I'd be writing 20min at a time. The coincidence seemed too good to ignore, so I borrowed part of the phrase for a blog title.

A trip to register.com later, here we are. Still on my things to do list is blogging some coffee reviews. As a preview, I'm doing this entry over a cup of Barefoot Coffee's Michicoy, which is absolutely wonderful out of a press pot. Life is good.

February 7, 2011

To the Cloud or Something

I cover both high performance computing and cloud computing for NVIDIA. I have to confess that, like most HPC people, I hated "cloud" as an adjective. It seemed to be a pointless renaming of previously existing technologies. I now use it several times a day.

In the absolute sense, there are some pretty comprehensive definitions of cloud computing. The various types of cloud computing are well-specified, as are the mix of technologies required. That combination of technologies differs from other models. Operating a service in Amazon's EC2 is significantly different that standing up an IMAP server at a small campus, even if both ultimately involve computers and networks.

It also means something in the relative sense. CTO's and CIO's understand what "moving a service to the cloud" refers to, even if they can't repeat the NIST or wikipedia definitions from memory. This is similar to "high performance computing", which means something specific to most of us in HPC, even if looks like "servers connected to a network" to non-HPC folks.

Now that that is out of the way, why not try it? Amazon's Elastic Compute Cloud is available to anyone with an Internet connection and a credit card. I was heavily involved in getting EC2 enabled with GPUs, and the resulting service works smoother than I could have imagined when the project began. It also provides nearly bare-metal performance.

I gave a talk on getting started with EC2 at Supercomputing 2010. The process is easy enough that anyone with 30min and the previously mentioned Internet connection and credit card should be able to do it. If you're too impatient to sit though the slide deck and/or video, the summary is:

Create an AWS Account at http://aws.amazon.com/
Sign into the EC2 console at http://aws.amazon.com/console
Create a Key Pair for ssh
Add SSH to a Security Group to open the firewall
Launch (instance) a GPU-ready image using cg1.4xlarge
Enjoy root access to a 1TF system

The final point is not a typo. The Amazon GPU offering is a system with two Tesla m2050 GPUs, each of which provides 515 gigaflops of double-precision computing performance. Currently, Amazon is charging $2.10/hour for a system that would have easily placed in the top 50 of the world's computer systems ten years ago. It's really an amazing amount of double-precision computing power, and I've already seen some interesting uses.

It's also a good technique to have in your back pocket if you're reading this blog to follow the CUDA programming examples. My test machines generally have one or two Tesla cards, which have many features that the GeForce consumer cards lack (higher double-precision floating point performance, dual DMA engines, ECC, more memory, etc). My examples will usually work on GeForce, but EC2 provides a way to get access to higher performance Tesla devices when it is required.

And you don't have to call it cloud computing if you don't want to.

February 4, 2011

CUDA Blocks and Grids

In our previous CUDA example, we wrote a kernel that did some math based on thread number. I was specifically vague on the details of the triple-chevron launch syntax, with a promise to cover it next time. Next time is today.

Recall that we launched our kernel with a line like this one:

kernel<<<1,n>>>(d_array);

where n was the number of threads that would be spawned on the device, along with a caution than values larger than 512 might cause problems. The good news is that we are certainly not limited to 512 threads, and the launch syntax has some features that make mapping threads to larger problems much easier. First the code:

/*
 * cuda_enum2.cu
 *
 * CUDA Kernel at enumerates by block and thread indices
 * 
 * compile with:  nvcc -o cuda_enum2 cuda_enum2.cu
 * 
 * resultsovercoffee@gmail.com
 */

#include <iostream>
#include <unistd.h>

using namespace std;

//function prototype for our CUDA kernel
__global__ void enumerate(int *);


int main(int argc, char *argv[]) {
    dim3 grid(5,5);
    dim3 block(3,3,3);
    int n = grid.x * grid.y * block.x * block.y * block.z;
    size_t size = n * sizeof(int);

    //declare, allocate, and zero an array on the device
    int *d_array;
    if (cudaMalloc(&d_array,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }
    if (cudaMemset(d_array,0,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //launch kernel on device to generate results
    enumerate<<<grid,block>>>(d_array);

    //declare and allocate an array on the host
    int *h_array = (int*)malloc(size);
    if (h_array == 0){
        cerr << "malloc failed" << endl;
        exit(1);
    }

    //copy the device buffer back to the host
    if (cudaMemcpy(h_array,d_array,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //print results
    for (int i=0;i<n;i++) {
        cout << h_array[i] << " ";
    }
    cout << endl;

    //deallocate arrays
    cudaFree(d_array);
    free(h_array);

    //bye
    exit(0);
}


/*
 * enumerate
 *
 * kernel that computes a unique thread number based on the grid and
 * block indices and stores it in the A[]
 */
__global__ void enumerate(int *A) {
    int value =
          blockIdx.y  * gridDim.x  * blockDim.z * blockDim.y * blockDim.x
        + blockIdx.x  * blockDim.z * blockDim.y * blockDim.x
        + threadIdx.z * blockDim.y * blockDim.x
        + threadIdx.y * blockDim.x
        + threadIdx.x;
    A[value] = value;
}

Again, this should look pretty similar to our previous CUDA example. We've renamed our kernel to something more descriptive and changed the launch syntax a little, we've also re-written the kernel.

First that launch syntax. Rather than a couple scalar numbers, we now are using vectors of type dim3. As you might suspect, the dim3 structure is a vector with three values. When we generate it using the constructor syntax, any values not specified are set to one. That means that "dim3 grid(5,5);" creates a vector with three vaules, (5,5,1).

Additionally, you can see that the launch syntax uses two arguments: blocks and grids. A thread block is a group of related threads that can support up to three dimensions. With Fermi, the maximum block size 1024 threads, and the maximum dimensions are 1024 x 1024 x 64. Previous GPUs have lower limits, so you might be restricted to only 512 threads and 512 x 512 x 64 if you have a GT200 based system. Threads in the same block are scheduled on the same streaming multiprocessors and can communicate with shared memory.

That many threads sounds like a lot, but isn't much for a GPU that has 448 cores. That's where grids come in. Multiple thread blocks can be launched from the same grid. With Fermi, grids are two-dimensional and can be up to 65535 x 65535. Since each grid square is a block of threads, the total number of threads can get large very quickly. Luckily GPUs have hardware schedulers and can easily handle many thousands of threads. That's important, because GPUs use threading to hide memory access latency rather that caching like a CPU does. So on GPU's you always want to launch more threads than cores.

In the code above, we are launching a 5 x 5 grid of 3 x 3 x 3 thread blocks. That's 675 threads being launched in parallel on the GPU. If you need a picuture:

The other thing we changed is the CUDA kernel itself. Our enumerate kernel calculates a thread number based on the block and grid dimensions and the thread's index. As a result the number is unique per thread, and we calculated it so that threads access the array in a coalesced fashion (neighboring threads in a block store to neighboring locations in the array).

My coffee is empty, so that's enough explanation for today.

February 2, 2011

I Hate RC Scripts

Having had roles at Unix/Linux sites for the better part of two decades, my experience is that users are amazingly proficient at catastrophically mangling their RC files (.profile, .chsrc, .login, .bashrc, and others). In fact, it's such a problem that I wrote an article on how to deploy environment modules to deal with it.

For sites that don't use modules, below are a pair of scripts that can be placed in /etc/profile.d on Linux. They will correctly configure environments for most shells so that CUDA C just works. They've been used at several sites under RHEL/CentOS, but probably work on other distros with little or no modifications.

# cuda.sh -- CUDA intialization script (sh/bash)
# resultsovercoffee@gmail.com

# edit this line and place script in /etc/profile.d
cudapath="/usr/local/cuda"

# set LD_LIBRARY_PATH
if  [[ ${LD_LIBRARY_PATH} != *${cudapath}/lib64* ]] 
then
  export LD_LIBRARY_PATH=${cudapath}/lib64:${cudapath}/lib:${LD_LIBRARY_PATH}
fi

# set PATH
if [[ ${PATH} != *${cudapath}/bin* ]]
then
  export PATH=${cudapath}/bin:${PATH}
fi

# cuda.csh -- CUDA intialization script (csh)
# resultsovercoffee@gmail.com

# edit this line and place script in /etc/profile.d
set cudapath="/usr/local/cuda"

# set LD_LIBRARY_PATH
if ( "${?LD_LIBRARY_PATH}" == "0" ) then
  setenv LD_LIBRARY_PATH ${cudapath}/lib64:${cudapath}/lib
else if ( "${LD_LIBRARY_PATH}"  !~ "*${cudapath}/lib64*" ) then
  setenv LD_LIBRARY_PATH ${cudapath}/lib64:${cudapath}/lib:${LD_LIBRARY_PATH}
endif

# set path
if ( "${path}" !~ "*${cudapath}/bin*" ) then
  set path = ( ${cudapath}/bin $path )
endif

You can use the time not spent on helpdesk calls to read a book on CUDA.

January 31, 2011

Local Boy Does Well

A quick note that the whitepaper version of an article I contributed to garnered mentions by Vizworld and InsideHPC last week.

Thanks to Wes for being leader and chief cat-herder -- getting all of us on a phone call is a herculean effort under normal circumstances, and it was even more challenging because I was in China working on Nebulae during the critical edits. It was an honor to be involved with a who's-who of HPC visualization.

January 30, 2011

Our First CUDA C Kernel

Our first CUDA program covered the basics of moving data to and from the GPU (aka the device). While a necessary skill, data movement by itself is not very compelling. We didn't use any of the GPU's processing power, and there are better ways to create a volatile ram disk.

For a second effort, lets do something simple: we'll demonstrate how to give each thread we launch on the GPU a unique ID and do some integer addition. First the code:

/*
 * cuda_enum.cu
 *
 * Simple CUDA Kernel that enumerates by thread index
 *
 * compile with:  nvcc -o cuda_enum cuda_enum.cu
 * 
 * resultsovercoffee@gmail.com
 */

#include <iostream>
#include <unistd.h>

using namespace std;

//function prototype for our CUDA kernel
__global__ void kernel(int *);


int main(int argc, char *argv[]) {
    int n = 16;
    size_t size = n * sizeof(int);

    //declare, allocate, and zero an array on the device
    int *d_array;
    if (cudaMalloc(&d_array,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }
    if (cudaMemset(d_array,0,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //launch kernel on device to generate results
    kernel<<<1,n>>>(d_array);

    //declare and allocate an array on the host
    int *h_array = (int*)malloc(size);
    if (h_array == 0){
        cerr << "malloc failed" << endl;
        exit(1);
    }

    //copy the device buffer back to the host
    if (cudaMemcpy(h_array,d_array,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //print results
    for (int i=0;i<n;i++) {
        cout << h_array[i] << endl;
    }

    //deallocate arrays
    cudaFree(d_array);
    free(h_array);

    //bye
    exit(0);
}


/*
 * kernel
 * 
 * CUDA kernel -- sets a value of A[] to the thread's X index + 1
 */
__global__ void kernel(int *A) {
    A[threadIdx.x] = 1 + threadIdx.x;
}

This should look a little similar to the previous example. We are still declaring and allocating memory on the device, but instead of character buffers, we are using arrays of integers. The variable n that is set in main is used to control the length of the arrays. Additionally, we are using a new host function cudaMemcpy() to set the device array to all zeros. This isn't strictly necessary for this example, but will be in the future.

The really new part is the kernel function. In CUDA C, we refer to the arithmetically dense function that runs on the device as a kernel, so our example kernel is eponymously named. The key part of the prototype (and later definition) is the __global__ type qualifier (note the double underscores prefix and postfix). __global__ functions are executed on the device, but must be called from the host and must return void.

Our kernel function is defined at the end of the code. It takes a pointer to an array A. The body of the function performs an addition to calculate A[threadIdx.x] + 1, then stores the result in A[threadIdx.x]. Notice that we didn't declare or set threadIdx.x -- it is a built-in variable in CUDA that is only valid when the code is executing on the device. In our example the first thread gets 0, the second gets 1, etc. So each thread gets a different index and thus operates on a different A. Because different threads can run on different CUDA cores, the calculations can occur in parallel on the GPU.

The final new thing is the launch of the kernel that occurs near the center of main(). Note that the syntax looks like a standard function call, but includes some additional arguments between triple chevron brackets, <<<1,n>>>. Without going into too much detail, the second argument is the number of threads our example will spawn on the device. Since we also use n to size our array, we will be spawning a thread for each value of A[]. Feel free to experiment with different values of n, but note that values greater than 512 may cause problems. More on that next time.

Congratulations, you've just written a program that performs parallel computation on the GPU using multiple threads executing on multiple cores. We'll come back to this example next week and dive into the triple chevron syntax in some more detail.

January 26, 2011

CUDA in Runlevel 3

Most Linux clusters operate at runlevel 3 instead of runlevel 5. The major difference is that runlevel 5 starts the Xdm graphical login system -- something that isn't very useful on cluster nodes which don't have a display attached. Configuring nodes for runlevel 3 eliminates the overhead of that unused X11 server, which is good for application performance. Unfortunately, the NVIDIA driver is usually loaded when X11 starts, so a node operating at runlevel 3 can't run CUDA applications.

The fix is using an init script to modprobe the NVIDIA driver and create the necessary /dev files. My script for doing that is below. In addition to loading the driver and creating the necessary device files, it also sets compute-exclusive mode on each card and loops nvidia-smi to keep a persistent connection for faster kernel launches.

#!/bin/bash
#
# /etc/init.d/cuda startup script for nvidia driver
# symlink from /etc/rc3.d/S80cuda in non xdm environments
#
# Creates devices, sets persistent and compute-exclusive mode
# Useful for compute nodes in runlevel 3 w/o X11 running
#
# chkconfig: 345 80 20

# Source function library
. /lib/lsb/init-functions

# Alias RHEL's success and failure functions
success() {
    log_success_msg $@
}
failure() {
    log_failure_msg $@
}

# Create /dev nodes
function createdevs() {
    # Count the number of NVIDIA controllers
    N=`/sbin/lspci -m | /bin/egrep -c '(3D|VGA).+controller.+nVidia'`

    # Create Devices, exit on failure
    while [ ${N} -gt 0 ] 
    do
      let N-=1
      /bin/mknod -m 666 /dev/nvidia${N} c 195 ${N} || exit $?
    done
    /bin/mknod -m 666 /dev/nvidiactl c 195 255 || exit $?
}

# Remove /dev nodes
function removedevs() {
    /bin/rm -f /dev/nvidia*
}

# Set compute-exclusive
function setcomputemode() {
    # Count the number of NVIDIA controllers
    N=`/sbin/lspci -m | /bin/egrep -c '(3D|VGA).+controller.+nVidia'`
    # Set Compute-exclustive mode, continue on failures
    while [ $N -gt 0 ]
    do
      let N-=1
      /usr/bin/nvidia-smi -c 1 -g ${N} > /dev/null
    done
}

# Start daemon
function start() {
   echo -n $"Loading nvidia kernel module: "
   /sbin/modprobe nvidia && success || { failure ; exit 1 ;}
   echo -n $"Creating CUDA /dev entries: "
   createdevs && success || { failure ; exit 1 ;}
   echo $"Setting CUDA compute-exclusive mode."
   setcomputemode
   echo $"Starting nvidia-smi for persistence."
   /usr/bin/nvidia-smi -l -i 60 > /dev/null &
}

# Stop daemon
function stop() {
   echo $"Killing nvidia-smi."
   /usr/bin/killall nvidia-smi
   echo -n $"Unloading nvidia kernel module: "
   sleep 1
   /sbin/rmmod -f nvidia && success || failure
   echo -n $"Removing CUDA /dev entries: "
   removedevs && success || failure
}

# See how we were called
case "$1" in
   start)
       start
      ;;
   stop)
       stop
      ;;
   restart)
       stop
       start
      ;;
   *)
       echo $"Usage: $0 {start|stop|restart}"
       exit 1
esac
exit 0

This script is fairly well-tested on RHEL/CentOS systems, but probably works on other distros with no or minor modifications.

January 24, 2011

Disruptive Technology

disrupt: to interrupt the normal course or unity of

- Merriam-Webster

With somewhat regular frequency, I get into discussions about whether GPUs are a "disruptive technology" in HPC (I hang out with nerds). Often this is in response to articles like this. As someone who now covers HPC for NVIDIA, but was working for LLNL through most of the last decade, I have some decidedly strong opinions on this topic.

The disruptive part is the upset in the top500 performance curve. If you plot the up-and-to-the-right performance of the top500 list, IBM BlueGene/L was above the curve as it existed at the time -- same outlay of cash/power/time, lots more performance. As the first massively parallel processing machine based on embedded processors, it was assumed that subsequent improvements in PPC (or other) embedded processors would provide a series of increasingly faster MPP systems. This has been mostly true, though the follow-on machines have not been large enough to capture the number 1 position.

Going back further in time, I don't think I ever heard the Earth Simulator referred to as disruptive technology. Though it captured the and held the "worlds fastest computer" title by a larger margin and for a longer time that BG/L would, most viewed it as the last gasp of vector technology. Distributed memory clusters like IBM SP had already won the mind-share at most sites, and though Earth Simulator was able to remain at #1 for over two years, it was pretty clear that it's dominance would come to an end. More vector machines were built, but HPC codes continued to move from vector to distributed memory.

Getting back to the question, with BlueGene/Q poised to re-take the top500 crown, are GPUs a temporary disturbance of embedded MPP march to dominance, or a permanent shift?

My answer is that disruption isn't a zero-sum game, and embedded MPP doesn't have to fail for GPUs to succeed. GPUs provide the "same slope, higher intercept" disruption of the top500 curve that embedded MPP did. Like embedded MPP, advances in the underlying technology will continue to provide GPU computing with performance improvements for years to come. And like embedded MPP, GPUs are leveraging a robust consumer market to achieve those advancements.

Where GPUs hold a significant advantage is barrier to entry. There are millions of CUDA-capable GPUs in systems today, and hundreds of universities teaching CUDA. Moreover, GPU clusters can be constructed entirely of common off-the-shelf hardware and software, putting the cost within reach of individual researchers and small teams. It's instructive to remember that Nebulae and Tianhe-1A were designed and built in months, while most of the embedded MPP designs have taken years to go from powerpoint to power-on.

There is room for both. In the upper range of the top500, embedded MPP will continue to leverage specialized networks and OS stacks to achieve performance through high node counts. At the same time, GPUs are already delivering systems with over 1TF of peak double-precision per node on everything from desktops to petaflop clusters. Both have been disruptive, and either could grow to dominate HPC in years to come. I've adjusted my career path accordingly for the next five years or so.

January 21, 2011

Our First CUDA C Program

Ok, time to come clean. One reason I started this blog was that I don't do enough recreational programming. You'd think that I'd get most of that out of my system with the amount of CUDA and Tesla related tasks that constitute my normal workday, but you'd be wrong. So, I'm going to use this blog to keep myself honest by posting regular CUDA C programs, tips, and code segments.

To make this a little more interesting, I'm going to assume that we're all starting with zero knowledge of CUDA C. So I'll begin with simple programs that illustrate the basics, and slowly work up to more complicated concepts and features. As I do so I'll try to throw in a little Socratic method and refine things based on the deficiencies of the early examples. Or you could buy a book.

For the first program, let's do something simple. The CUDA model exposes the GPU as a an accelerator or co-processor that has it's own memory space. So the first task a GPU programmer faces is learning how to allocate space on the device and move data back and forth between host memory and device memory. I'll show the full program first, and then break down the interesting bits.

/*
 * hello_cuda.cu
 *
 * Our first cuda program, resultsovercoffee@gmail.com
 *
 * compile with:  nvcc -o hello_cuda hello_cuda.cu
 */

using namespace std;

#include <iostream>
#include <string.h>
#include <unistd.h>

int main (int argc, const char *argv[]) {

    //our message
    const char *message = "hello world!";
    size_t size = strlen(message)+1;
     
    //delcare and allocate a buffer on the device
    char *d_buffer;
    if (cudaMalloc(&d_buffer,size) != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //copy our message to the device buffer
    if (cudaMemcpy(d_buffer,message,size,cudaMemcpyHostToDevice)
        != cudaSuccess){
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    //declare and allocate a buffer on the host
    char *h_buffer = (char*)malloc(size);
    if (h_buffer == 0){
        cerr << "malloc failed" << endl;
        exit(1);
    }
    
    //copy the device buffer back to the host
    if (cudaMemcpy(h_buffer,d_buffer,size,cudaMemcpyDeviceToHost)
        != cudaSuccess) {
        cerr << cudaGetErrorString(cudaGetLastError()) << endl;
        exit(1);
    }

    cout << h_buffer << endl;
    cudaFree(d_buffer);
    free(h_buffer);
}

The first thing many of you will notice is that I'm using C++ (or at least some C++). If nothing else the using namespace std; probably gives it away. In this case I stuck with C allocation and char* arrays rather than strings, but I will generally be using C++ features when they are available.

The first CUDAism is the cudaMalloc() call that used to allocate space on the device. The syntax is slightly different than regular host-side malloc(), in that cudaMalloc takes a pointer-to-a-pointer and changes the value directly, while regular malloc provides the address of the allocated memory in it the return code. As a result, the cudaMalloc call can use the return code to indicate success or failure. Notice that we check that by comparing it to the value cudaSuccess and if it fails we make a call to get the specific error and print it to standard error. [If you'd like to see this in action, try increasing size to a number larger than the amount of memory on your NVIDIA GPU.]

The other CUDA C function we are using is cudaMemcpy(). That function takes four arguements, a pointer to the destination, a pointer to the source, the amount of memory to copy, and a direction. In this example, we first copy our message to a buffer on the device, then we copy it back to a new buffer on the host. Also note that I'm prefixing my pointers with h and d to indicate wether they point to memory in the host or the device. This isn't required, but it helps prevent you from shooting yourself in the foot later.

Finally, we print the result to standard output so that we can visually see that it survived intact and deallocate that which we allocated. That's it. If you've followed this far, you've now got a CUDA C program under your belt.

January 17, 2011

First Post: 2010 in Review

My name is Dale Southard. I'm a senior solution architect with NVIDIA, where I cover primarily high-performance computing and cloud computing. Results Over Coffee (ROC) is my blog, and I'll be posting on HPC and Cloud as well as some how-to information on programming GPUs, integrating them in Linux clusters, and whatever else I can cover over a cup of coffee. When I'm posting here, I'm speaking for me, not NVIDIA (even if I am posting about NVIDIA hardware or software).

I've often compared my work at NVIDIA to being a professional debugger -- I spend a lot of time working with very new hardware and software in new and complicated deployments. It's a lot of work, a lot of fun, and I feel lucky to be doing it for NVIDIA. In 2010, my personal high points were:

Spent three weeks in China for debug/tuning the top500 runs on Nebulae
Gave a talk at GTC on the above experience
theRegister wrote a mostly-accurate article on above talk
Spent a lot of time getting GPUs working in the Amazon EC2 cloud
Spent a few days doing debug/tuning for the top500 runs on Keeneland
Gave a presentation on GPU Cloud Computing at SuperComputing
Had an interview with HPCintheCloud on GPUs and HPC (on cell phone, in traffic)
Co-wrote an article on HPC visualization for IEEE with a lot of smart people
And a bunch of other things that I can't talk about yet....

I have long viewed GPU Computing as a disruptive technology for high-performance computing and I am excited to have a ringside while it unfolds. ROC will be my place to share thoughts, tips, and hopefully have some fun.

So, welcome to ROC. There's more to come...