SpeedGo Computing: programming

Showing posts with label programming. Show all posts

Tuesday, May 3, 2011

Web Seminar: Programming GPUs Beyond CUDA

GPU/CUDA programming is easy if we ignore the performance, or even the correctness of the program. It becomes tough when the performance is critical, one has to optimize very hard on the specific hardware. Fortunately, GPU hardware performance improves drastically every 2 years. Unfortunately, the performance is not portable across different generations of GPUs.

Prof Chen from Tshing Hua University is proposing MapCG, a MapReduce framework as a resolution to the portability problem.

Check out the details of the seminar in the following link:

http://www.gpucomputing.net/?q=node/5277

Saturday, April 30, 2011

First Release of SGC Ruby CUDA - Beginning of a long way path

Today we decided to put up the first release of the SGC Ruby CUDA v0.1.0 as a mean to attract Rubyists to try out GPU programming as their new toy projects, and also to encourage HPC developers to evaluate if Ruby is good to use for their HPC applications.

When important software libraries are not available in Ruby, we certainly do not expect much Ruby usage in the area. As time is running short, more and more hardware is piling up underutilized, we are urged to take the first fundamental step moving Ruby programming towards HPC applications by making important SDK such as Nvidia CUDA SDK available to a Ruby program.

Rubyists who are new to GPU programming can now access CUDA GPUs easily to harness the massively parallel architecture of GPUs. On the other spectrum, HPC developers now have a choice to manage their complex applications by large portion in Ruby, while retaining only relatively small section of codes in C/C++/CUDA C etc.

We believe Ruby programming could improve productivity and maintainability tremendously since in many cases, heavy computation only happens in small section of codes, and Ruby programming simplify the software architecture and implementation significantly. Even when the performance is extremely critical that one must port everything back from Ruby to C/C++/CUDA C for highest performance, one has already saved tremendous effort in software architecture and design to achieve manageable design, extendable, ease of use, etc. The porting back to C/C++/CUDA C becomes much more straightforward as one has gained much knowledge about the domain.

Compared to developing a complex application from scratch in C/C++/CUDA C, one has to go through unforeseeable curvy path to achieve the same state which is bound to very high failure rate. Hence, we believe that this could set the start of Ruby programming towards HPC applications.

SGC Ruby CUDA has been updated significantly since the last post about it. As we have packaged it into a Ruby gem, you can now install it with

gem install sgc-ruby-cuda

The code repository is hosted at github, SGC Ruby CUDA.
The documentations are available at rubydoc.info, SGC Ruby CUDA Doc.
Feel free to join the discussion group/mailing list at SGC Ruby CUDA Google Group.

Sunday, September 26, 2010

Parallel programming knowledge is must-have skill for Wall Street

Friday, September 17, 2010

Unigine crew: CUDA vs OpenCL vs SPU Part IV

Which language or library you choose to use for your software development has great and prolong impact to the software. Just come across a simple yet interesting benchmark. Perhaps, more details on why such numbers are obtained would be even more enlightening.

Unigine crew: CUDA vs OpenCL vs SPU Part IV

CUDA Programming with Ruby

Need GPU computing power in your Ruby program? Great! SpeedGo Computing is developing Ruby bindings for CUDA, called sgc-ruby-cuda. Take advantage of your Nvidia CUDA-enabled graphics cards with Ruby now.

Currently, only part of the CUDA Driver API is included. More components such as the CUDA Runtime API will be included to make it as complete as possible.

CUDA Programming with Ruby


require 'rubycu'

include SGC::CU

SIZE = 10 
c = CUContext.new

d = CUDevice.get(0)   # Get the first device.
c.create(0, d)    # Use this device in this CUDA context.

m = CUModule.new
m.load("vadd.ptx")    # 'nvcc -ptx vadd.cu'
                      # vadd.cu is a CUDA kernel program.

da = CUDevicePtr.new    # Pointer to device memory.
db = CUDevicePtr.new
dc = CUDevicePtr.new

da.mem_alloc(4*SIZE)    # Each Int32 is 4 bytes.
db.mem_alloc(4*SIZE)    # Allocate device memory.
dc.mem_alloc(4*SIZE)

ha = Int32Buffer.new(SIZE)    # Allocate host memory.
hb = Int32Buffer.new(SIZE)
hc = Int32Buffer.new(SIZE)
hd = Int32Buffer.new(SIZE)

(0...SIZE).each { |i| ha[i] = i }
(0...SIZE).each { |i| hb[i] = 2 }
(0...SIZE).each { |i| hc[i] = ha[i] + hb[i] }
(0...SIZE).each { |i| hd[i] = 0 }

memcpy_htod(da, ha, 4*SIZE)  # Transfer inputs to device.
memcpy_htod(db, hb, 4*SIZE)

f = m.get_function("vadd");
f.set_param(da, db, dc, SIZE)
f.set_block_shape(SIZE)
f.launch_grid(1)  # Execute kernel program in the device.

memcpy_dtoh(hd, dc, 4*SIZE) # Transfer outputs to host.

puts "A\tB\tCPU\tGPU"
(0...SIZE).each { |i| 
    puts "#{ ha[i]}\t#{hb[i]}\t#{hc[i]}\t#{hd[i] }" 
}

da.mem_free    # Free device memory.
db.mem_free
dc.mem_free

c.detach    # Release context.


/* vadd.cu */
extern "C" {
    __global__ void vadd(const int* a,
                         const int* b,
                         int* c,
                         int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }
}

Although the kernel program still need to be written in CUDA C, this Ruby bindings have provided first bridging step towards Ruby GPU computing.

How to execute?


$ ruby extconf.rb
checking for main() in -lcuda... yes
creating Makefile
$ make
...
g++ -shared -o rubycu.so rubycu.o ...
$ nvcc -ptx vadd.cu
$ ruby -I . test.rb
A       B       CPU     GPU
0       2       2       2
1       2       3       3
2       2       4       4
3       2       5       5
4       2       6       6
5       2       7       7
6       2       8       8
7       2       9       9
8       2       10      10
9       2       11      11

Cool! The summation of two vectors is performed in the GPU.

See also:

Tuesday, August 17, 2010

Parallelizing Matrix Multiplication using MPI

MPI is a popular mechanism in high performance computing. It works for both cluster and shared memory environment. Why don't we simply use MPI when it works for both environments? Why do we care about OpenMP? Cilk++? etc. Perhaps that depends on the complexity of the applications you are dealing with.

Parallel Matrix Multiplication using MPI

/* matrix-mpi.cpp */
#include <mpi.h>

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

void multiply(int istart, int iend)
{
    for (int i = istart; i <= iend; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

int main(int argc, char* argv[])
{
    int rank, nproc;
    int istart, iend;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        // Initialize buffers.
        for (int i = 0; i < size; ++i) {
            for (int j = 0; j < size; ++j) {
                a[i][j] = (float)i + j;
                b[i][j] = (float)i - j;
                c[i][j] = 0.0f;
            }
        }
    }

    // Broadcast matrices to all workers.
    MPI_Bcast(a, size*size, MPI_FLOAT, 0,MPI_COMM_WORLD);
    MPI_Bcast(b, size*size, MPI_FLOAT, 0,MPI_COMM_WORLD);
    MPI_Bcast(c, size*size, MPI_FLOAT, 0,MPI_COMM_WORLD);

    // Partition work by i-for-loop.
    istart = (size / nproc) * rank;
    iend = (size / nproc) * (rank + 1) - 1;

    // Compute matrix multiplication in [istart,iend]
    // of i-for-loop.
    // C <- C + A x B
    multiply(istart, iend);

    // Gather computed results.
    MPI_Gather(c + (size/nproc*rank),
               size*size/nproc,
               MPI_FLOAT,
               c + (size/nproc*rank),
               size*size/nproc,
               MPI_FLOAT,
               0,
               MPI_COMM_WORLD);

    if (rank == 0) {
        // Compute remaining multiplications
        // when size % nproc > 0.
        if (size % nproc > 0) {
            multiply((size/nproc)*nproc, size-1);
        }
    }

    MPI_Finalize();
    return 0;
}


$ g++ -O2 matrix.cpp -o matrix
$ mpicxx -O2 matrix-mpi.cpp -o matrix-mpi
$ time ./matrix
real 0m13.226s
user 0m12.529s
sys 0m0.065s
$ time mpirun -np 2 ./matrix-mpi
real 0m8.490s
user 0m6.346s
sys 0m0.178s

Phew .... what a hassle ... you can see the needs to:

perform data transfer to workers manually
perform work partitioning manually
perform many index calculations
handle remaining work when the amount of work is not divisible by the number of workers.

Furthermore, this MPI version uses more memory than the shared memory counterparts. The MPI program is launched with multiple processes as multiple workers, hence the memory consumption also multiply up. More work would be required to minimize the total memory consumption.When you must work with cluster environment, perhaps you don't have many choices with the current state of art programming tools.

Sunday, August 15, 2010

Parallelizing Matrix Multiplication using TBB

Parallelizing matrix multiplication using TBB isn't too difficult. It's just a little more work than OpenMP or Cilk++.

Parallel Matrix Multiplication using TBB

/* matrix-tbb.cpp */
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

using namespace tbb;

const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];


class Multiply
{
public:
    void operator()(blocked_range<int> r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            for (int j = 0; j < size; ++j) {
                for (int k = 0; k < size; ++k) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
};


int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    parallel_for(blocked_range<int>(0,size), Multiply());

    return 0;
}

We've moved the computation of the matrix multiplication into the class Multiply which takes in the range of i-iterations to work on. The parallel_for internally split the range [0,size) into multiple blocks. Multiple workers can then work on different non-overlapping blocks in parallel.


$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 matrix-tbb.cpp -ltbb -o matrix-tbb
$ time ./matrix
real 0m12.971s
user 0m12.489s
sys 0m0.052s
$ time ./matrix-tbb
real 0m7.857s
user 0m12.734s
sys 0m0.282s

Once the computation is organized into functions which can dynamically work on different parts of the computation, it's relatively easy to proceed.

Parallelizing Matrix Multiplication using Cilk++ in Two Lines

Following the parallelization of matrix multiplication using OpenMP in Parallelizing Matrix Multiplication using OpenMP in One Line, can we do the same using Cilk++?

Parallel Matrix Multiplication using Cilk++

/* matrix.cilk */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int cilk_main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    cilk_for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}

The cilk_main() is the entry function synonymous to main() in a typical C++ program, but cilk_main() function uses Cilk++ linkage such that you can spawn and sync in the function. This update is trivial.Effectively, there is only one line of non-trivial update to parallelize the serial matrix multiplication.

$ g++ -O2 matrix.cpp -o matrix
$ cilk++ -O2 matrix.cilk -o matrix-cilk
$ time ./matrix
real 0m12.613s
user 0m12.392s
sys 0m0.030s
$ time ./matrix-cilk
real 0m6.775s
user 0m12.591s
sys 0m0.134s

The performance improvement is encouraging. You can try out with quad-core or hex-core if you have the machine.

Saturday, August 14, 2010

Parallelizing Matrix Multiplication using OpenMP in One Line

Matrix multiplication is often used for academic study. It's well suited for parallelization due to its intensive O(N^3) computation and independent computation. Parallel programming is hard. Does it surprise you if we parallelize matrix multiplication in merely one line of OpenMP directive?

Serial Matrix Multiplication

/* matrix.cpp */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}

Analyzing the data dependency of the operation c[i][j] += a[i][k] * b[k][j], we can see that each iteration i is independent of each other. The same applies to iteration j. For simplicity, we consider parallelizing only single for-loop.

Parallel Matrix Multiplication using OpenMP

/* matrix-omp.cpp */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    #pragma omp parallel for default(none) shared(a,b,c)
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}

With the OpenMP directive (#pragma), the i-for-loop is divided into multiple chunks, each chunk is assigned to a thread. Hence, multiple threads can compute assigned chunks in parallel. That's it.

$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 -fopenmp matrix-omp.cpp -o matrix-omp
$ time ./matrix
real 0m12.976s
user 0m12.460s
sys 0m0.059s
$ time ./matrix-omp
real 0m6.952s
user 0m12.556s
sys 0m0.050s

Bravo! The parallel version is about 1.86x faster than the serial version on a dual-core system. However, the code above is only for illustration purpose. There are still many optimization opportunities before it achieves commercial performance level.

Wednesday, August 11, 2010

Parallel Programming - Hello World

Many computer science/engineering students learn to write Hello World program at their first programming lecture. What's your first parallel program? What about Hello World program in OpenMP, MPI, Cilk++, TBB, Ruby thread, PThread?

Hello World in C

/* hello.c */
#include <stdio.h>

int main()
{
    printf("hello world\n");
    return 0;
}

$ gcc hello.c -o hello
$ ./hello
hello world

Hello World in OpenMP

/* hello-omp.c */
#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {   
        int n = omp_get_num_threads();
        int id = omp_get_thread_num();

        printf("hello world %d/%d\n", n, id);
    }   
    return 0;
}

$ gcc -fopenmp hello-omp.c -o hello-omp
$ ./hello-omp
hello world 1/2
hello world 0/2
$ ./hello-omp
hello world 0/2
hello world 1/2

Hello World in MPI

/* hello-mpi.c */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int n, id; 
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    printf("hello world %d/%d\n", id, n); 

    MPI_Finalize();
    return 0;
}

$ mpicc hello-mpi.c -o hello-mpi
$ mpirun -np 2 ./hello-mpi
hello world 0/2
hello world 1/2
$ mpirun -np 2 ./hello-mpi
hello world 1/2
hello world 0/2

Hello World in Cilk++

/* hello.cilk */
#include <stdio.h>

void hello()
{
    printf("hello world\n");
}

int cilk_main()
{
    cilk_spawn hello();
    cilk_spawn hello();
    cilk_sync;
    return 0;
}

$ cilk++ hello.cilk -o hello-cilk
$ ./hello-cilk
hello world
hello world

Hello World in TBB

/* hello-tbb.cpp */
#include <iostream>
#include <tbb/parallel_for.h>

using namespace tbb;

class Hello
{
public:
    void operator()(int x) const {
        std::cout << "Hello world\n";
    }
};

int main()
{
    // parallelizing:
    // for(int i = 0; i < 2; ++i) { ... }
    parallel_for(0, 2, 1, Hello());

    return 0;
}

$ g++ hello-tbb.cpp -ltbb -o hello-tbb
$ ./hello-tbb 
Hello world
Hello world

Hello World in Ruby Thread

/* hello.rb */
#!/usr/bin/ruby

def hello
    id = Thread.current.object_id
    puts "hello world #{id}"
end

t1 = Thread.new do
    hello
end
t2 = Thread.new do
    hello
end

t1.join
t2.join

$ ./hello.rb
hello world 69858970515620
hello world 69858970515540

Hello World in PThread

/* hello-pthread.c */
#include <stdio.h>
#include <pthread.h>

void* hello(void* arg)
{
    long id = (long)pthread_self();
    printf("hello world %ld\n", id);
    pthread_exit(NULL);
}

int main()
{
    pthread_t tid1, tid2;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr,
                                PTHREAD_CREATE_JOINABLE);

    pthread_create(&tid1, &attr, hello, NULL);
    pthread_create(&tid2, &attr, hello, NULL);

    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);

    pthread_attr_destroy(&attr);
    pthread_exit(NULL);

    return 0;
}

$ gcc hello-pthread.c -lpthread -o hello-pthread
$ ./hello-pthread 
hello world 139999688689424
hello world 139999680296720

Saturday, July 31, 2010

Parallel Programming - What Are The Options?

There are simply way too many parallel programming languages and libraries to keep track of. Many of them are no longer active in development, or difficult to get them working in decent operating systems. What are the practical options currently available for multi-core CPU or GPU?

OpenMP

Hardware: Shared memory multi-core CPU system.
Parallelization: Use directives e.g. #pragma omp parallel {} in C/C++/Fortran to parallelize loops or code regions.
Supported by decent compilers.
Non-supporting compilers ignore the directives and compile as serial program.
Very good for incremental parallelization.

Cilk++

Hardware: Shared memory multi-core CPU system.
Parallelization: Use new keywords in C++ namely cilk_spawn to invoke a Cilk linkage function asynchronously, cilk_sync to synchronize with locally spawned functions, cilk_for to parallelize a for-loop.
The Cilk++ runtime system takes care of the thread scheduling which ease nested parallelization tremendously and maintain certain level of efficiency.
Requires Cilk++ compiler and Cilk++ runtime system.
Very good for parallelizing dynamic codes with low overhead.

TBB

Hardware: Shared memory multi-core CPU system.
Parallelization: C++ function objects or C++0x lambda expressions as work units, parallelizing with template functions e.g. parallel_do, parallel_for, parallel_reduce, parallel_pipeline, etc. Concurrent storage classes e.g. concurrent_vector are also provided.
Portable to multiple platforms which have good C++ supports.
Uses C++ template and function object extensively. C++ beginners might have difficulty to read/write the codes.
Allow many customization options at task level which can be complicating and messy, but threads are abstracted, i.e. thread scheduling is taken care of.
Recommended only for heavy C++ users.

PThread or thread library built into languages

Hardware: Shared memory multi-core CPU system.
Parallelization: Provides a library of functions to create, destroy, synchronize threads.
Pthread is well supported on Unix/Linux systems, but Windows would require external library.
Low level and explicit manipulations of threads.
Not recommended for general parallel programming tasks.

OpenCL

Hardware: Shared memory multi-core CPU system or OpenCL supported GPU.
Parallelization: Provides a library of functions to massively execute a kernel function on a supported device.
Supported by ATI Stream SDK and Nvidia OpenCL SDK.
Requires OpenCL runtime support for the targeted devices.
Well suited for data parallel or streaming computation.
Not recommended for direct use for general parallel programming, use wrappers for OpenCL instead.

CUDA

Hardware: CUDA enabled Nvidia GPU.
Parallelization: Provides a kernel invocation method to massively execute a kernel function on a CUDA enabled Nvidia GPU. The invocation method requires CUDA compiler to parse its special syntax in the form kernel_method<<<grid_dim, block_dim,shared_mem_size,stream>>>.
Supported by Nvidia CUDA SDK.
Requires CUDA compiler and CUDA runtime system.
Well suited for data parallel or streaming computation.
The CUDA programming guide is well documented for the requirements to achieve good performance with CUDA enabled Nvidia GPU.
Recommended for gpu programming on Nvidia GPU.

Brook+

Hardware: Shared memory multi-core CPU system or CAL supported ATI GPU.
Parallelization: Allow specification of kernel function that accepts streams of data. A kernel function is invoked as per normal function. The specification of a kernel function requires Brook+ compiler to parse the syntax of the kernel function.
Supported by ATI CAL and x86 CPU backend.
Requires Brook+ compiler and Brook+ stream runtime system.
Well suited for data parallel computation.
AMD has been promoting the use of OpenCL for ATI GPU programming. Brook+ is open sourced, however, its development is no longer active.

MPI

Hardware: Shared memory multi-core CPU system or cluster of computers.
Parallelization: Provides a library of functions for message passing between processes i.e. point-to-point and collective communications.
Supported by third party library such as MPICH, OpenMPI, etc.
Requires communication runtime system.
Low level manipulations of buffers and process-process communications.
Very popular for programming HPC cluster, but not recommended for general parallel programming.

PVM

Hardware: Shared memory multi-core CPU system or distributed systems.
Parallelization: Provides a library of functions for message passing between tasks.
Supported by third party library such as Netlib PVM3.
Use standard network interface such as TCP/IP for higher interoperability over a distributed systems.
Low level manipulations of buffers and task-task communications.

Charm++

Hardware: Shared memory multi-core CPU system or distributed systems.
Parallelization: Object-oriented C++ working units where working units called chares may communicate with other chares using proxy objects.
Scheduling computations based on availability of data.
Requires Charm++ compiler and Charm++ runtime system.

uC++

Hardware: Shared memory multi-core CPU system.
Parallelization: Provides C++ coroutines for independent executions.
The runtime system performs scheduling of virtual processor using OS kernel threads.
Requires uC++ compiler and uC++ kernel.

Thursday, July 29, 2010

Who Is Responsible For The Programming Of Multi Core CPU And GPU?

Multi core CPU and GPU are now commodity products. But, where are the software that could take advantage of their parallel architecture? Who should be developing such software? The domain expert? HPC (high performance computing) sofware engineer? or parallel programming tools such as auto parallelizing compilers?

Domain experts typically do not wish to spend too much time on computing problems. They simply want to get their computation done and collect the results. Earning another degree on computing is not what they are after.
On the other hand, HPC software engineers focused very much only on computing problems, obtaining good performance with minimum domain knowledge. Learning curve for domain knowledge could be very steep for them.
The current state of art parallel programming tools are mostly parallel programming languages or libraries, performance profiling tools, code analyzer for parallelization guidelines, auto parallelizing compilers, and debugging tools.

Consequently, domain experts develop serial algorithm prototype and let HPC software engineers handle all the parallelization and performance issues. HPC software engineers then use the limited sets of parallel programming tools to parallelize and optimize the prototype.

There is serious flaw in this workflow. A parallel program needs to be designed since the beginning of the software development to be effective. When the HPC software engineers take over the prototype for optimization and parallelization, what they can do is too limited.

Without the domain knowledge, the HPC software engineers are essentially working as an auto parallelizing compiler. They analyze the code for real data dependency instead of analyzing the high-level algorithm structure, they optimize and parallelize the algorithms used in the prototype maintaining the same serial semantics instead of using or improving the best algorithms for the tasks. They are unable to perform intrusive algorithm or code modifications necessary for good performance.

See also Why Can't Compilers Auto Parallelize Serial Code Effectively?

Current state of art parallel programming tools are not able to fully automate the optimization and parallelization process. Relying purely on tools is out of the question to optimization and parallelization.

When a parallel program is not performing, who's fault is that?

Ideally domain experts and HPC software engineers should collaborate and recognize the issues from each other. The impact of domain knowledge onto the optimization and parallelization process should not be underestimated. The steep learning curve of the domain knowledge is probably unavoidable, while domain experts should be more computation oriented formulating their domain problems into computing problems.

Bioinformatics set a good example in formulating problems into computing problems, enabling many computer scientists to work on bioinformatics problems with minimum domain knowledge.

Thursday, July 22, 2010

Where Are All The Practical Parallel Algorithms and Libraries?

Multi-core CPU and GPU are everywhere nowadays from laptops to desktops to high-end computing clusters. Is your particular application running any faster? Nope. But generally you need parallel algorithms for an application to make full use of the multiple cores.

Perhaps you'll expect doing some searches on the web, research publications and academic books would provide you all the state of art parallel algorithms. Many of them are for academic & research purpose, many are written in distinguished programming languages, many focused on special prototype or no longer available hardware, many that you'll hardly need, ... many of them since parallel programming has been studied for decades.

What about practical parallel algorithms or library written in modern programming languages be it C/C++, Ruby, Python, Java, ... which can be incorporated easily into your own software development?

Unfortunately you don't find a lot of them. There are practical reasons for that.

1. Lack Of Widely Accepted Composable Parallel Programming Model
A parallel programming model is composable if the model allows a parallel algorithm to be used transparently within another parallel algorithm and maintain certain level of efficiency. This is essential for building up modular foundation for a parallel algorithm library.

The needs to manually synchronize multiple cores with certain forms of locks or atomic instructions break the transparent requirement. And it's difficult to achieve good efficiency when a parallel algorithm is nested within other parallel algorithms. Excessive synchronizations would incur too much overhead for parallel programming to be beneficial. With application specific knowledge, one may move synchronizations of a parallel algorithm to higher levels to improve efficiency.

Tricks are often used to maximize the efficiency trading off composability. Hence, a parallel programming model still needs to support non-composable mechanisms.

2. Requires Specific Memory Layout And Efficiency Matters
A parallel algorithm may be implemented assuming certain memory layout for its input and output data. Multiple cores would access the data in very specific ways or patterns.

When you obtain a parallel algorithm implementation from another, it's almost certain that the assumed memory layout is not compatible with your software interface. You may do data transformation to make use of the parallel algorithm transparently. This can be extremely expensive not only the in/out transformation itself would take up computation time, it will require additional memory to maintain the transformed structures.

When the degradation of the efficiency is not tolerable, you'll then implement your own parallel algorithm with the ideal memory layout for your application.

Hence, it's not uncommon to see many implementations for similar parallel algorithms around because we can't get them out of their existing environments without incurring much engineering efforts.

3. Requires Hardware Specific Optimization For Efficiency And Efficiency Matters
Programming on parallel clusters would worry about configurations such as the number of nodes, the number of processors per node, the number of cores per processor, and their inter-connectivity.

Programming on shared memory machines would worry about the number of processors, the number of cores per processor, the cache hierarchy, and the inter-core connectivity.

When you have a very high bandwidth inter-connectivity, you would perform data or task partitioning, result aggregations in a very specific way to take advantage of that. On the other hand, when the inter-connectivity has limited bandwidth, you would customize the parallel algorithm differently, or even use other less efficient parallel algorithms for the equivalent computation.

Shared memory programming would have very specific requirements on the memory layout and data access pattern for good efficiency on the underlying hardware. Different organizations of the processor cores, inter-core connections would require different sets of optimizations.

The great variety in the hardware organizations is the fundamental issue in supporting portable performance across multiple hardwares.

As performance matters, performance optimization and tuning is a crucial part of a parallel algorithm development. The portability is certainly limited. Fortunately, this may be rectified with auto-tuning mechanism.

4. Parallel Programming Was Expensive In The Past
Before the multi-core era, parallel programming is generally not affordable. Parallel programming has been very much restricted to research labs and very large corporations.

Research labs and large corporations typically deployed very large parallel systems for solving only very large problems. As shared memory machines are generally more expensive and limited in its maximum configuration size, distributed or cluster computing was the preferred choice.

Hardware configurations for large distributed systems can be complicated and confusing, especially when new systems and old systems are merged into larger systems over time.

With research efforts mostly focusing on such complicated systems, and engineering efforts mostly deal with performance issues rather than practical software development issues, do we expect cross platforms and performance portable parallel libraries? I think that's difficult.

Although there is also a significant amount of research effort on shared memory programming, shared memory machines have their own sets of issues to achieve scalable performance.

Simple parallel libraries for small machines are not interesting enough for researchers to work on. However, with the advent of multi-core processors, parallel programming becomes more affordable and ubiquitous. Sacrificing certain level of performance is reasonable for wider coverage. Parallel libraries work only for small processor core count can also be extremely useful.

Wednesday, July 21, 2010

Why Is Parallel Programming Difficult?

Parallel programming is generally perceived as an activity only for people going after high tech, bleeding edge research. It is difficult and alien enough to drive most software engineers away, whether it is really the case or merely their misconceptions. The fact is, software engineers run away from parallel programming while modern general purpose processors consist more and more multiple cores by force and by default. These processor cores are seriously under utilized.

Is parallel programming new? Nope. It has been studied for decades.

Languages and tools available for parallel programming? Yes. OpenMP, PVM, MPI, Cuda, OpenCL, Cilk++, PThreads, Java threads, ...

But, why is parallel programming difficult when there are tons publications and tools widely available? Isn't it a know how issue? When you master the theory, it should be simple, right? Let's see.

1. Managing Complex Data Dependency Is Non-Trivial

To design a parallel algorithm, the data dependency must be well managed. For statements such as,

A1. x = 10 + 5
A2. y = 20 - 8
A3. z = x * y

Statement A1 and A2 may be executed in parallel at different cores, while statement A3 must be executed only after x and y are available to the core executing statement A3. Not only we need to ensure proper order of execution, data required for the execution of a statement must be delivered in time to the core executing the statement.

When you instruct someone to do something, naturally you say,

B1. Drink the glass of milk at your hand.
B2. Walk to the kitchen.
B3. Wash the glass.

Most people design instructions with a series of actions. You wouldn't bother to specify washing the glass only after you finished drinking the milk and after you arrived in the kitchen near the sink, would you?

Functional programming is an interesting approach to simplify the data dependency complexity by using multiple definitions of stateless computation functions. A computation function definition inherently asks which data is required to compute, forcing one to specify the data dependency implicitly. However, not all computations can be expressed functionally in a convenient way.

2. Managing Data Access Conflicts Without Full Access Control
The system main memory is shared across multiple cores. It is handy when you need to maintain only single copy of data for multiple cores.

A core might be incrementing a variable with the following statements sequentially,

C1. y = x
C2. y += 1
C3. x = y

Since the memory is shared across multiple cores, x could be updated by another core after statement C1 and before statement C3 are executed. The update is effectively lost when statement C3 is then executed writing value of y into x.

Critical section, locking mechanism, or atomic instructions are introduced to protect or synchronize accesses to the variables ensuring no other cores will access the memory during the execution of these statements. This is essentially a passive approach. The programmer may forget to protect or by mistake having the variables unprotected leaving a silent loop hole behind and the execution can go on without any warnings.

This can get extremely complicated when memory pointers are used extensively without much control. You wouldn't know if two pointers are pointing to the same memory location unless you check for that explicitly. The explicit checking would incur too much overhead.

3. Managing Multiple Core Executions Driving One Insane
Current state of art parallel program is usually a single program being executed by multiple cores in parallel, i.e. each core executes a separate copy of the same program. Each of the cores performs computations on different range of data based on their uniquely associated IDs.

When you writes a program, you will simulate that program in your mind to make sure it does what you intended to compute. The same goes to parallel program. However, now you have to simulate the program for the executions of different number of cores and ensure each core would execute correctly even when there is no data for some cores to compute.

This is further complicated by the needs for multiple cores to execute different parts of the program at the same time. How many instances of core executions can you keep track in your mind at a time?

The amount of simulations required would increase multiple folds, especially when you are dealing with multi-dimensional data with many different corner cases.

This is evident especially when a parallel program does not work as expected, and you start debugging the program using your favorite tools. You may examine what are each core executing, and step through the program for each of the cores. Yes, for each of the cores! The big challenge is, how do you trace or back trace to find the problematic codes?

4. Non-Deterministic Behavior Is Confusing
Executing a parallel program multiple times may generate different results. The exact results depend largely on the exact data or task partitioning applied, the specific order of executions, the specific scheduling & allocations of processors at runtime, and the nature of the computation. Sometimes the execution incurs integer overflow, sometimes it doesn't. You get the point? In many cases, you can't simply compare the results with the serial version as the parallel version may be executing completely different computations dynamically.

How do you ensure your parallel program is written correctly when it's difficult to prove mathematically and you can't provide good and consistent reference results? It's simply difficult. Fortunately, it's less problematic for more structural computations which cover many scientific algorithms.

Introducing additional synchronization between the cores may reduce the degree of non-deterministic behavior. Consequently, that could reduces the overall performance significantly too which defeats the purpose of parallel programming itself.

It also posts difficulty to do comprehensive benchmarking. Depending on the nature of the parallel computation, the benchmarking results could be misleading as the total amount of computation, patterns of memory accesses etc. may vary across executions, even for the same input data. More input data sets should be used to obtain statistically significant results to draw meaningful conclusions.

How do you compare the performance of the parallel version against the serial or reference version when the computation could be completely different? It's simply confusing.

5. Lack Of Parallel Library For General Use
We hardly write everything from scratch. We often make use of reusable library for our tasks in hand. That makes programming much more manageable and simpler. Complicated tasks are encapsulated inside software libraries.

If we use parallel library as much as possible such as parallel sorting, concurrent hashing, parallel set, parallel binary tree search, etc. in our application, the application inherently becomes a parallel version, right? That should make parallel programming much easier.

But you don't find many reusable parallel library readily available, do you? Check out the article on Where Are All The Practical Parallel Algorithms and Libraries?

6. Performance Matters
Performance is one of the strongest points for one to make parallel programming difficult. It defeats the purpose if the performance gain is insignificant. For this single reason, one is willing to do the following,

Perform dynamic load balancing for uneven computation units.
Apply hardware and application specific task scheduling and processor allocation to maximize the processor utilization with other cost minimization objectives e.g. minimizing the distance between two frequently communicating cores, minimizing context switching, minimizing working sets, etc.
Apply hardware specific compute granularity optimization - coarse grain vs fine grain.
Apply hardware specific cache optimization to minimize memory fetches as multiple cores are sharing the memory, hence sharing the outstanding memory fetching slots and memory bandwidth.
Apply application specific coarse grain synchronization of memory trading off structured codes and maintainability for lower synchronization overhead.
Use traditional programming language such as C/C++, Fortran etc. trading off the ease of programming in modern dynamic scripting language for performance.
Develop application specific routines with hardware specific optimization trading off portability and re-usability for performance.
Design hardware and application specific memory layout trading off portability and re-usability for performance.
Apply I/O storage optimization to minimize I/O overhead and improving sustainable I/O bandwidth.

You may notice that some of the points above are not directly related to parallel programming. Nevertheless, they affect performance. So they may be included as part of the parallel programming process.

7. Legacy Software Is Often Used As The Entry Point To Parallelization
It is expensive to build a software from scratch. It makes sense to build a parallel version based on the existing serial version. Unfortunately, the software architecture of the serial version may not be suitable for building a parallel version.

Incremental parallelization sounds nice, but not always practical. Effective parallelization could require massive changes to the data structures for better data partitioning, data accesses to synchronize shared memory locations, computation structures for coarse grain parallelism, or even the algorithms for better parallelization. Global variables which are often found in serial program is particularly problematic for parallelization. How well structured is the data dependency in the legacy software is playing an important role to effective parallelization.

For more structural algorithms as of many scientific algorithms, they are always used for large-scale data, and their data dependency is often well structured. Then, parallelizing the serial version would be a good idea.