
Saturday, August 14, 2010

Parallelizing Matrix Multiplication using OpenMP in One Line

Matrix multiplication is often used as an academic example. It is well suited to parallelization because of its intensive O(N^3) computation and largely independent operations. Parallel programming is hard, so would it surprise you that we can parallelize matrix multiplication with merely one line of OpenMP directive?

Serial Matrix Multiplication

/* matrix.cpp */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}
Analyzing the data dependency of the operation c[i][j] += a[i][k] * b[k][j], we can see that each iteration of i is independent of the others. The same applies to the iterations of j; only the k iterations accumulate into the same element. For simplicity, we consider parallelizing only a single for-loop.

Parallel Matrix Multiplication using OpenMP

/* matrix-omp.cpp */
const int size = 1000;

float a[size][size];
float b[size][size];
float c[size][size];

int main()
{
    // Initialize buffers.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = (float)i + j;
            b[i][j] = (float)i - j;
            c[i][j] = 0.0f;
        }
    }

    // Compute matrix multiplication.
    // C <- C + A x B
    #pragma omp parallel for default(none) shared(a,b,c)
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

    return 0;
}
With the OpenMP directive (#pragma), the iterations of the i-loop are divided into chunks, and each chunk is assigned to a thread. Hence, multiple threads compute their assigned chunks in parallel; the number of threads typically defaults to the number of cores. That's it.
$ g++ -O2 matrix.cpp -o matrix
$ g++ -O2 -fopenmp matrix-omp.cpp -o matrix-omp
$ time ./matrix
real 0m12.976s
user 0m12.460s
sys 0m0.059s
$ time ./matrix-omp
real 0m6.952s
user 0m12.556s
sys 0m0.050s
Bravo! The parallel version is about 1.86x faster than the serial version on a dual-core system. However, the code above is for illustration purposes only; there are still many optimization opportunities before it achieves a commercial performance level.
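
As an example of one such opportunity (a sketch added here for illustration, not part of the original code or measurements), interchanging the j and k loops lets both c and b be traversed row by row in the inner loop, which is much friendlier to the cache; the result is identical because the k iterations still accumulate into c[i][j]. Only the core loop is shown:

// Loop-interchanged version of C <- C + A x B.
#pragma omp parallel for
for (int i = 0; i < size; ++i) {
    for (int k = 0; k < size; ++k) {
        float aik = a[i][k];
        for (int j = 0; j < size; ++j) {
            c[i][j] += aik * b[k][j];
        }
    }
}

Blocking (tiling) the loops and enabling vectorization would go further still.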

Wednesday, August 11, 2010

Parallel Programming - Hello World

Many computer science/engineering students write a Hello World program in their first programming lecture. What was your first parallel program? What about Hello World in OpenMP, MPI, Cilk++, TBB, Ruby threads, and PThreads?

Hello World in C

/* hello.c */
#include <stdio.h>

int main()
{
    printf("hello world\n");
    return 0;
}
$ gcc hello.c -o hello
$ ./hello
hello world

Hello World in OpenMP

/* hello-omp.c */
#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int n = omp_get_num_threads();
        int id = omp_get_thread_num();

        printf("hello world %d/%d\n", id, n);
    }
    return 0;
}
$ gcc -fopenmp hello-omp.c -o hello-omp
$ ./hello-omp
hello world 1/2
hello world 0/2
$ ./hello-omp
hello world 0/2
hello world 1/2
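
As a side note (not shown in the original run), the number of OpenMP threads can be controlled with the standard OMP_NUM_THREADS environment variable, for example:

$ OMP_NUM_THREADS=4 ./hello-omp

which would print four lines, one per thread.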

Hello World in MPI

/* hello-mpi.c */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int n, id;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    printf("hello world %d/%d\n", id, n);

    MPI_Finalize();
    return 0;
}
$ mpicc hello-mpi.c -o hello-mpi
$ mpirun -np 2 ./hello-mpi
hello world 0/2
hello world 1/2
$ mpirun -np 2 ./hello-mpi
hello world 1/2
hello world 0/2

Hello World in Cilk++

/* hello.cilk */
#include <stdio.h>

void hello()
{
    printf("hello world\n");
}

int cilk_main()
{
    cilk_spawn hello();
    cilk_spawn hello();
    cilk_sync;
    return 0;
}
$ cilk++ hello.cilk -o hello-cilk
$ ./hello-cilk
hello world
hello world

Hello World in TBB

/* hello-tbb.cpp */
#include <iostream>
#include <tbb/parallel_for.h>

using namespace tbb;

class Hello
{
public:
    void operator()(int x) const {
        std::cout << "Hello world\n";
    }
};

int main()
{
    // parallelizing:
    // for(int i = 0; i < 2; ++i) { ... }
    parallel_for(0, 2, 1, Hello());

    return 0;
}
$ g++ hello-tbb.cpp -ltbb -o hello-tbb
$ ./hello-tbb
Hello world
Hello world

Hello World in Ruby Thread

#!/usr/bin/ruby
# hello.rb

def hello
  id = Thread.current.object_id
  puts "hello world #{id}"
end

t1 = Thread.new do
  hello
end
t2 = Thread.new do
  hello
end

t1.join
t2.join

$ ./hello.rb
hello world 69858970515620
hello world 69858970515540

Hello World in PThread

/* hello-pthread.c */
#include <stdio.h>
#include <pthread.h>

void* hello(void* arg)
{
    long id = (long)pthread_self();
    printf("hello world %ld\n", id);
    pthread_exit(NULL);
}

int main()
{
    pthread_t tid1, tid2;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    pthread_create(&tid1, &attr, hello, NULL);
    pthread_create(&tid2, &attr, hello, NULL);

    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);

    pthread_attr_destroy(&attr);
    pthread_exit(NULL);

    return 0;
}
$ gcc hello-pthread.c -lpthread -o hello-pthread
$ ./hello-pthread
hello world 139999688689424
hello world 139999680296720

Saturday, July 31, 2010

Parallel Programming - What Are The Options?

There are simply way too many parallel programming languages and libraries to keep track of. Many of them are no longer actively developed, or are difficult to get working on decent operating systems. What are the practical options currently available for multi-core CPUs or GPUs?
  • OpenMP
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: Use directives, e.g. #pragma omp parallel {}, in C/C++/Fortran to parallelize loops or code regions.
    • Supported by decent compilers.
    • Non-supporting compilers simply ignore the directives and compile the code as a serial program.
    • Very good for incremental parallelization.
  • Cilk++
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: Uses new C++ keywords, namely cilk_spawn to invoke a Cilk linkage function asynchronously, cilk_sync to synchronize with locally spawned functions, and cilk_for to parallelize a for-loop.
    • The Cilk++ runtime system takes care of thread scheduling, which eases nested parallelization tremendously while maintaining a certain level of efficiency.
    • Requires Cilk++ compiler and Cilk++ runtime system.
    • Very good for parallelizing dynamic codes with low overhead.
  • TBB
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: C++ function objects or C++0x lambda expressions as work units, parallelized with template functions e.g. parallel_do, parallel_for, parallel_reduce, parallel_pipeline, etc. Concurrent storage classes e.g. concurrent_vector are also provided. (A small lambda-based sketch appears after this list.)
    • Portable to multiple platforms that have good C++ support.
    • Uses C++ templates and function objects extensively; C++ beginners might have difficulty reading and writing the code.
    • Allows many customization options at the task level, which can get complicated and messy, but threads are abstracted, i.e. thread scheduling is taken care of.
    • Recommended only for heavy C++ users.
  • PThread or thread libraries built into languages
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: Provides a library of functions to create, destroy, and synchronize threads.
    • Pthreads are well supported on Unix/Linux systems, but Windows requires an external library.
    • Low level and explicit manipulations of threads.
    • Not recommended for general parallel programming tasks.
  • OpenCL
    • Hardware: Shared memory multi-core CPU system or OpenCL supported GPU.
    • Parallelization: Provides a library of functions to massively execute a kernel function on a supported device.
    • Supported by ATI Stream SDK and Nvidia OpenCL SDK.
    • Requires OpenCL runtime support for the targeted devices.
    • Well suited for data parallel or streaming computation.
    • Not recommended for direct use in general parallel programming; use OpenCL wrappers instead.
  • CUDA
    • Hardware: CUDA enabled Nvidia GPU.
    • Parallelization: Provides a kernel invocation method to massively execute a kernel function on a CUDA enabled Nvidia GPU. The invocation method requires the CUDA compiler to parse its special syntax in the form kernel_method<<<grid_dim, block_dim, shared_mem_size, stream>>>.
    • Supported by Nvidia CUDA SDK.
    • Requires CUDA compiler and CUDA runtime system.
    • Well suited for data parallel or streaming computation.
    • The CUDA programming guide is well documented for the requirements to achieve good performance with CUDA enabled Nvidia GPU.
    • Recommended for GPU programming on Nvidia GPUs.
  • Brook+
    • Hardware: Shared memory multi-core CPU system or CAL supported ATI GPU.
    • Parallelization: Allows specification of kernel functions that accept streams of data. A kernel function is invoked like a normal function. The specification of a kernel function requires the Brook+ compiler to parse its syntax.
    • Supported by ATI CAL and x86 CPU backend.
    • Requires Brook+ compiler and Brook+ stream runtime system.
    • Well suited for data parallel computation.
    • AMD has been promoting the use of OpenCL for ATI GPU programming. Brook+ is open source; however, its development is no longer active.
  • MPI
    • Hardware: Shared memory multi-core CPU system or cluster of computers.
    • Parallelization: Provides a library of functions for message passing between processes i.e. point-to-point and collective communications.
    • Supported by third party libraries such as MPICH, Open MPI, etc.
    • Requires communication runtime system.
    • Low level manipulations of buffers and process-process communications.
    • Very popular for programming HPC clusters, but not recommended for general parallel programming.
  • PVM
    • Hardware: Shared memory multi-core CPU system or distributed systems.
    • Parallelization: Provides a library of functions for message passing between tasks.
    • Supported by third party library such as Netlib PVM3.
    • Uses standard network interfaces such as TCP/IP for higher interoperability across distributed systems.
    • Low level manipulations of buffers and task-task communications.
  • Charm++
    • Hardware: Shared memory multi-core CPU system or distributed systems.
    • Parallelization: Object-oriented C++ work units called chares, which may communicate with other chares using proxy objects.
    • Scheduling computations based on availability of data.
    • Requires Charm++ compiler and Charm++ runtime system.
  • uC++
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: Provides C++ coroutines for independent executions.
    • The runtime system performs scheduling of virtual processors using OS kernel threads.
    • Requires uC++ compiler and uC++ kernel.
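
To illustrate the lambda style mentioned under TBB above, here is a small sketch; it assumes a C++0x/C++11 compiler and TBB's compact parallel_for(first, last, body) overload, and it is not taken from any of the posts above:

/* hello-tbb-lambda.cpp */
#include <iostream>
#include <tbb/parallel_for.h>

int main()
{
    // Same effect as the function-object version of hello-tbb.cpp,
    // but the work unit is a lambda expression.
    tbb::parallel_for(0, 2, [](int i) {
        std::cout << "Hello world\n";
    });

    return 0;
}

It should build with something like g++ -std=c++0x hello-tbb-lambda.cpp -ltbb -o hello-tbb-lambda.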