Showing posts with label CUDA.

Monday, June 18, 2012

Personal Supercomputing System with Quad GPUs

The secrets of the new Kepler GPUs have been revealed. The Kepler-based graphics cards have been studied extensively for gaming performance, and most reviews would suggest you don't need the upgrade; furthermore, the supported PCI-E 3.0 is of little to no use.

Well, it's probably a different story for CUDA programs. Here's the setup I'm going to use for testing CUDA programs extensively.

  • Asus P8Z77-V Premium
    • Supports dual 16x PCI-E. Most other boards support only single 16x.
    • Built-in 32GB SSD storage.
    • At the time of writing, the system is not fully functional when the 2nd GTX 690 is installed. The BIOS probably needs an update to fix the problems.
  • Intel Ivy Bridge Core i7 3770K 3.5GHz/8MB (3.9GHz at Turbo mode)
  • Corsair Vengeance 1600MHz 32GB
  • Asus GTX 690 4GB x 2
  • Kingston HyperX SSD 3K SATA3 240GB
  • Western Digital Black SATA3 2TB
  • Corsair AX1200 1200W
  • Antec Kuhler 920 (Liquid cooling)
    • The CPU runs extremely hot (80+ °C) with the stock fan at full load. Liquid cooling is highly recommended.
    • Unfortunately, the fan control software is rather limited. It controls the fan based on the liquid temperature, which is very different from the CPU temperature. As a result, the fan does not automatically ramp up when the CPU temperature is high, since the liquid temperature may still be much lower.
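
As a quick sanity check that both GTX 690 cards (four GPUs in total) are visible to CUDA, a minimal device-enumeration program along the following lines can be used. This is only an illustrative sketch against the standard CUDA runtime API; the file name devquery.cu is arbitrary.

// devquery.cu - list the CUDA devices visible to the runtime.
// Build and run: nvcc -o devquery devquery.cu && ./devquery
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Each GTX 690 carries two GPUs, so a dual-card system should report 4.
    std::printf("%d CUDA device(s) found\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("  Device %d: %s, %zu MB, compute capability %d.%d\n",
                    i, prop.name, prop.totalGlobalMem >> 20, prop.major, prop.minor);
    }
    return 0;
}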


Asus GTX 690 in a box

The GTX 690 is much longer than the old GTX 285, so make sure your case is long enough for the card.


Full system with dual GTX 690


A closer look at the LED-lit GTX 690


Sunday, June 26, 2011

Being an Nvidia CUDA Certified Programmer!

It takes some courage and effort to take the Nvidia CUDA Certification exam. You'll have to pay S$350 for it, yet there is no guarantee it will be of real use to your business or career. The exam questions are perfect for squeezing out all your brain juice.

After much feedback, long waiting, and delayed plans, I finally received an email confirming that I am now an Nvidia CUDA certified programmer. Better late than never. But what's next?

Let's hear from all Nvidia CUDA Certified Programmers, starting from Singapore. Anyone else CUDA certified? What are your plans for CUDA?

Tuesday, May 3, 2011

Web Seminar: Programming GPUs Beyond CUDA

GPU/CUDA programming is easy if we ignore the performance, or even the correctness, of the program. It becomes tough when performance is critical, as one has to optimize heavily for the specific hardware. Fortunately, GPU hardware performance improves drastically every two years. Unfortunately, the performance is not portable across different generations of GPUs.

Prof Chen from Tsinghua University is proposing MapCG, a MapReduce framework, as a solution to the portability problem.

Check out the details of the seminar in the following link:

http://www.gpucomputing.net/?q=node/5277

Saturday, April 30, 2011

First Release of SGC Ruby CUDA - Beginning of a Long Path

Today we decided to put up the first release of SGC Ruby CUDA, v0.1.0, as a means to attract Rubyists to try out GPU programming in their new toy projects, and also to encourage HPC developers to evaluate whether Ruby is a good fit for their HPC applications.

When important software libraries are not available in Ruby, we certainly cannot expect much Ruby usage in this area. As time runs short and more and more hardware piles up underutilized, we are urged to take the first fundamental step towards Ruby in HPC applications by making important SDKs, such as the Nvidia CUDA SDK, available to Ruby programs.

Rubyists who are new to GPU programming can now access CUDA GPUs easily to harness the massively parallel architecture of GPUs. At the other end of the spectrum, HPC developers now have the choice of managing their complex applications largely in Ruby, while retaining only a relatively small portion of code in C/C++/CUDA C, etc.

We believe Ruby could improve productivity and maintainability tremendously, since in many cases heavy computation happens only in a small section of the code, and Ruby simplifies the software architecture and implementation significantly. Even when performance is so critical that everything must eventually be ported back from Ruby to C/C++/CUDA C, one has already saved tremendous effort in software architecture and design, arriving at a design that is manageable, extendable, and easy to use. The port back to C/C++/CUDA C then becomes much more straightforward, as one has already gained much knowledge about the domain.

Developing a complex application from scratch in C/C++/CUDA C, by contrast, means going down an unforeseeable, winding path to reach the same state, with a very high risk of failure. Hence, we believe this could mark the start of Ruby programming for HPC applications.

SGC Ruby CUDA has been updated significantly since the last post about it. As we have packaged it into a Ruby gem, you can now install it with
gem install sgc-ruby-cuda
The code repository is hosted on GitHub: SGC Ruby CUDA.
The documentation is available at rubydoc.info: SGC Ruby CUDA Doc.
Feel free to join the discussion group/mailing list at SGC Ruby CUDA Google Group.

Friday, November 19, 2010

Using SGC-Ruby-CUDA on the Newly Launched Amazon EC2 Cluster GPU

Wondering if GPUs would work for you? No budget for a system with a decent GPU? Installation and configuration too much trouble? You can now try out SGC-Ruby-CUDA on Amazon EC2 with the system image SGCRubyCUDA.1, which is available as a community AMI in the US East (Virginia) zone.

Compile the rubycu shared library and run the tests.

[root@ip-10-17-130-174 sgc-ruby-cuda.git]# rake
(in /root/sgc-ruby-cuda.git)
checking for main() in -lcuda... yes
creating Makefile
g++44 -I. -I/usr/local/include/ruby-1.9.1/x86_64-linux -I/usr/local/include/ruby-1.9.1/ruby/backward -I/usr/local/include/ruby-1.9.1 -I.   -fPIC -O3 -ggdb -Wextra -Wno-unused-parameter -Wno-parentheses -Wpointer-arith -Wwrite-strings -Wno-missing-field-initializers -Wno-long-long   -o rubycu.o -c rubycu.cpp
g++44 -shared -o rubycu.so rubycu.o -L. -L/usr/local/lib -Wl,-R/usr/local/lib -L.  -rdynamic -Wl,-export-dynamic   -lcuda  -lpthread -lrt -ldl -lcrypt -lm   -lc

[root@ip-10-17-130-174 sgc-ruby-cuda.git]# rake test
(in /root/sgc-ruby-cuda.git)
/usr/local/bin/ruby -I"lib:lib" "/usr/local/lib/ruby/1.9.1/rake/rake_test_loader.rb" "test/test_rubycu.rb" 
Loaded suite /usr/local/lib/ruby/1.9.1/rake/rake_test_loader
Started
......................
Finished in 89.055900 seconds.

22 tests, 99 assertions, 0 failures, 0 errors, 0 skips

Test run options: --seed 25668

Build the Ruby gem, install it, and try out some SGC-Ruby-CUDA APIs.

[root@ip-10-17-130-174 sgc-ruby-cuda.git]# rake gem
(in /root/sgc-ruby-cuda.git)
mkdir -p pkg
  Successfully built RubyGem
  Name: sgc-ruby-cuda
  Version: 0.0.1
  File: sgc-ruby-cuda-0.0.1-x86_64-linux.gem
mv sgc-ruby-cuda-0.0.1-x86_64-linux.gem pkg/sgc-ruby-cuda-0.0.1-x86_64-linux.gem

[root@ip-10-17-130-174 sgc-ruby-cuda.git]# cd pkg
[root@ip-10-17-130-174 pkg]# gem install sgc-ruby-cuda-0.0.1-x86_64-linux.gem 
Successfully installed sgc-ruby-cuda-0.0.1-x86_64-linux
1 gem installed
Installing ri documentation for sgc-ruby-cuda-0.0.1-x86_64-linux...
Installing RDoc documentation for sgc-ruby-cuda-0.0.1-x86_64-linux...

[root@ip-10-17-130-174 pkg]# gem list

*** LOCAL GEMS ***

minitest (1.6.0)
rake (0.8.7)
rdoc (2.5.8)
sgc-ruby-cuda (0.0.1 x86_64-linux)

[root@ip-10-17-130-174 pkg]# irb
irb(main):001:0> require 'rubycu'
=> true
irb(main):002:0> include SGC::CU
=> Object
irb(main):004:0> CUDevice.get_count
=> 2
irb(main):005:0> d = CUDevice.get(0)
=> #<SGC::CU::CUDevice:0x0000000908c920>
irb(main):006:0> c = CUContext.new.create(0, d)
=> #<SGC::CU::CUContext:0x0000000907af40>
irb(main):007:0> d.get_name
=> "Tesla M2050"
irb(main):009:0> d.compute_capability
=> {:major=>2, :minor=>0}
irb(main):010:0> d.total_mem
=> 2817982464
Note: Remember to select the Cluster GPU instance type when launching the instance.

Friday, September 17, 2010

Unigine crew: CUDA vs OpenCL vs SPU Part IV

Which language or library you choose for your software development has a great and prolonged impact on the software. I just came across a simple yet interesting benchmark. Perhaps more details on why such numbers are obtained would be even more enlightening.

Unigine crew: CUDA vs OpenCL vs SPU Part IV

CUDA Programming with Ruby

Need GPU computing power in your Ruby program? Great! SpeedGo Computing is developing Ruby bindings for CUDA, called sgc-ruby-cuda. Take advantage of your Nvidia CUDA-enabled graphics cards with Ruby now.

Currently, only part of the CUDA Driver API is included. More components such as the CUDA Runtime API will be included to make it as complete as possible.



require 'rubycu'

include SGC::CU

SIZE = 10
c = CUContext.new

d = CUDevice.get(0) # Get the first device.
c.create(0, d) # Use this device in this CUDA context.

m = CUModule.new
m.load("vadd.ptx") # 'nvcc -ptx vadd.cu'
# vadd.cu is a CUDA kernel program.

da = CUDevicePtr.new # Pointer to device memory.
db = CUDevicePtr.new
dc = CUDevicePtr.new

da.mem_alloc(4*SIZE) # Each Int32 is 4 bytes.
db.mem_alloc(4*SIZE) # Allocate device memory.
dc.mem_alloc(4*SIZE)

ha = Int32Buffer.new(SIZE) # Allocate host memory.
hb = Int32Buffer.new(SIZE)
hc = Int32Buffer.new(SIZE)
hd = Int32Buffer.new(SIZE)

(0...SIZE).each { |i| ha[i] = i }
(0...SIZE).each { |i| hb[i] = 2 }
(0...SIZE).each { |i| hc[i] = ha[i] + hb[i] }
(0...SIZE).each { |i| hd[i] = 0 }

memcpy_htod(da, ha, 4*SIZE) # Transfer inputs to device.
memcpy_htod(db, hb, 4*SIZE)

f = m.get_function("vadd")
f.set_param(da, db, dc, SIZE)
f.set_block_shape(SIZE)
f.launch_grid(1) # Execute kernel program in the device.

memcpy_dtoh(hd, dc, 4*SIZE) # Transfer outputs to host.

puts "A\tB\tCPU\tGPU"
(0...SIZE).each { |i|
  puts "#{ha[i]}\t#{hb[i]}\t#{hc[i]}\t#{hd[i]}"
}


da.mem_free # Free device memory.
db.mem_free
dc.mem_free

c.detach # Release context.

/* vadd.cu */
extern "C" {
__global__ void vadd(const int* a,
                     const int* b,
                     int* c,
                     int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
}
Although the kernel program still needs to be written in CUDA C, these Ruby bindings provide a first bridging step towards GPU computing in Ruby.

How to execute?


$ ruby extconf.rb
checking for main() in -lcuda... yes
creating Makefile
$ make
...
g++ -shared -o rubycu.so rubycu.o ...
$ nvcc -ptx vadd.cu
$ ruby -I . test.rb
A B CPU GPU
0 2 2 2
1 2 3 3
2 2 4 4
3 2 5 5
4 2 6 6
5 2 7 7
6 2 8 8
7 2 9 9
8 2 10 10
9 2 11 11
Cool! The summation of the two vectors is performed on the GPU.


Saturday, July 31, 2010

Parallel Programming - What Are The Options?

There are simply way too many parallel programming languages and libraries to keep track of. Many of them are no longer actively developed, or are difficult to get working on modern operating systems. What are the practical options currently available for multi-core CPUs and GPUs?
  • OpenMP
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: Use directives, e.g. #pragma omp parallel {} in C/C++/Fortran, to parallelize loops or code regions.
    • Supported by decent compilers.
    • Non-supporting compilers ignore the directives and compile the code as a serial program.
    • Very good for incremental parallelization.
  • Cilk++
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: Uses new C++ keywords, namely cilk_spawn to invoke a Cilk linkage function asynchronously, cilk_sync to synchronize with locally spawned functions, and cilk_for to parallelize a for-loop.
    • The Cilk++ runtime system takes care of thread scheduling, which eases nested parallelization tremendously while maintaining a certain level of efficiency.
    • Requires the Cilk++ compiler and Cilk++ runtime system.
    • Very good for parallelizing dynamic code with low overhead.
  • TBB
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: C++ function objects or C++0x lambda expressions as work units, parallelized with template functions e.g. parallel_do, parallel_for, parallel_reduce, parallel_pipeline, etc. Concurrent storage classes e.g. concurrent_vector are also provided.
    • Portable to multiple platforms with good C++ support.
    • Uses C++ templates and function objects extensively; C++ beginners might have difficulty reading and writing the code.
    • Allows many customization options at the task level, which can get complicated and messy, but threads are abstracted, i.e. thread scheduling is taken care of.
    • Recommended only for heavy C++ users.
  • Pthreads or the thread library built into a language
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: Provides a library of functions to create, destroy, and synchronize threads.
    • Pthreads is well supported on Unix/Linux systems, but Windows requires an external library.
    • Low-level, explicit manipulation of threads.
    • Not recommended for general parallel programming tasks.
  • OpenCL
    • Hardware: Shared memory multi-core CPU system or an OpenCL-supported GPU.
    • Parallelization: Provides a library of functions to massively execute a kernel function on a supported device.
    • Supported by the ATI Stream SDK and the Nvidia OpenCL SDK.
    • Requires OpenCL runtime support for the targeted devices.
    • Well suited for data-parallel or streaming computation.
    • Not recommended for direct use in general parallel programming; use wrappers over OpenCL instead.
  • CUDA
    • Hardware: CUDA-enabled Nvidia GPU.
    • Parallelization: Provides a kernel invocation method to massively execute a kernel function on a CUDA-enabled Nvidia GPU. The invocation requires the CUDA compiler to parse its special launch syntax of the form kernel_method<<<grid_dim, block_dim, shared_mem_size, stream>>> (see the sketch after this list).
    • Supported by the Nvidia CUDA SDK.
    • Requires the CUDA compiler and CUDA runtime system.
    • Well suited for data-parallel or streaming computation.
    • The CUDA programming guide documents well the requirements for achieving good performance on CUDA-enabled Nvidia GPUs.
    • Recommended for GPU programming on Nvidia GPUs.
  • Brook+
    • Hardware: Shared memory multi-core CPU system or a CAL-supported ATI GPU.
    • Parallelization: Allows specification of kernel functions that accept streams of data. A kernel function is invoked like a normal function. Kernel functions require the Brook+ compiler to parse their syntax.
    • Supported by the ATI CAL and x86 CPU backends.
    • Requires the Brook+ compiler and Brook+ stream runtime system.
    • Well suited for data-parallel computation.
    • AMD has been promoting the use of OpenCL for ATI GPU programming. Brook+ is open source; however, its development is no longer active.
  • MPI
    • Hardware: Shared memory multi-core CPU system or a cluster of computers.
    • Parallelization: Provides a library of functions for message passing between processes, i.e. point-to-point and collective communications.
    • Supported by third-party libraries such as MPICH, Open MPI, etc.
    • Requires a communication runtime system.
    • Low-level manipulation of buffers and process-to-process communication.
    • Very popular for programming HPC clusters, but not recommended for general parallel programming.
  • PVM
    • Hardware: Shared memory multi-core CPU system or distributed systems.
    • Parallelization: Provides a library of functions for message passing between tasks.
    • Supported by third-party libraries such as Netlib PVM3.
    • Uses standard network interfaces such as TCP/IP for higher interoperability across distributed systems.
    • Low-level manipulation of buffers and task-to-task communication.
  • Charm++
    • Hardware: Shared memory multi-core CPU system or distributed systems.
    • Parallelization: Object-oriented C++ work units called chares, which may communicate with other chares through proxy objects.
    • Schedules computations based on the availability of data.
    • Requires the Charm++ compiler and Charm++ runtime system.
  • uC++
    • Hardware: Shared memory multi-core CPU system.
    • Parallelization: Provides C++ coroutines for independent execution.
    • The runtime system schedules virtual processors onto OS kernel threads.
    • Requires the uC++ compiler and uC++ kernel.
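
To make the CUDA launch syntax above concrete, here is a minimal, self-contained sketch of the kernel_method<<<grid_dim, block_dim, shared_mem_size, stream>>> form, reusing the same vector-addition kernel as the earlier Ruby post. It is illustrative only; the file name vadd_launch.cu is arbitrary.

// vadd_launch.cu - minimal example of the CUDA kernel launch syntax.
// Build and run: nvcc -o vadd_launch vadd_launch.cu && ./vadd_launch
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vadd(const int* a, const int* b, int* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 10;
    int ha[n], hb[n], hc[n];
    for (int i = 0; i < n; ++i) { ha[i] = i; hb[i] = 2; }

    // Allocate device memory and copy the inputs over.
    int *da, *db, *dc;
    cudaMalloc((void**)&da, n * sizeof(int));
    cudaMalloc((void**)&db, n * sizeof(int));
    cudaMalloc((void**)&dc, n * sizeof(int));
    cudaMemcpy(da, ha, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(int), cudaMemcpyHostToDevice);

    // kernel_method<<<grid_dim, block_dim, shared_mem_size, stream>>>(args...)
    // Here: 1 block of n threads, no dynamic shared memory, default stream.
    vadd<<<1, n, 0, 0>>>(da, db, dc, n);

    // Copy the result back and print it.
    cudaMemcpy(hc, dc, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        std::printf("%d + %d = %d\n", ha[i], hb[i], hc[i]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}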