Saturday, September 20, 2008

Python vs. Cython vs. D (PyD) vs. C++ (SWIG)

In April 2008 there was a thread on the scipy-dev list regarding the inclusion of Cython code in SciPy. In that thread, I mentioned a particular use case of interest to me -- creating and manipulating an array of objects (rather than arrays of elementary data types) and being able to do that with Cython easily and efficiently.

The problem I was considering is a simple one. I create a list of "Vortex" objects and compute (naively) the velocity of a collection of these particles on one another. This is an O(N^2) computation since every particle influences every other. The idea was to create simple OO code to be able to perform these computations. Here is an outline of the Python code for doing this:

from math import pi

class Vortex(object):
    def __init__(self, pos=0.0, strength=1.0):
        self.position = pos
        self.strength = strength
        self.velocity = complex(0, 0)

    def eval_velocity(self, pos):
        # Velocity induced at `pos` by this vortex.
        return -1j*self.strength/(2*pi*(pos - self.position))

class VortexManager(object):
    def __init__(self, vortices=None):
        # vortices is a list of Vortex objects.
        self.vortices = vortices

    def set(self, pos, str):
        # Set the positions and strengths of the vortices.
        # ...
        pass

    def velocity(self):
        # O(N^2): every vortex induces a velocity on every other.
        for va in self.vortices:
            vel = complex(0, 0)
            for vb in self.vortices:
                if va is vb:
                    continue
                else:
                    vel += vb.eval_velocity(va.position)
            va.velocity = vel


Very straightforward code. Now, back in April I implemented this in pure Python and in C++, and wrapped the C++ code for Python using SWIG. I also implemented it in D and wrapped that using PyD. I found that D was about 1.7 times slower than C++, and C++ was about 300-400 times faster than the pure Python version.

I attended Robert Bradshaw's Cython tutorial at SciPy08 and really liked it. About 10 days ago I finally found the time to create a Cython version and the winner is ...

Cython!

I've put up all of the code here. To use the code, untar the tarball and do the following:

$ cd cpython_d_cpp
$ python setup.py build_ext --inplace
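
For reference, a minimal setup.py that builds just a Cython extension would look roughly like the sketch below (the module name "vortex" is only illustrative; the setup.py in the tarball also handles the SWIG and PyD extensions):

# setup.py -- minimal sketch for building a single Cython extension.
# The module name "vortex" is illustrative, not the name used in the tarball.
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    name='vortex',
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension('vortex', ['vortex.pyx'])],
)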

This requires SWIG, numpy and Cython to build. If you have PyD installed it will build the PyD extension also. To test the performance do this:

$ python test_perf.py

This produces the following output for me on a P4 (3 GHz):

dee(4000): 1.87730193138
(1411.53285812+1411.53285812j) (945.091286479+945.091286479j)
swig(4000): 1.10782289505
(1411.53285812+1411.53285812j) (945.091286479+945.091286479j)
cython(4000): 1.15034103394
(1411.53285812+1411.53285812j) (945.091286479+945.091286479j)
Pure Python(200): 1.14771318436
# N SWIG Cython Ratio
1000 0.071 0.069 0.967
2000 0.283 0.274 0.968
3000 0.638 0.619 0.970
4000 1.135 1.100 0.970
5000 1.767 1.720 0.973
6000 2.517 2.473 0.983
7000 3.474 3.370 0.970
8000 4.541 4.403 0.970
9000 5.698 5.575 0.978
10000 7.000 6.879 0.983


The first few numbers just test one single case of 4000 particles. D is slower than both C++ and Cython. Python is dog slow (or donkey slow, as I like to say)! For some reason I was getting segfaults when I tried to test the D wrapper with more than 3000 particles. On my MacBook the Cython version is actually 30% faster than the C++ version, and on a Sempron 2800 Cython is about 25% slower, so different machines produce different numbers. However, C++ and Cython are both in the same ballpark.

What I loved about the Cython code is that I can use a Python list to manage the Vortex objects. This shows that we can use the normal Python containers to manage objects, which is extremely convenient. This isn't very surprising either, since Python's containers are heavily optimized. "cython -a" was a huge help when getting things done. For more details on the Cython code, look at the tarball.
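
Just to give a flavour of the approach, here is a rough sketch (written for this post, not the actual code from the tarball) of how the Vortex classes might be done as cdef extension types. It avoids complex numbers, which Cython did not support, and keeps the vortices in a plain Python list:

# vortex_sketch.pyx -- an illustrative sketch only, not the tarball code.
cdef double PI = 3.14159265358979323846

cdef class Vortex:
    cdef public double x          # position
    cdef public double y
    cdef public double u          # velocity
    cdef public double v
    cdef public double strength

    def __init__(self, double x=0.0, double y=0.0, double strength=1.0):
        self.x = x
        self.y = y
        self.strength = strength
        self.u = 0.0
        self.v = 0.0

cdef class VortexManager:
    cdef public object vortices   # a plain Python list of Vortex objects

    def __init__(self, vortices=None):
        if vortices is None:
            vortices = []
        self.vortices = vortices

    def velocity(self):
        cdef Vortex va, vb
        cdef double dx, dy, r2, factor
        for va in self.vortices:
            va.u = 0.0
            va.v = 0.0
            for vb in self.vortices:
                if va is vb:
                    continue
                dx = va.x - vb.x
                dy = va.y - vb.y
                r2 = dx*dx + dy*dy
                factor = vb.strength/(2.0*PI*r2)
                # real and imaginary parts of -1j*strength/(2*pi*(za - zb))
                va.u += -factor*dy
                va.v += -factor*dx

Because va and vb are declared as Vortex, the attribute accesses in the inner loop compile down to plain C struct accesses, which is where most of the speedup over pure Python comes from.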

Clearly, if you are building code from scratch and need speed, Cython is an excellent option. For this I really must congratulate the Cython and Pyrex developers.

6 comments:

Unknown said...

How do you explain that Cython was faster than C++?

I looked at your code, but it seems to me that the SWIG wrappers don't slow it down, e.g. you would not get any speedups if you wrapped the C++ using Cython, am I right?

If I remember well, you were opposing Cython quite a lot, so I am glad you changed your mind. :))

Ondrej

Prabhu Ramachandran said...

Well, there is a difference in implementation which may explain things. I am not using complex numbers in the Cython implementation (simply because Cython didn't support them).

I really am only checking the velocity, so there isn't the shadow class overhead. So it isn't SWIG. It could be the difference between STL vectors and Python lists.

The other issue is that you can't quite say Cython is always faster, since on some machines it seems slower, but yes, the results are a little surprising. I think a little more testing, perhaps of a simple problem using a Python list and, say, a numpy array, may be illustrative as well.
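
As a rough sketch of that kind of comparison (not code from the tarball), something like this would do -- timing a trivial reduction over a list of small Python objects against the same reduction over a numpy array:

# A quick-and-dirty comparison sketch: a Python list of objects versus
# a numpy array for a trivial reduction.
import timeit

list_setup = """
class P(object):
    def __init__(self, x):
        self.x = x
particles = [P(float(i)) for i in range(2000)]
"""
array_setup = "import numpy; x = numpy.arange(2000, dtype=float)"

t_list = timeit.Timer("sum(p.x for p in particles)", list_setup).timeit(1000)
t_array = timeit.Timer("x.sum()", array_setup).timeit(1000)
print("list: %g s  numpy: %g s" % (t_list, t_array))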

I really wasn't stopping anyone from using Cython. I was using Pyrex for certain things myself. All I was saying at that time was that if you needed containers and wanted OO code, Pyrex/Cython didn't seem like a good choice. That has changed now. :)

Robert Kern said...

I've been working on reimplementing the Delaunay triangulation that's currently in scikits.delaunay. I started with a Cython implementation using Point and Triangle classes. I was implementing a DAG structure of Triangles (called a Delaunay tree, for some inexplicable reason). Each Triangle had its list of descendants stored as a Python list.

For this particular use case, I think the idea of using Python objects in Python containers breaks down. I noticed that I was getting much worse O() behavior than I should have been. Using the CPU Sampler in Instruments.app, I saw that all of my time was being spent in Python's garbage collection, well before anything should be decref'ed down to 0. It turns out that the default allocator for Python types triggers gc.collect(). For a large data structure with many Python objects and an algorithm that creates objects at every step, the O() complexity gets completely hosed: every object that you've already created will get traversed. Even if you replace your type's allocators, allocating the Python containers will still trigger the garbage collector. I had to completely disable garbage collection during my algorithm. That's not so terrible, but still...
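
In code, that workaround is essentially the following (DelaunayTree and insert are placeholder names for this sketch, not the actual scikits code):

import gc

def build_structure(points):
    # Disable cyclic garbage collection while allocating lots of small
    # objects, and re-enable it afterwards.  DelaunayTree is a
    # placeholder name used only for this sketch.
    gc.disable()
    try:
        tree = DelaunayTree()
        for p in points:
            tree.insert(p)    # each insert allocates new Triangle objects
        return tree
    finally:
        gc.enable()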

Fortunately, with my Cython prototype, it was relatively straightforward to recode the data structures in C, using c-algorithms.sf.net to provide the necessary containers. Cython is still used to implement the extension module, and the new buffer syntax is great for working with numpy arrays.

Gaël said...

I noticed that in _eval_velocity you were passing the output arguments by reference. Why is that?

david said...

I first thought it might be the memory allocation, but that's not it; the Cython version does more allocations than the C++ version according to valgrind.

I tried the C++ code directly in C++ to avoid the Python overhead, but the Python overhead from SWIG is not that great: it is around a factor of 2/3.

Most of the computation seems to be done inside one function, which does not use the container in any 'funny' way, so I don't think the vector plays any role here. The FFTW cycle counter tells me that for 4000 particles it takes 1e9 cycles, meaning ~16e6 calls to velocity, that is ~60 cycles per function call; the function call is likely to be the killer, since it can't be inlined because it is virtual. If I inline it, I get a factor of 2. I guess you could get another factor by being a bit more clever, but even then, it still means Cython is at most one order of magnitude slower than micro-optimized C++, which I consider quite good.

Kirk said...

Ack. The mention of the Pyd version segfaulting sounds bad. I sure hope that's not my fault. :-)

I suspect that the D version's relative slowness has something to do (at least in part) with Pyd's somewhat awful handling of arrays. More low-level and verbose, but possibly faster, code could be written to compensate for this if that is indeed the problem. However, this is not a very attractive solution. (Optimally, Pyd should be capable of directly pointing D arrays at numpy arrays, but this is not actually implemented.)

Cython is a neat piece of work. I am not surprised at all to see it come in first in this benchmark.