Hi! This is more of a little excursion than a 'true' issue, but it's about a technique which I've found useful and would like to share. The occasion is that I'm extending my library zimt to use highway's foreach_target mechanism.
My first remark - before I start out on the issue proper - is about this mechanism. I knew it was there and thought it might be a good idea to use it, but the documentation was thin and I had a working solution already. Before I turned my attention to zimt again this autumn, I did a lot of reading in the SIMD literature, and I also decided to have a closer look at highway's foreach_target mechanism. Lacking extensive documentation, I sat down and read the code. Only then did I realize just how well-thought-out and useful it actually is. Yes, using it is slightly intrusive to the client code, but you've really done a good job of hiding the complexity and making it easy to 'suck' code into the SIMD-ISA-specific nested namespaces and dispatch to it. But here I do have some criticism - to figure that out I had to read and understand the code! It would have been much easier had there been some sort of technical outline, paper, or similar to explain the concept. This criticism goes beyond this specific topic - I think you'd be well advised to improve the documentation in order to address a wider user base.
My first step in introducing highway's multi-ISA capability into zimt was to introduce 'corresponding' nested namespaces in both my library's 'zimt' namespace and in the 'project' namespace (let's use this one for user-side code). I had a hard time initially figuring out the namespace scheme, probably because of the name of the central macro, HWY_NAMESPACE. The naming is unfortunate - of course it's a symbol for a namespace, but a name like HWY_SIMD_ISA would have hinted at its semantics rather than its syntax. With the namespaces set up, I could use foreach_target.h and dynamic dispatch. But I found the way of introducing the ISA-specific code via free functions verbose, so I looked for a way to make this more concise and manageable. I 'bent' some of the code I used in lux to the purpose, and this is where I come to 'VFT dispatching'. The concept is quite simple:
- Instead of using free functions, I work with virtual member functions in a 'dispatch' object
- The dispatch object's base class has declarations of these as pure virtual member functions
- In each ISA-specific nested namespace I inherit from the dispatch base class and provide the implementations
- I have a 'get_dispatch' routine (using highway's dynamic dispatch) which yields a dispatch base class pointer, which does in fact point to the derived class with the ISA-specific implementations
So this is where the 'VFT' in 'VFT dispatching' comes from: it uses the virtual function table of a class with virtual functions. A call through the base class pointer is guaranteed to reach the override in the derived class; in practice, compilers implement this by giving the VFTs of all classes derived from the base the same layout (otherwise the mechanism could not function). What do I gain? The base class pointer is a uniform handle to a - possibly large - set of functions I want to keep ISA-specific versions of. Dispatching is as simple as calling through the dispatcher base class pointer, so once I have obtained it, it serves as a conduit:
// inside code submitted to foreach_target I have:
struct _dispatch
: public dispatch // inherit from dispatch base class
{
  ...
} ;
static const _dispatch local_dispatch ; // instantiate the derived class
// install the ISA-specific fetch routine; the name has to match
// the one passed to HWY_EXPORT further down
const dispatch * const _get_dispatch()
{
  return & local_dispatch ; // implicitly converted to the base class pointer, as declared
}
// in the main body of code (in HWY_ONCE) I use highway's dynamic dispatch:
HWY_EXPORT(_get_dispatch);
const dispatch * const get_dispatch()
{
  return HWY_DYNAMIC_DISPATCH(_get_dispatch)() ;
}
// with a member function 'foo' in the dispatcher, I can now dispatch like this:
auto dp = get_dispatch() ;
bar = dp->foo() ; // calls the ISA-specific variant
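To make the pattern easier to try out in isolation, here is a minimal single-file sketch which does not use highway at all. The namespaces 'isa_a' and 'isa_b' and the enum 'isa_t' are made up for illustration: they merely stand in for the nested namespaces which foreach_target generates and for the choice which HWY_DYNAMIC_DISPATCH makes at run time.

```cpp
// Dispatch base class: one pure virtual member function per dispatched routine.
struct dispatch
{
  virtual int foo() const = 0 ;
} ;

// Hypothetical stand-ins for two ISA-specific nested namespaces.
namespace isa_a
{
  struct _dispatch
  : public dispatch
  {
    int foo() const override { return 1 ; } // 'ISA a' variant
  } ;
  static const _dispatch local_dispatch {} ;
  const dispatch * _get_dispatch() { return & local_dispatch ; }
}

namespace isa_b
{
  struct _dispatch
  : public dispatch
  {
    int foo() const override { return 2 ; } // 'ISA b' variant
  } ;
  static const _dispatch local_dispatch {} ;
  const dispatch * _get_dispatch() { return & local_dispatch ; }
}

// Hypothetical selector, standing in for highway's dynamic dispatch.
enum class isa_t { a , b } ;

const dispatch * get_dispatch ( isa_t isa )
{
  return ( isa == isa_t::a ) ? isa_a::_get_dispatch()
                             : isa_b::_get_dispatch() ;
}
```

In this sketch, get_dispatch ( isa_t::b ) -> foo() goes through the base class pointer and ends up in isa_b's override. The calling code never names the derived class.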
This is more or less it, with one more little twist which I also found useful. In a first approach, I wrote out the declaration of the pure virtual member function in the base class, and again the declaration (now no longer pure) in the derived, ISA-specific, class. This is error-prone, so I now use an interface header, introducing the member functions via a macro. In a header 'interface.h' I put macro invocations only:
// register int dummy ( float z ) for dispatch:
ZIMT_REGISTER(int, dummy, float z)
Then I can #include this header into the class declarations, #defining the macro differently:
// the dispatch base class is coded like this:
struct dispatch
{
  #define ZIMT_REGISTER(RET,NAME,...) virtual RET NAME ( __VA_ARGS__ ) const = 0 ;
  #include "interface.h"
  #undef ZIMT_REGISTER
} ;
// the ISA-specific derived class (inside code submitted to foreach_target):
struct _dispatch
: public dispatch
{
  #define ZIMT_REGISTER(RET,NAME,...) RET NAME ( __VA_ARGS__ ) const ;
  #include "interface.h"
  #undef ZIMT_REGISTER
} ;
// followed, inside the same nested namespace, by the definition(s)
int _dispatch::dummy (float z) const
{
  return 23 ;
}
This ensures that the declarations are consistent. For the actual implementation, the signature has to be written out once again, but since there is a declaration, providing a definition with a different signature is an error. Coding the implementation with the signature 'in sight' is advisable anyway - especially when the argument list becomes long. The 'interface.h' header provides a good reference to the set of functions using the dispatcher, and additional dispatch-specific functionality can be coded for the lot. I think it makes a neat addition to VFT dispatching.
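For experimenting in a single file, the same idea can be sketched with a list macro instead of a separate header. This is a variation, not what zimt does: the macro ZIMT_INTERFACE and the second function 'twice' are made up for illustration, and I've added 'override' in the derived class so that the compiler also rejects a declaration which doesn't match the base class.

```cpp
// Hypothetical single-file stand-in for 'interface.h': the list of
// dispatched functions, each handed to whatever macro is passed in as X.
#define ZIMT_INTERFACE(X) \
  X(int, dummy, float z) \
  X(float, twice, float z)

// Base class: every registered function becomes a pure virtual member.
struct dispatch
{
  #define ZIMT_REGISTER(RET,NAME,...) virtual RET NAME ( __VA_ARGS__ ) const = 0 ;
  ZIMT_INTERFACE(ZIMT_REGISTER)
  #undef ZIMT_REGISTER
} ;

// Derived class: the same list, now producing overriding declarations.
struct _dispatch
: public dispatch
{
  #define ZIMT_REGISTER(RET,NAME,...) RET NAME ( __VA_ARGS__ ) const override ;
  ZIMT_INTERFACE(ZIMT_REGISTER)
  #undef ZIMT_REGISTER
} ;

// The definitions still spell out the signature once more.
int _dispatch::dummy ( float /* z */ ) const
{
  return 23 ;
}

float _dispatch::twice ( float z ) const
{
  return z + z ;
}
```

Adding a function to the list without providing a definition shows up as a link error as soon as _dispatch is instantiated, so the list and the implementations can't silently drift apart.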
To wrap up, I'd like to point out that this mechanism is generic and can be used to good effect for all sorts of dispatches. If appropriate derived dispatch classes are coded along with a mechanism to pick a specific one, it can function quite independently of highway's dispatch. It can also be used to 'pull in' code which doesn't even use highway - e.g. code with vector-friendly small loops ('goading') relying on autovectorization, which will still benefit from being re-compiled several times with ISA-specific flags. This can be done with highway's foreach_target or by setting up separate TUs with externally supplied ISA-specific compiler flags - the latter is what I currently do in lux, but it requires quite some 'scaffolding' code in cmake.
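As a rough sketch of the 'no highway' case: the code below stamps the same plain loop into two namespaces with a macro and picks one via a flag. All names here (DEFINE_VARIANT, variant_lo, variant_hi, the 'axpy' member and the bool argument) are assumptions made for the example. In a single file both variants compile to the same code, of course. The benefit only materializes when each variant is compiled with its own ISA-specific flags, e.g. in separate TUs.

```cpp
#include <cstddef>

struct dispatch
{
  // y[i] += a * x[i] for all i in [0, n)
  virtual void axpy ( float a , const float * x ,
                      float * y , std::size_t n ) const = 0 ;
} ;

// Hypothetical macro stamping the same plain C++ loop into a namespace.
// In a real build, each variant would sit in a TU with its own ISA flags.
#define DEFINE_VARIANT(NS) \
namespace NS \
{ \
  struct _dispatch \
  : public dispatch \
  { \
    void axpy ( float a , const float * x , \
                float * y , std::size_t n ) const override \
    { \
      for ( std::size_t i = 0 ; i < n ; i++ ) \
        y [ i ] += a * x [ i ] ; \
    } \
  } ; \
  static const _dispatch local_dispatch {} ; \
  const dispatch * _get_dispatch() { return & local_dispatch ; } \
}

DEFINE_VARIANT(variant_lo)
DEFINE_VARIANT(variant_hi)

// Hypothetical selection mechanism: the caller states which variant it wants.
// A real program would query the CPU instead.
const dispatch * get_dispatch ( bool use_hi )
{
  return use_hi ? variant_hi::_get_dispatch()
                : variant_lo::_get_dispatch() ;
}
```

The loop body is deliberately simple, with no dependencies between iterations, which is the kind of code an autovectorizer tends to handle well.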
Comments welcome! I hope you find this useful - I intended to share useful bits here every now and then and it's been a while since the last one (about goading), but better late than never. If you're interested, you can have a peek into zimt's new multi_isa branch, where I have a first working example using the method (see linspace.cc and driver.cc in the examples section). If you don't approve of my intruding into your issue space, let me know.