At Intel Labs, we’re working on a hybrid compiler/library approach to high-performance code generation, as presented in the TPP paper.
We use a Tile dialect to decompose ML “operations” into tile operations, then re-compose them in a hardware-efficient way (blocking/tiling/fusing), and finally call hand-written micro-kernel libraries.
This scales well on CPUs, because writing small tile kernels is easier than writing one large kernel for every combination of high-dimensional shapes, and calling micro-kernel libraries is efficient. But on GPUs, this is not so straightforward.
Essentially, for GPUs (and similar devices) you need to re-compose the micro-kernels into larger kernels that offload the whole computation, taking advantage of the hardware threads and units inside the larger kernel. That needs cooperation between the compiler and micro-kernel library writers, ABI agreements for register reuse, etc.
After discussing with different teams inside and outside of Intel, on forums and at the MLIR Hackathon in Edinburgh, we have a proposal for a tile dialect that could work for both types of devices (and we hope many more). However, the proposal isn’t precise, because there is no “right way” of doing it, and either way, we’ll need buy-in from the community.
This RFC’s intent is to gather which direction other groups working on the same problems feel is the most productive, so we’d welcome feedback from everyone here.
Rationale
The core idea is that high-level ML and HPC problems are described in “problem space” while efficient hardware implementation is written in “architecture space”. These are already different for a single problem and a single architecture, let alone multiple problems on different architectures.
The Tile dialect’s main objective is to allow seamless decomposition and re-composition of these problems as operations on tiles. A tile could be different for CPUs and GPUs, and even across different CPUs, but it’s usually a 2D structure, because you usually have N registers with M lanes each to represent your inner computation. Broadcast and reduction operations change that a bit, but that’s irrelevant for the current discussion.
Most importantly, tile operations can be executed in parallel by the hardware. For example, a reducing GEMM can be considered a tile operation if it reduces the K dimension, and it can then be fused with a bias add later on.
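As a rough pseudo-IR sketch of that example (op names and syntax are purely illustrative, not a concrete proposal):

```mlir
// A batch-reduce GEMM tile op: reduces the K/batch dimension into a single
// 32x32 accumulator tile; a bias add is then fused on the same tile.
// Op names ("tile.brgemm", "tile.add") and syntax are illustrative only.
tile.brgemm ins(%A : memref<8x32x32xf32>, %B : memref<8x32x32xf32>)
            outs(%C : memref<32x32xf32>)
tile.add    ins(%bias : memref<32x32xf32>)
            outs(%C : memref<32x32xf32>)
```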
In a nutshell:
- Each Tile operation represents a computation on a 2D “tile”
  - Element-wise, matmul, broadcast, reduction, specialized (e.g. ReLU), etc.
- Allow decomposition from “problem space” into parallel tile ops
- Allow re-composition to “architecture space” and into larger kernels (CPU/GPU)
- Choosing the tile sizes, loop order and fusion opportunities is a job for the compiler
- Implementing the most efficient micro-kernel for a specific architecture is a job for the library/vectorizer
- Compiler and libraries need to talk to each other to converge to a globally optimal solution
Current State
We have created a TPP tile dialect and an XSMM micro-kernel dialect to represent these concepts.
We don’t claim they are the optimal way to solve either problem, but they do allow us to work on the problems and get good performance. We seek consensus on what the best approach is, so that we can keep pushing the community towards the best way forward, not just one that “works for us”.
Our current prototype has the following path:
- Lower StableHLO/Torch/TCP into [linalg+arith/math]
- Tile into [SCF and linalg+arith/math] (decomposition)
- Lift to [SCF and TPP] and fuse (re-composition)
- Lower to XSMM/OpenMP and LLVM calls
On GPUs, the last step would instead need to use GPU threads and bundle the whole parallel loop into a kernel to offload. We’re working on a prototype for that now.
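A much-simplified sketch of the CPU path on a single GEMM, in pseudo-IR (shapes are arbitrary, and the TPP/XSMM syntax is only indicative of the shape of the final IR):

```mlir
// (a) Ingress: linalg named/generic ops on full tensors.
%0 = linalg.matmul ins(%A, %B : tensor<128x512xf32>, tensor<512x256xf32>)
                   outs(%C : tensor<128x256xf32>) -> tensor<128x256xf32>

// (b) After tiling and lifting (decompose, then re-compose): parallel SCF
//     loops over 32x32 tiles, with a named tile op (TPP) in the body.
scf.forall (%i, %j) in (4, 8) {
  // ... extract the (%i, %j) tiles %tA, %tB, %tC of A, B and C here ...
  tpp.gemm ins(%tA, %tB : memref<32x512xf32>, memref<512x32xf32>)
           outs(%tC : memref<32x32xf32>)
}

// (c) Lowering: each tile op becomes a dispatch + invoke pair into the XSMM
//     micro-kernel library, and the parallel loop becomes OpenMP/LLVM.
%kernel = xsmm.gemm.dispatch [32, 32, 512] flags = (none) data_type = f32
xsmm.gemm(data_type = f32, %kernel, %tA, %tB, %tC)
```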
The main problems with this approach are:
- linalg is great for tiling and fusing, but it mandates strongly nested loops, which hinders fusion at different nest levels, for example fusing an element-wise operation after the last reduction of a GEMM tile.
- linalg+arith/math is not helpful for pattern-matching (re-composition), so we need to lift from linalg+arith/math-on-scalars to named tile ops (our TPP dialect).
- The end result of a tiled linalg is a set of SCF loops (for, forall, parallel) with more linalg inside, so we have to go to SCF anyway.
- Lowering to linalg too early means forcing a particular implementation detail that may not be optimal on the target architecture, because linalg uses scalar arith/math inside its regions. For example, a ReLU can be lowered to generic { maxf i, 0 } or generic { cmp; sel }, when the target may have a better implementation for one or the other, or neither.
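For illustration, the two ReLU lowerings from the last point look roughly like this (simplified to a 2D element-wise generic):

```mlir
#map = affine_map<(d0, d1) -> (d0, d1)>

// Variant 1: max with zero.
%relu = linalg.generic {indexing_maps = [#map, #map],
                        iterator_types = ["parallel", "parallel"]}
    ins(%in : tensor<32x32xf32>) outs(%init : tensor<32x32xf32>) {
  ^bb0(%a: f32, %out: f32):
    %zero = arith.constant 0.0 : f32
    %max = arith.maxf %a, %zero : f32
    linalg.yield %max : f32
} -> tensor<32x32xf32>

// Variant 2: compare-and-select, computing the same thing.
%relu2 = linalg.generic {indexing_maps = [#map, #map],
                         iterator_types = ["parallel", "parallel"]}
    ins(%in : tensor<32x32xf32>) outs(%init : tensor<32x32xf32>) {
  ^bb0(%a: f32, %out: f32):
    %zero = arith.constant 0.0 : f32
    %cmp = arith.cmpf ogt, %a, %zero : f32
    %sel = arith.select %cmp, %a, %zero : f32
    linalg.yield %sel : f32
} -> tensor<32x32xf32>
```

A pattern matcher has to recognize both (and any other equivalent body) as “ReLU” to pick the best implementation for the target.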
To avoid lowering and lifting, and to be able to use SCF loops from the beginning, we could have an extension of a tensor dialect with named ops (like TCP) that gets tiled directly to tile ops where possible, and to linalg generics where not.
This uses the “best of both worlds”, but needs invasive changes to many upstream dialects.
This alternative process (with a tensor-ops dialect) would be as follows (with a pseudo-IR sketch after the list):
- Lower StableHLO/Torch/TCP into [tensor ops on ND-tensors and linalg+arith/math]
- Tile into [SCF and tensor ops on tiles (linalg tiles separately)] (decomposition)
- Optional lifting from tiled linalg if needed
- Reorg to [SCF and tensor ops on tiles] and fuse (re-composition)
- Lower to kernel / micro-kernel calls, vectorize, group for GPUs, etc.
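To make the difference concrete, a rough sketch of this path on a ReLU, where tensor_ops is a placeholder name for whatever dialect ends up hosting the named ops:

```mlir
// (a) Ingress: a named op on an ND tensor; no scalar region to pattern-match.
%0 = tensor_ops.relu %arg0 : tensor<128x256xf32> -> tensor<128x256xf32>

// (b) After tiling (decomposition): the same named op, now on 2D tiles inside
//     parallel SCF loops. No lowering to scalar arith/math, no lifting back.
//     Tile extraction/insertion is elided for brevity.
scf.forall (%i, %j) in (4, 8) {
  %t = tensor_ops.relu %tile : tensor<32x32xf32> -> tensor<32x32xf32>
}
```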
Reshuffle of Dialects
For a while, there have been discussions about removing tensor and memref support from arith and math, and about linalg gaining too many named ops, which was not the original design.
Recently, we agreed on a TCP dialect that converges StableHLO, Torch and other ingress dialects. On its own, it can serve as a common ingress, but it may prove hard to justify yet another partial-conversion route if it does not provide value beyond convergence.
TOSA has a lot of the required functionality (ND shapes, named ops, strict semantics), but it’s not an upstream dialect in the sense that its development is controlled by a separate group that needs to certify every change. In essence, it serves as a very stable exchange format, not so much as an upstream development and performance dialect.
So, to avoid lowering and raising IR, we’d need a dialect that:
- Can represent named operations on N-dimensional tensors/memrefs
- Implements TilingInterface, implying specific affine maps and parallel/reduction iteration
- Has passes that can easily match specific dimensions as tiles (0~2D)
- Has the same semantics as arith and math, but on tensors
- Has additional operations that are common to ML/HPC problems
- Has a direct mapping between ingress dialects and tile ops
- For everything else we use linalg generics
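To illustrate the second point: a named element-wise op would implicitly carry the indexing maps and iterator types that a linalg.generic spells out explicitly, which is exactly what the tiling interface needs. Again, tensor_ops is only a placeholder name:

```mlir
// Named op: maps and iterator types are implied by the op's semantics.
%0 = tensor_ops.add %a, %b : tensor<128x256xf32>

// The explicit form it implies, and which tiling/fusion can exploit.
#id = affine_map<(d0, d1) -> (d0, d1)>
%1 = linalg.generic {indexing_maps = [#id, #id, #id],
                     iterator_types = ["parallel", "parallel"]}
    ins(%a, %b : tensor<128x256xf32>, tensor<128x256xf32>)
    outs(%init : tensor<128x256xf32>) {
  ^bb0(%x: f32, %y: f32, %out: f32):
    %s = arith.addf %x, %y : f32
    linalg.yield %s : f32
} -> tensor<128x256xf32>
```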
Potential Solutions
TCP
The most appealing option to me is to use TCP. It already has most of those features, but not yet the whole intention, so it would need some re-thinking if we want to go that way. But creating another Tensor dialect would conflict with the existing tensor dialect and lower the appeal of TCP.
This is what I understood initially, but it seems the group implementing it had different ideas. If this post helps convince them to extend the semantics, great. If not, we can make do with some other dialect.
TensorArith / TensorMath
Basically, arith and math for tensor and memref. I don’t like this.
Or bundle them into a single “TCP but for tiling”, if the TCP folks don’t think it’s a good idea to use TCP for tiling. This reduces the appeal of TCP.
Tensor
We could also extend tensor to have all arith and math ops and more (ML/HPC) and bloat it a lot, which would also mean we need the same for memref (vector too?), so this to me is the least appealing option.
It would create a mismatch between arith+math on scalar+vector on one side, and tensor and memref with shape ops plus a whole set of arithmetic, maths, ML and HPC ops on the other. To me, tensor, memref and vector are type dialects.
Arith / Math / ML / HPC
Formalise arith and math, and expand into new ML and HPC dialects on scalars and tensors. This may not be possible.
This is a long road and has been rejected in the past a few times, because the semantics isn’t consistent across architectures in the way “higher-level ML operations” are.
We could just get away with ML and HPC here, the same as TensorMath above.
TPP
Continue on our current route with a “tile only” dialect, and continue to lift from linalg + scalar ops into tile operations. This is working so far, but we need an increasing number of pattern matchers to find an increasing number of patterns in the wild.
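Concretely, this lifting pattern-matches a tiled linalg.generic whose body is a known idiom and rewrites it to a named tile op, roughly like this (tpp.relu syntax is illustrative):

```mlir
// Before lifting: one of the ReLU generics shown earlier, already tiled to a
// 32x32 tensor.
#id = affine_map<(d0, d1) -> (d0, d1)>
%0 = linalg.generic {indexing_maps = [#id, #id],
                     iterator_types = ["parallel", "parallel"]}
    ins(%in : tensor<32x32xf32>) outs(%init : tensor<32x32xf32>) {
  ^bb0(%a: f32, %out: f32):
    %zero = arith.constant 0.0 : f32
    %max = arith.maxf %a, %zero : f32
    linalg.yield %max : f32
} -> tensor<32x32xf32>

// After lifting: a single named tile op that maps 1:1 onto a micro-kernel.
%1 = tpp.relu(%in : tensor<32x32xf32>) -> tensor<32x32xf32>
```

Every new idiom found in the wild (different scalar bodies, different map permutations) needs another matcher, which is what we would like to avoid.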
Better Ideas?
As I said at the start, these are just some ideas (and not even great ones). I may be missing something completely obvious, so I’m seeking honest feedback.
From the people I talk to and the code I read elsewhere, people are more or less struggling with the same problems we are, so if someone has a better solution, a lot of people will be very happy!
@mehdi_amini @sanjoyd @nicolasvasilache @jpienaar @stellaraccident @TobiasGrosser