At Intel Labs, we’re working on a hybrid compiler/library approach to high-performance code generation, as presented in the TPP paper.

We use a Tile dialect to decompose ML “operations” into tile operations and then re-compose them in a hardware-efficient way (blocking/tiling/fusing), finally calling hand-written micro-kernel libraries.

This scales well on CPUs because writing small tile kernels is easier than writing one large kernel for every combination of high-dimensional shapes, and calling micro-kernel libraries is efficient. But on GPUs, this is not so trivial.

Essentially, for GPUs (and similar devices) you need to re-compose the micro-kernels into larger kernels to offload the whole computation, taking advantage of the hardware threads and units inside the larger kernel. That needs cooperation between compiler and micro-library writers, ABI agreements for register reuse, etc.

After discussing with different teams inside and outside of Intel, on forums and at the MLIR Hackathon in Edinburgh, we have a proposal for a tile dialect that could work for both types of devices (and we hope many more). However, the proposal isn’t precise, because there is no “right way” of doing it, and either way, we’ll need buy-in from the community.

The intent of this RFC is to learn which direction other groups working on the same problems feel is the most productive. So we’d welcome feedback from everyone here.

Rationale

The core idea is that high-level ML and HPC problems are described in “problem space” while efficient hardware implementation is written in “architecture space”. These are already different for a single problem and a single architecture, let alone multiple problems on different architectures.

The Tile dialect’s main objective is to allow seamless decomposition and re-composition of these problems as operations on tiles. A tile could be different for CPUs and GPUs and even across different CPUs, but it’s usually a 2D structure because you usually have N registers with M lanes each to represent your inner computation. Broadcast and reduction operations change that a bit but that’s irrelevant for the current discussion.

Most importantly, tile operations can be executed in parallel by the hardware. For example, a reducing GEMM can be considered a tile operation if it reduces the K dimension, and it can then be fused with a bias add later on.

In a nutshell:

  • Each Tile operation represents a computation on a 2D “tile”
  • Element wise, matmul, broadcast, reduction, specialized (ex. ReLU), etc.
  • Allow decomposition from “problem space” into parallel tile ops
  • Allow re-composition to “architecture space” and into larger kernels (CPU/GPU)
  • Choosing the tile sizes, loop order, fusion opportunities is a job for the compiler
  • Implementing the most efficient micro-kernel for a specific architecture is a job for the library/vectorizer
  • Compiler and libraries need to talk to each other to converge to a global optimal solution
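
To make the decomposition and re-composition idea concrete, here is a minimal sketch in plain Python. It is not taken from the TPP code base: the function names (`tile_matmul_acc`, `tile_bias_add`, `tiled_gemm_bias`) and the tile sizes are hypothetical, and the inner Python loops only stand in for what would be a hand-written micro-kernel call. It shows a GEMM decomposed into 2D tile ops that reduce over K, with a bias add fused after the last reduction of each tile.

```python
# Illustrative stand-ins for tile ops; names and structure are assumptions,
# not the TPP or XSMM dialect API.

def tile_matmul_acc(a, b, acc):
    """acc[m][n] += sum_k a[m][k] * b[k][n] on one 2D tile (reduces K)."""
    for m in range(len(a)):
        for n in range(len(b[0])):
            for k in range(len(b)):
                acc[m][n] += a[m][k] * b[k][n]

def tile_bias_add(acc, bias_row):
    """Element-wise tile op: add one bias value per column of the tile."""
    for row in acc:
        for n in range(len(row)):
            row[n] += bias_row[n]

def slice2d(x, r0, r1, c0, c1):
    """Copy the 2D sub-block x[r0:r1][c0:c1]."""
    return [row[c0:c1] for row in x[r0:r1]]

def tiled_gemm_bias(A, B, bias, tm, tn, tk):
    """C = A @ B + bias, computed tile by tile with tile sizes (tm, tn, tk)."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # The (m0, n0) iterations are independent, so they could run in parallel.
    for m0 in range(0, M, tm):
        for n0 in range(0, N, tn):
            m1, n1 = min(m0 + tm, M), min(n0 + tn, N)
            acc = [[0.0] * (n1 - n0) for _ in range(m1 - m0)]
            # Reduction over K stays sequential inside the tile.
            for k0 in range(0, K, tk):
                k1 = min(k0 + tk, K)
                tile_matmul_acc(slice2d(A, m0, m1, k0, k1),
                                slice2d(B, k0, k1, n0, n1), acc)
            # Fused element-wise op, applied after the last K reduction.
            tile_bias_add(acc, bias[n0:n1])
            for i, row in enumerate(acc):
                C[m0 + i][n0:n1] = row
    return C
```

In this sketch the choice of `tm`, `tn`, `tk` and the loop order plays the role of the compiler, while the bodies of `tile_matmul_acc` and `tile_bias_add` play the role of the library.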

Current State

We have created a TPP tile dialect and an XSMM micro-kernel dialect to represent these concepts.

We don’t claim they are the optimal way for either problem, but they do allow us to work on the problems and get good performance out of them. We seek consensus on what the best approach is, so that we can continue pushing the community for the best way forward, not just one that “works for us”.

Our current prototype has the following path:

  • Lower StableHLO/Torch/TCP into [linalg+arith/math]
  • Tile into [SCF and linalg+arith/math] (decomposition)
  • Lift to [SCF and TPP] and fuse (re-composition)
  • Lower to XSMM/OpenMP and LLVM calls

On GPUs, the last step would instead need to use GPU threads and bundle the whole parallel loop into a kernel to offload. We’re working on a prototype for that now.

The main problems with this approach are:

  1. linalg is great for tiling and fusing, but it mandates strongly nested loops, which hinders fusion at different nest levels, for example, an element-wise op after the last reduction of a GEMM tile.
  2. linalg+arith/math is not helpful for pattern-matching (re-composition), so we need to lift from linalg+arith/math-on-scalars to named tile ops (our TPP dialect).
  3. The end result of a tiled linalg is a set of SCF loops (for, forall, parallel) and more linalg inside, so we have to go to SCF anyway.
  4. Lowering to linalg too early means forcing a particular implementation detail that may not be optimal in the target architecture, because linalg uses scalar arith/math inside its regions. For example, a ReLU can be lowered to generic { maxf i,0 } or generic { cmp; sel } when the target has a better implementation for one or the other, or neither.
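
Points 2 and 4 can be illustrated with a small plain-Python sketch. Everything here is hypothetical: the tuple encoding of a scalar body and the `lift_to_named_op` helper are made up for illustration and are not MLIR or TPP APIs. The two functions compute the same ReLU, yet a matcher has to know both spellings to recover the named op.

```python
# Two scalar spellings of the same ReLU, mirroring the two generic bodies
# mentioned above. Plain Python, for illustration only.

def relu_max(x):
    """Corresponds to generic { maxf x, 0 }."""
    return max(x, 0.0)

def relu_cmp_select(x):
    """Corresponds to generic { cmp; sel }."""
    return x if x > 0.0 else 0.0

# Hypothetical encoding of a scalar region as a sequence of (op, operands...)
# tuples; "%0" names the result of the first op in the body.
RELU_BODIES = {
    (("maxf", "x", "0"),),
    (("cmp_gt", "x", "0"), ("select", "%0", "x", "0")),
}

def lift_to_named_op(body):
    """Return "relu" if the scalar body is a known ReLU spelling, else None."""
    return "relu" if tuple(body) in RELU_BODIES else None
```

Every additional spelling that shows up in the wild needs another entry in `RELU_BODIES`, which is the growing pattern-matcher burden described above. Keeping a named `relu` from the start would avoid both the early commitment to one spelling and the lifting step.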

To avoid lowering and lifting, and to be able to use SCF loops from the beginning, we could have an extension of a tensor dialect with named ops (like TCP) that gets tiled directly to tile ops if possible, and linalg generics if not.

This uses the “best of both worlds”, but needs invasive changes to many upstream dialects.

This alternative process (with tensor ops dialect) would be:

  • Lower StableHLO/Torch/TCP into [tensor ops on ND-tensors and linalg+arith/math]
  • Tile into [SCF and tensor ops on tiles (linalg tiles separately)] (decomposition)
  • Optional lifting from tiled linalg if needed
  • Reorg to [SCF and tensor ops on tiles] and fuse (re-composition)
  • Lower to kernel / micro-kernel calls, vectorize, group for GPUs, etc.

Reshuffle of Dialects

For a while there have been discussions about removing tensor and memref support from arith and math, and concerns that linalg is gaining too many named ops, which was not the original design.

Recently we agreed to a TCP dialect that converges StableHLO, Torch and other ingress dialects. On its own, it can serve as a common ingress, but it may prove hard to justify as yet another partial-conversion route if it does not provide value beyond convergence.

TOSA has a lot of the required functionality (ND shapes, named ops, strict semantics) but it’s not an upstream dialect in the sense that its development is controlled by a separate group that needs to certify every change. In essence, it serves as a very stable exchange format, not so much as an upstream development and performance dialect.

So, to avoid lowering and raising IR, we’d need a dialect that:

  • Can represent named operations on N-dimension tensors/memrefs
  • Implements TileInterface, implying specific affine maps and parallel/reduction iteration
  • Has passes that can easily match specific dimensions as tiles (0~2D)
  • Has the same semantics as arith and math, but on tensors
  • Has additional operations that are common to ML/HPC problems
  • Has a direct mapping between ingress dialects and tile
  • For everything else we use linalg generics
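
As a rough illustration of the requirements above, the sketch below models a named op that carries its own parallel/reduction iterator kinds, so a pass can pick tile dimensions without inspecting a scalar body. The `NamedOp` class, the `split_tile_dims` helper and the "innermost parallel dims become the tile" heuristic are all assumptions made for this example, not an existing dialect or interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NamedOp:
    """Hypothetical named tensor op with one iterator kind per loop dimension."""
    name: str
    iterators: tuple  # each entry is "parallel" or "reduction"
    extents: tuple    # loop extent per dimension

# Example: batch matmul iterating over (batch, m, n, k); the sizes are made up.
BATCH_MATMUL = NamedOp(
    "batch_matmul",
    ("parallel", "parallel", "parallel", "reduction"),
    (8, 128, 256, 64),
)

def split_tile_dims(op, max_tile_rank=2):
    """Split loop dims into (outer loops, tile dims, reduction dims).

    Assumed heuristic: the innermost parallel dims (at most max_tile_rank)
    form the tile; the remaining parallel dims stay as outer loops.
    """
    parallel = [i for i, kind in enumerate(op.iterators) if kind == "parallel"]
    tile = parallel[-max_tile_rank:]
    outer = [i for i in parallel if i not in tile]
    reductions = [i for i, kind in enumerate(op.iterators) if kind == "reduction"]
    return outer, tile, reductions
```

For `BATCH_MATMUL` this leaves the batch dimension as an outer loop, takes (m, n) as the 2D tile, and reports k as the reduction. A 1D element-wise op would yield a 1D tile, which is how the 0~2D range could be covered.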

Potential Solutions

TCP

The most appealing to me is to use TCP. It already has most of those features, but not yet the whole intention, so it would need some re-thinking if we want to go that way. But creating another Tensor dialect would conflict with the existing tensor dialect and lower the appeal of TCP.

This is what I understood initially, but it seems the group implementing it had different ideas. If this post helps convince them to extend the semantics, great. If not, we can make do with some other dialect.

TensorArith / TensorMath

Basically arith and math for tensor and memref. I don’t like this.

Or bundle into a single “TCP but for tiling”, if the TCP folks don’t think it’s a good idea to use TCP for tiling. Reduces the appeal of TCP.

Tensor

We could also extend tensor to have all arith and math ops and more (ML/HPC) and bloat it a lot, which would also mean we need the same for memref (vector too?), so this to me is the least appealing one.

It would create a mismatch between arith+math on scalar+vector and tensor and memref with shape ops plus a whole set of arithmetic and maths and ML and HPC. To me, tensor, memref and vector are type dialects.

Arith / Math / ML / HPC

Formalise arith and math, and expand into new dialects for ML and HPC on scalars and tensors. May not be possible.

This is a long road and has been rejected a few times in the past because the semantics aren’t consistent across architectures the way “higher-level ML operations” are.

We could just get away with ML and HPC here, same as TensorMath above.

TPP

Continue on our current route to have a “tile only” dialect, and continue to lift from linalg+scalar into tile operations. This is working so far, but we have an increasing number of pattern matchers, to find an increasing number of patterns in the wild.

Better Ideas?

As I said at the start, these are just some ideas (and not even great ones). I may be missing something completely obvious, so I’m seeking honest feedback.

From the people I talk to, the code I read elsewhere, people are more or less struggling with the same problems we are, so if someone has a better solution, a lot of people will be very happy! :smile:

@mehdi_amini @sanjoyd @nicolasvasilache @jpienaar @stellaraccident @TobiasGrosser

Broadly makes sense to me. As was discussed on the IREE mailing list, the conversation got a little sideways because it picks up what a pipeline like the one you are describing produces (and has some in-tree microkernels that take a speedier path to lowering).

Further building out the machinery to scale such things makes sense to me.

I just wish someone would create a repo and put all of the bits in it for such a thing. The situation we have where some of it is in a branch on torch-mlir, some is in tpp, some bits upstream… Could use improvement. Seems like a mkdir problem.

I’m mostly commenting on this side topic. Is the above understanding still true, given OpenXLA and StableHLO? i.e., when you have OpenXLA and StableHLO, do you even need TCP? When TensorFlow, PyTorch, and JAX have (or will have) stable community-driven paths to StableHLO, “TCP” is effectively StableHLO. I don’t see what goal with the planned TCP isn’t met by StableHLO. I think it’s a great accomplishment that OpenXLA StableHLO happened, and I hope, like many, to see continued efforts to maintain the paths into and out of it.

I might have misunderstood what “Recently” above is – so if there is a post-OpenXLA document/discussion elsewhere, please point me to it to avoid repetition.

With the obvious bias I have on the stablehlo side, I’ll stop short of making a recommendation, but I think this is the right question to be asking over the coming months. This gap in the ecosystem has existed for a long time, and stablehlo looks on a path to fix it. There has also been measurable progress made in extending it, which starts to answer the main anxiety I had about it earlier this year (ie. Is it too stable?)

Replaying conversations from the past, I think that there are three axes of difference:

  1. Governance (LLVM vs Google/OpenXLA evolving in the future)
  2. Placement (MLIR upstream vs other)
  3. Technical (do we need a second level “transformation oriented” dialect at the same abstraction level?)

On the first, OpenXLA is behind but has aspirations. Within OpenXLA, StableHLO is the furthest ahead, pushing the envelope here.

On the second, it has been pretty clear to me that the community is split as to how much upstream MLIR should itself be the place for such development. For myself, I’ve decided that it is not worth the ROI to continue this discussion, and an out of tree ecosystem for domain specific use of MLIR is the way to go. I’d far rather have some more repositories in the world and shuffling over time vs continuing development with the hand brake on trying to define this relationship within MLIR itself.

On the third, there is still debate. Part of the community believes that stablehlo is a serialization dialect suitable only for interchange and part sees it as fine to perform transformations on. I personally have evolved on this viewpoint and am in the second category now. The reasons I’ve evolved are because (a) significant investment has shown this to be a practical reality to live in, and (b) if you look beyond the names at the dialects and layering, you find that the stablehlo dialect is very similar to LLVM-IR in terms of evolution contract (the “stable” part of it actually comes from its serialization layer which interops with the actually “unstable” stablehlo dialect).

I’ve got my own perspective on this, but I think that is a faithful rendering of the debate points.

I think this is all a minor point related to the overall Tile work. That seems like a thing that should exist, but I’d have the same question about just putting it in upstream MLIR as I have with anything.

That’s how I see it too, but honestly I’m not involved in ingress dialects discussions, so it might just be my own bias. That is why I originally thought of TCP as a middle ground between SHLO and Linalg for optimization purposes, but that doesn’t seem to be the current approach.

You got it right, I meant pre-OpenXLA.

Exactly, I don’t intend to start that either. To me, as long as it’s an independent source (ie. I don’t have to import a whole project to get a dialect) I’m ok with it. The governance discussion is more important, but as long as the dialect remains “open”, it should also be “good enough”.

That’s what I really want to talk about.

The “Tile” part of our work is crucial for the kind of transformations we do, but on its own, it overlaps with too many different dialects (from StableHLO, through Linalg, to Arith/Math) and this is what I’m trying to understand.

If there is a desire to fix this middle-ground, we’ll continue to work on our tile dialect downstream while working with the community (MLIR and OpenXLA) to fix the dialect landscape.

If not, we’ll propose a more self-contained Tile dialect that is lifted from Linalg generics for pattern matching, and restrict its semantics to facilitate pattern matching for decomposition, re-composition and fusing.

This matches my original expectation to use TCP as this tensor operation dialect, as I understood StableHLO to be only a serialization dialect.

Now, if StableHLO becomes a transform dialect, it will have to carry information in a way that does not tie itself too early to the target hardware, allows tiling and fusing at tensor level and connects to a micro-kernel dispatch dialect that allows architecture-aware fusion. This is a tall order for a dialect that is also a serialization dialect.

What I mean by “tie itself too early” is the following:

  • All high/mid/low level transforms need to understand the hardware features to pick the right transforms (multi-level cost model).
  • But that cannot be a different pass for each level for each hardware, or we end up with a combination of different compilers.
  • So we need a generic dialect and transforms that take into consideration target features at each level and the possible lowering in the next stage.
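
A minimal sketch of that last bullet, in plain Python: one generic transform that reads a target description, instead of one pass per hardware. The `Target` fields, the example targets and the `pick_tile_shape` rule (rows from the vector registers left after a reserve, columns from the lane count) are hypothetical and only illustrate the shape of the idea, not a real cost model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Target:
    """Hypothetical target description consumed by generic transforms."""
    name: str
    vector_registers: int  # number of vector registers (the "N" of a tile)
    lanes_f32: int         # f32 lanes per register (the "M" of a tile)

def pick_tile_shape(target, reserved_registers=4):
    """Choose a 2D tile shape from target features only.

    Assumed rule: keep a few registers free for operands and temporaries,
    and use the rest as accumulator rows.
    """
    rows = max(1, target.vector_registers - reserved_registers)
    return (rows, target.lanes_f32)

# Made-up targets: the same transform yields different tiles per target.
WIDE = Target("wide_simd_cpu", vector_registers=32, lanes_f32=16)
NARROW = Target("narrow_simd_cpu", vector_registers=16, lanes_f32=8)
```

The same `pick_tile_shape` is reused for both targets, and a higher-level transform could query the same `Target` object for different features, which is the multi-level aspect described above.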

This is close to what LLVM does for code generation, but here, we go from graphs to large ops on basic blocks, to long list of deeply nested loops, to fusion at different nest levels, to a long stream of parallel loops and kernel calls, to potentially large kernel fusions and outlining, etc.

The shape of the IR changes radically with each layer and we’ll need not only different cost models and types of target information, but implement different interfaces and simplify pattern matching at different levels.

Having all that in one dialect is obviously possible, but I personally wouldn’t vouch for it as a first approach.

I agree, but we may have to do both.

If we want to separate concerns, avoid redundancy and clean up the upstream dialects (of which I think there are too many), we need to make sure we keep the minimal support in MLIR proper, and move the rest to individual repositories, either under the LLVM umbrella or not.

(We’d also need a good infrastructure to easily import the right dialects and dependencies, which is not trivial.)

For now, following your point above, we ignore the upstream MLIR mess, extract our dialects (tpp, xsmm) into a separate repository that isn’t tied to tpp-mlir (potentially rename it to tile and ukernel or something), and then we iterate on it and see how it connects to StableHLO from above.

If it becomes clear that StableHLO is getting bloated with too many interfaces, we can propose a tensor operation dialect to cover the gap (on yet-another separate repo). If later on we find a way to either raise Tile or lower StableHLO, we fuse the middle dialect with the right one, etc.

IIUC this is what you were suggesting, right?

Yeah, neither would I. I think that I would build the transformation dialect that you need to support a lowering path vs trying to acquire a global design lock. I just think that “suitable for transformations” is a misnomer and we shouldn’t be shying away from intermediate, utility dialects that aid a particular compilation pathway (like what you are proposing with Tile).

If you’ll let me get wonky for a minute, I think that when a project like MLIR experiences its “big bang” moment and the temperature finally settles enough to produce the initial matter, that first burst of creation is of a different type than what follows. I’m treating the upstream dialects like that first burst: they condensed out of the microwave background and established the early structure of the universe, but what follows is much more incremental and “clumpy”, with change and exchange over time.

I believe this was where my metaphor was headed, yes, but I think I’m advising to not shy away from making an island-project that has everything you need to go from a good point A to a good point B vs focusing on where each internal piece lives. Get customers and integrations, and then look at breaking apart. Along the way, of course, do your best to make the general structure something that we pattern match as “production compiler”. If you build things with good internal separation of concerns, it is easier to identify common pieces later and move them to the right clump.

This is basically the evolution process that StableHLO itself took: mlir-hlo included a whole bunch of stuff, but the mhlo dialect itself clearly had a high value to a significant portion of the ecosystem. So it was forked out, stripped down to the bolts, and had features added to meet the ecosystem requirements that the original project charter was not prepared to take on. And now we have a “clump” that seems positioned to provide some nice value.

Trust me, I know it is hard to think about heading out on such a journey, but we’ve found it valuable.

I absolutely support various sub-communities developing their systems out-of-tree (like OpenXLA), for ROI reasons or any other anchor point.
I am very concerned, however, when these efforts are used as reasons to prevent similar efforts that other community members are interested in driving within MLIR.

Yup, this is what we’re doing in tpp-mlir.

Awesome! I think we’re on the same page here.

Right, this thread’s intent was to get feedback from the community on whether that was a good idea, or whether there was another effort already happening elsewhere that we needed to be aware of.

Yup. As you know, we’re already working with IREE. We’re also working with GPU teams to extend the idea, and we’re getting a minimal GPU pipeline on our side, that later on we want to glue on to something like IREE.

The IREE integration is our way to keep this in check. We found some minor teething issues (that we were already expecting), nothing substantial.

We’ll keep “raising” tile to the point where it borders StableHLO. Once we do, the contact surface will be much clearer.

This thread is exactly the entry point of that community discussion. We can continue on our own until it matures, but we want to participate in upstream MLIR discussions about the impact of our work on the general MLIR ecosystem.

We want to organically change the existing MLIR dialects if the evidence is clear that it’s the right time / direction. We have already added features to linalg and tensor in that respect and we’ll continue working with the community to improve more.

But now that we’re thinking of having potentially higher dims than just tile (to do tile-and-fuse), we’re reaching the arith / math on tensors discussion that may have matured upstream to find a better path.

If it does, we want to be part of that discussion. If not, we can continue downstream until there’s more evidence / will, or until we just merge a new dialect that everyone wants anyway.

If we substitute “LLVM” for “MLIR” then I basically agree with you. I just don’t see MLIR itself as an island that should be expanded indefinitely in isolation as a first order principle.

I do think it is reasonable for the whole community to make any decision when it comes to scope/charter expansion:

  • There is an unmet need that is sufficiently concrete and supportable as a direct development effort under LLVM.
  • There is an existing part of the ecosystem that would be valuable to bring into LLVM and continue development.
  • There is an existing part of the ecosystem that we think would be better developed within LLVM and we should set out to do so.

People will always disagree on those things in any given scenario, and I think it is important that when there is a conflict of interest by a party to the discussion, it is well acknowledged so that consensus of the community can balance that into its evaluation.

No party should be able to block something because it exists elsewhere. But it is also reasonable for the community as a whole to decide that on balance, duplication may not be warranted at a given point. In such cases, there will always be conflicts of interest, and I think it is important that those are taken into account vs silenced. In fact, some of the most valuable perspectives on such things probably come from people with a conflict (because users of an alternative are in the best position to offer first hand perspectives).

I’m being explicit not to argue but because, aside from that very first point, I think we agree and I’m cross checking vs debating.

Evolving upstream abstractions with downstream experience is great. I don’t think there should be a high bar on such evolution. I think it is the main way that existing things move forward.
