Adding support for ppc64le #892

Merged
merged 46 commits into from
Feb 5, 2025

WillTrojak

This adds F32 and F64 support for PowerPC LE 64, by building on some initial work by @breuera.

Features:

  • Support for VSX instructions seen on power9 and older.
  • Support for MMA instructions on power10+.
  • FP32 and FP64
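For readers unfamiliar with the MMA "ger" instructions mentioned here, the following is a minimal scalar sketch of what an FP32 rank-1 update accumulates, written as plain C89 so it runs anywhere. It is not code from this PR: the `sketch_` names are hypothetical, and the mapping of the 4x4 accumulator onto actual registers is deliberately not modelled.

```c
#include <assert.h>

/* Scalar model of one FP32 rank-1 (outer product) update on a 4x4
   accumulator: acc[i][j] += x[i] * y[j]. Register layout is not modelled. */
static void sketch_ger_f32(float acc[4][4], const float x[4], const float y[4]) {
  int i, j;
  for (i = 0; i < 4; ++i) {
    for (j = 0; j < 4; ++j) {
      acc[i][j] += x[i] * y[j];
    }
  }
}

/* A 4x4 micro-kernel is then k chained rank-1 updates. Assumed layout
   (hypothetical): a holds k columns of 4 floats, b holds k rows of 4 floats. */
static void sketch_gemm_4x4_f32(float acc[4][4], const float* a, const float* b, int k) {
  int l;
  assert(k >= 0);
  for (l = 0; l < k; ++l) {
    sketch_ger_f32(acc, a + 4 * l, b + 4 * l);
  }
}
```

On power10 each call to `sketch_ger_f32` would correspond roughly to a single accumulate instruction, which is why the k loop dominates the generated kernel.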

Not currently supported:

  • BF16/FP16, although some of the initial steps have been taken. I don't currently know the best way to pack the vectors required for power10 ger instructions.
  • Transpose operation as I'm not sure of the best strategy to take.

@WillTrojak
Author

@FreddieWitherden @joseemoreira @MichaelMeissner you might like to take a look at this.

@WillTrojak WillTrojak changed the title from "Feature/power" to "Adding support for ppc64le" Aug 16, 2024
@FreddieWitherden
Contributor

In the code generator there are quite a few large switch/case blocks. These can be eliminated by repurposing some of the unused bits in the instruction value to indicate the type of the instruction. Then checking the form of an instruction becomes a simple bitwise operation rather than a massive switch/case block. This is a trick which is used by both the x86 and ARM generators to improve performance and simplify the code.
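A minimal sketch of the tagging idea described above, under stated assumptions: the `SKETCH_` names, the form numbering, and the choice of the target-register field (bits 21-25 counting from the least significant bit) as the tag location are all hypothetical and are not taken from the x86, ARM, or ppc64le generators. The two opcode templates are given for illustration and should be checked against the Power ISA before reuse. The sketch also assumes `unsigned int` is 32 bits wide.

```c
#include <assert.h>

/* Hypothetical form tags; the numbering is arbitrary. */
#define SKETCH_FORM_X   0x1u /* X-form, e.g. indexed VSX loads and stores */
#define SKETCH_FORM_XX3 0x2u /* XX3-form, e.g. VSX arithmetic */

/* The tag lives in the target-register field, which is zero in a template. */
#define SKETCH_FORM_SHIFT 21
#define SKETCH_FORM_MASK (0x1Fu << SKETCH_FORM_SHIFT)
#define SKETCH_TAG(opcode, form) ((opcode) | ((unsigned int)(form) << SKETCH_FORM_SHIFT))

/* Illustrative templates with all operand fields zero. */
#define SKETCH_LXVD2X    SKETCH_TAG(0x7C000698u, SKETCH_FORM_X)
#define SKETCH_XVMADDASP SKETCH_TAG(0xF0000208u, SKETCH_FORM_XX3)

/* Checking the form is a mask and a shift, with no switch/case. */
static unsigned int sketch_form(unsigned int instr) {
  return (instr & SKETCH_FORM_MASK) >> SKETCH_FORM_SHIFT;
}

/* Remove the tag again before operands are filled in. */
static unsigned int sketch_strip(unsigned int instr) {
  return instr & ~SKETCH_FORM_MASK;
}

/* Encode an XX3-form instruction with VSX registers t, a, b in 0..63. */
static unsigned int sketch_encode_xx3(unsigned int instr, unsigned int t,
                                      unsigned int a, unsigned int b) {
  unsigned int word;
  assert(sketch_form(instr) == SKETCH_FORM_XX3);
  assert(t < 64 && a < 64 && b < 64);
  word = sketch_strip(instr);
  word |= (t & 0x1Fu) << 21;     /* T: low five bits of the target */
  word |= (a & 0x1Fu) << 16;     /* A */
  word |= (b & 0x1Fu) << 11;     /* B */
  word |= ((a >> 5) & 1u) << 2;  /* AX: high bit of A */
  word |= ((b >> 5) & 1u) << 1;  /* BX: high bit of B */
  word |= (t >> 5) & 1u;         /* TX: high bit of the target */
  return word;
}
```

With this layout one encoder per form can serve every instruction of that form, and passing an instruction of the wrong form trips the assertion instead of silently emitting a bad word.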

Also, please check everything compiles as C89 (no // comments and all variables must be defined at the start of a block).
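As a small illustration of the two C89 constraints named above (a sketch only; `sketch_sum` is a hypothetical function, not part of the PR):

```c
#include <assert.h>

/* Sums the first n entries of v; written to compile as C89. */
static double sketch_sum(const double* v, int n) {
  double acc = 0.0; /* every declaration precedes the first statement */
  int i;            /* C89 has no "for (int i = 0; ...)" */
  assert(n >= 0);
  for (i = 0; i < n; ++i) {
    double x = v[i]; /* a nested block may open with its own declarations */
    acc += x;
  }
  return acc;
}
```

With GCC, compiling with `-std=c89 -pedantic` is one way to catch `//` comments and declarations that follow statements.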

@alheinecke
Collaborator

Thanks for your contribution. I'm with @FreddieWitherden here in terms of potential for optimization.

I also suggest we merge for now into a branch "feature_ppc64le", and then we also get CI up and running before merging into main. Is there a chance that we could run CI in IBM cloud in a similar way as we use GVT3 in AWS? Would you folks be able to provide credits for this?

@breuera
Contributor

breuera commented Aug 19, 2024

> Is there a chance that we could run CI in IBM cloud in a similar way as we use GVT3 in AWS? Would you folks be able to provide credits for this?

Here is an option which was announced today: https://lists.osuosl.org/pipermail/openpower/Week-of-Mon-20240819/000110.html

@WillTrojak
Author

I'll investigate the IBM cloud power10 access for CI. Also I found a bug for FP64 when m and n are small, k is large, and k % 4 == 0. I'll fix this later today.
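One way to sweep the shape region described above (small m and n, large k, k % 4 == 0) is to compare a kernel against a naive reference. The sketch below is an assumption-laden illustration, not the PR's test code: the `sketch_gemm_fn` signature, the column-major layout with lda = m, ldb = k, ldc = m, and all `sketch_` names are hypothetical, and `sketch_buggy_dgemm` is a deliberately planted stand-in that has nothing to do with the actual bug.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical kernel signature: C += A * B, column-major. */
typedef void (*sketch_gemm_fn)(const double* a, const double* b, double* c,
                               int m, int n, int k);

/* Shared loop nest; k_iter lets the planted bug below skip iterations. */
static void sketch_dgemm_core(const double* a, const double* b, double* c,
                              int m, int n, int k_iter, int ldb) {
  int i, j, l;
  for (j = 0; j < n; ++j)
    for (l = 0; l < k_iter; ++l)
      for (i = 0; i < m; ++i)
        c[j * m + i] += a[l * m + i] * b[j * ldb + l];
}

static void sketch_ref_dgemm(const double* a, const double* b, double* c,
                             int m, int n, int k) {
  sketch_dgemm_core(a, b, c, m, n, k, k);
}

/* Planted bug for demonstration: drops the last 4 k-iterations if k % 4 == 0. */
static void sketch_buggy_dgemm(const double* a, const double* b, double* c,
                               int m, int n, int k) {
  int k_iter = (k % 4 == 0 && k > 4) ? k - 4 : k;
  sketch_dgemm_core(a, b, c, m, n, k_iter, k);
}

/* Max absolute difference against the reference, or -1.0 if malloc fails. */
static double sketch_check_shape(sketch_gemm_fn kernel, int m, int n, int k) {
  double *a, *b, *c_ref, *c_tst;
  double err = 0.0, d;
  int i;
  assert(m > 0 && n > 0 && k > 0);
  a = (double*)malloc(sizeof(double) * (size_t)m * (size_t)k);
  b = (double*)malloc(sizeof(double) * (size_t)k * (size_t)n);
  c_ref = (double*)malloc(sizeof(double) * (size_t)m * (size_t)n);
  c_tst = (double*)malloc(sizeof(double) * (size_t)m * (size_t)n);
  if (a == NULL || b == NULL || c_ref == NULL || c_tst == NULL) {
    err = -1.0;
  } else {
    /* Positive, exactly representable inputs keep the comparison exact. */
    for (i = 0; i < m * k; ++i) a[i] = 1.0 + 0.5 * (double)(i % 3);
    for (i = 0; i < k * n; ++i) b[i] = 0.5 + 0.25 * (double)(i % 5);
    for (i = 0; i < m * n; ++i) { c_ref[i] = 0.0; c_tst[i] = 0.0; }
    sketch_ref_dgemm(a, b, c_ref, m, n, k);
    kernel(a, b, c_tst, m, n, k);
    for (i = 0; i < m * n; ++i) {
      d = c_ref[i] - c_tst[i];
      if (d < 0.0) d = -d;
      if (d > err) err = d;
    }
  }
  free(a); free(b); free(c_ref); free(c_tst);
  return err;
}

/* Counts failing shapes with m, n in 1..max_mn and k % 4 == 0 in [k_lo, k_hi]. */
static int sketch_sweep(sketch_gemm_fn kernel, int max_mn, int k_lo, int k_hi,
                        double tol) {
  int m, n, k, failures = 0;
  double err;
  for (m = 1; m <= max_mn; ++m)
    for (n = 1; n <= max_mn; ++n)
      for (k = k_lo; k <= k_hi; ++k) {
        if (k % 4 != 0) continue;
        err = sketch_check_shape(kernel, m, n, k);
        if (err < 0.0 || err > tol) ++failures;
      }
  return failures;
}
```

A JIT-generated kernel could be plugged in through a thin wrapper matching `sketch_gemm_fn`; how such a kernel is obtained is outside this sketch.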

@WillTrojak
Author

Getting IBM Cloud credit may be difficult. I could get a power10 server spun up, but I'm not sure how long it would be available for. I think the option pointed out by @breuera might be best. Here are some further links:

breuera and others added 23 commits September 9, 2024 18:59
Added initial FP32-POWER GEMM-microkernel.
Rmeoved trailing white spaces.
Increased reuse-distance of GPRs w.r.t. VSX-loads and -stores.
Added wrapper for power fixed-point compare.
Signed-off-by: WillTrojak <[email protected]>
@WillTrojak
Author

@FreddieWitherden I embedded a form ID in the 32- and 64-bit opcodes, and did some other cleanup. It reduced the line count a fair bit.
I still need to:

  • Add the automatic blocking stuff, although I have a vaguely working prototype.
  • Find why some kernels fail when m and n are small and k is big for fp64.

@rengolin
Contributor

> Getting IBM Cloud credit may be difficult. I could get a power10 server spun up, but I'm not sure how long it would be available for. I think the option pointed out by @breuera might be best. Here are some further links:

Hi @WillTrojak, thanks for the work on PPC! I have submitted the form above and listed you as the advocate.

I agree with Alex: we can easily merge this into a branch, create a CI loop in GitHub, and work there. When we're happy, we merge to main. We do this with other feature branches, too.

@WillTrojak
Author

@rengolin I actually already have a power10 server via OSUOSL for libxsmm. Email me and I can add you to the instance to set up the CI stuff. I found with OpenBLAS that the performance was 10-20% lower than on bare metal, as they seem to be running some OpenShift cluster.

Sorry about being so slow on this PR. I have a number of updates I was going to push in the next few days that improve the performance quite a bit.

I’m just testing and solving some regression issues with power9.

@rengolin
Contributor

> @rengolin I actually already have a power10 server via OSUOSL for libxsmm. Email me and I can add you to the instance to set up the CI stuff.

Oh, it seems the process has started already. Perhaps you'll be contacted?

> I found with OpenBLAS that the performance was 10-20% lower than on bare metal, as they seem to be running some OpenShift cluster.

Yeah, I'm not too worried about performance numbers in a VM. The slowdown won't be a constant factor anyway, so I mostly care about conformance in this CI loop. If you have access to a bare metal machine, you can run benchmarks and report the numbers on the PR; that should be fine.

@WillTrojak
Author

@rengolin I pushed an update. I'm pretty happy with the state of the kernels now being produced, especially for FP32. I've spoken to @FreddieWitherden and we have a plan for sparse kernels, but I'll sort this in a separate PR.

@rengolin rengolin changed the base branch from main to powerpc February 5, 2025 14:01
@rengolin
Contributor

rengolin commented Feb 5, 2025

Ok, so I changed the base branch to https://github.com/libxsmm/libxsmm/tree/feature_powerpc until we work on functionality, but we'll need to rebase to main soon if we want to continue working on this.

If you have more code on this, please submit PRs against the powerpc branch for now and have a plan to rebase. I'll run the CI loop from our side to make sure nothing was accidentally broken in the generic code, and we'll use that branch to add a new GitHub Actions hook that uses your builder.

@rengolin rengolin merged commit 3c8534f into libxsmm:powerpc Feb 5, 2025
@WillTrojak
Author

WillTrojak commented Feb 5, 2025

Great. I've been working on rebasing it. When I run the code before the rebase everything's fine, but after the rebase it segfaults. Did something change in the way arguments are passed to kernels? There seem to have been some changes in libxsmm_dispatch_gemm? I say this because the generated code is identical.

5 participants