[AARCH64] Fall back to GEMM if mkldnn_matmul fails #115936

Closed
wants to merge 2 commits into from

@malfet malfet commented Dec 15, 2023

  • Add call to `at::globalContext().userEnabledMkldnn()` to `apply_mkldnn_matmul_heur`
  • Surround calls to `mkldnn_matmul` with `try {} catch {}`
  • Print warning and fall back to BLAS (by calling `at::globalContext().setUserEnabledMkldnn()`) if `mkldnn_matmul()` fails
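
The control flow described in the list above can be modeled in a few lines. This is only a rough Python sketch of the pattern, not the C++ change itself: `MatmulDispatcher`, `reference_gemm`, and `broken_fast_matmul` are hypothetical names invented for illustration, and the `fast_enabled` attribute merely plays the role that the `userEnabledMkldnn()` flag plays in the real code.

```python
import warnings


def reference_gemm(a, b):
    """Plain nested-loop matrix multiply, standing in for the BLAS GEMM path."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def broken_fast_matmul(a, b):
    """Stand-in for a fast backend that cannot run on this machine."""
    raise RuntimeError("internal error")


class MatmulDispatcher:
    """Hypothetical model of the dispatch logic; none of these names are PyTorch APIs."""

    def __init__(self, fast_matmul):
        self.fast_matmul = fast_matmul
        # Assumption: this flag models the global "fast backend enabled" switch.
        self.fast_enabled = True

    def matmul(self, a, b):
        # The heuristic consults the switch before trying the fast path.
        if self.fast_enabled:
            try:
                return self.fast_matmul(a, b)
            except Exception as exc:
                # Warn once, then disable the fast path for all later calls.
                warnings.warn(f"fast matmul failed, switching to reference GEMM: {exc}")
                self.fast_enabled = False
        return reference_gemm(a, b)
```

With `MatmulDispatcher(broken_fast_matmul)`, the first `matmul` call emits a warning and returns the reference result; because the flag is then cleared, later calls go straight to the reference path without retrying the failing backend.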

Test plan: On Linux arm run:

$ sudo chmod 400 /sys; python -c "import torch;m=torch.nn.Linear(1, 32);print(torch.__version__);print(m(torch.rand(32, 1)))"
Error in cpuinfo: failed to parse the list of possible processors in /sys/devices/system/cpu/possible
Error in cpuinfo: failed to parse the list of present processors in /sys/devices/system/cpu/present
Error in cpuinfo: failed to parse both lists of possible and present processors
2.3.0.dev20231215
bad err=11 in Xbyak::Error
bad err=11 in Xbyak::Error
/home/ubuntu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/nn/modules/linear.py:116: UserWarning: mkldnn_matmul failed, switching to BLAS gemm:internal error (Triggered internally at /pytorch/aten/src/ATen/native/LinearAlgebra.cpp:1509.)
  return F.linear(input, self.weight, self.bias)
tensor([[-0.5183,  0.2279, -0.4035,  ..., -0.3446,  0.0938, -0.2113],
        [-0.5111,  0.2362, -0.3821,  ..., -0.3536,  0.1011, -0.2159],
        [-0.6387,  0.0894, -0.7619,  ..., -0.1939, -0.0282, -0.1344],
        ...,
        [-0.6352,  0.0934, -0.7516,  ..., -0.1983, -0.0247, -0.1366],
        [-0.4790,  0.2733, -0.2862,  ..., -0.3939,  0.1338, -0.2365],
        [-0.5702,  0.1682, -0.5580,  ..., -0.2796,  0.0412, -0.1782]],
       grad_fn=<AddmmBackward0>)

Fixes #114750

pytorch-bot bot commented Dec 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/115936

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 4a3390c with merge base 4ea7430 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: linalg_frontend release notes category label Dec 15, 2023

@lezcano lezcano left a comment

Do we test this in CI? If so, can we add a regression test?

@malfet malfet added ciflow/binaries Trigger all binary build and upload jobs on the PR topic: bug fixes topic category ciflow/binaries_wheel Trigger binary build and upload jobs for wheel on the PR and removed ciflow/binaries Trigger all binary build and upload jobs on the PR labels Dec 15, 2023

malfet commented Dec 15, 2023

@lezcano we can only test it in nightlies, but I will add it later (and do some manual testing right now)

@malfet malfet requested review from jgong5 and CaoE December 16, 2023 15:02
@malfet malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 16, 2023

malfet commented Dec 16, 2023

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


malfet commented Dec 16, 2023

@jgong5 FYI: is it safe to assume that mkldnn_matmul should never raise an exception and that, if it does, it is safe to fall back to BLAS GEMM? Also, it would be nice to check the perf impact (it should be near zero, as try blocks that do not throw are generally inexpensive).

guilhermeleobas pushed a commit to guilhermeleobas/pytorch that referenced this pull request Dec 18, 2023
- Add call to `at::globalContext().userEnabledMkldnn()` to `apply_mkldnn_matmul_heur`
- Surround calls to `mkldnn_matmul` with `try {} catch {}`
- Print warning and fall back to BLAS (by calling  `at::globalContext().setUserEnabledMkldnn()`) if `mkldnn_matmul()` fails

Test plan: On Linux arm run:
```shell
$ sudo chmod 400 /sys; python -c "import torch;m=torch.nn.Linear(1, 32);print(torch.__version__);print(m(torch.rand(32, 1)))"
Error in cpuinfo: failed to parse the list of possible processors in /sys/devices/system/cpu/possible
Error in cpuinfo: failed to parse the list of present processors in /sys/devices/system/cpu/present
Error in cpuinfo: failed to parse both lists of possible and present processors
2.3.0.dev20231215
bad err=11 in Xbyak::Error
bad err=11 in Xbyak::Error
/home/ubuntu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/nn/modules/linear.py:116: UserWarning: mkldnn_matmul failed, switching to BLAS gemm:internal error (Triggered internally at /pytorch/aten/src/ATen/native/LinearAlgebra.cpp:1509.)
  return F.linear(input, self.weight, self.bias)
tensor([[-0.5183,  0.2279, -0.4035,  ..., -0.3446,  0.0938, -0.2113],
        [-0.5111,  0.2362, -0.3821,  ..., -0.3536,  0.1011, -0.2159],
        [-0.6387,  0.0894, -0.7619,  ..., -0.1939, -0.0282, -0.1344],
        ...,
        [-0.6352,  0.0934, -0.7516,  ..., -0.1983, -0.0247, -0.1366],
        [-0.4790,  0.2733, -0.2862,  ..., -0.3939,  0.1338, -0.2365],
        [-0.5702,  0.1682, -0.5580,  ..., -0.2796,  0.0412, -0.1782]],
       grad_fn=<AddmmBackward0>)
```
Fixes pytorch#114750

Pull Request resolved: pytorch#115936
Approved by: https://github.com/lezcano
@malfet malfet deleted the malfet/aarch64-fallback-to-openblas branch December 18, 2023 19:04
dmenig pushed a commit to dmenig/pytorch that referenced this pull request Dec 21, 2023
@atalman atalman added this to the 2.2.0 milestone Jan 2, 2024
huydhn pushed a commit that referenced this pull request Jan 2, 2024
huydhn added a commit that referenced this pull request Jan 3, 2024

Co-authored-by: Nikita Shulga <[email protected]>
Labels
ciflow/binaries_wheel Trigger binary build and upload jobs for wheel on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: linalg_frontend release notes category topic: bug fixes topic category
Development

Successfully merging this pull request may close these issues.

[aarch64] nn.Linear(20, 1) inference fails
4 participants