Fix pytorch fbgemm.dll dependency issue #2927

Merged Nov 15, 2024 (1 commit)

@btbest
Contributor

btbest commented Nov 7, 2024

Issue we first saw in #2904 with pytorch 2.4.1

E.g. https://github.com/ilastik/ilastik/actions/runs/11721367647/job/32654563102

Traceback (most recent call last):
  File "C:\Users\runneradmin\miniconda3\conda-bld\ilastik-gpu_1730976400239\test_tmp\run_test.py", line 23, in <module>
    import torch
  File "C:\Users\runneradmin\miniconda3\conda-bld\ilastik-gpu_1730976400239\_test_env\lib\site-packages\torch\__init__.py", line 262, in <module>
    _load_dll_libraries()
  File "C:\Users\runneradmin\miniconda3\conda-bld\ilastik-gpu_1730976400239\_test_env\lib\site-packages\torch\__init__.py", line 245, in _load_dll_libraries
    raise err
OSError: [WinError 182] The operating system cannot run %1. Error loading "C:\Users\runneradmin\miniconda3\conda-bld\ilastik-gpu_1730976400239\_test_env\lib\site-packages\torch\lib\fbgemm.dll" or one of its dependencies.

Conclusion:

I've put the pytorch <2.4.1 constraint in meta.yaml, and also environment-dev.yml, so we should effectively be on 2.4.0 now.

The error we see on Windows CI with 2.4.1 and up may look identical to an error widely described for pytorch 2.4.0 (pytorch/pytorch#131662), but I think they are separate issues. The error looks identical because in both cases, a dll related to openmp is missing/broken.

Evidently, unlike a huge number of people as documented by the pytorch issue, ilastik seems to be fine with pytorch 2.4.0. I suspect everyone else was having openmp ruined by torchaudio, which we don't pull.

In #2904, @k-dominik describes a clash between pytorch 2.4.1 and pyshtools due to its dependency gmt overwriting openmp. An attempted workaround with a modified pyshtools package, uploaded to the conda channel ilastik-forge/label/patched, may have slimmed the dependencies but didn't solve the problem.

In my attempts here, pyshtools was also the source of the problem; but instead of gmt, it looks like openmp was being overwritten by libflang. Specifically, it was pulling both openmp and libflang at version 5, which is rather old. I tested whether pytorch 2.5.1 could coexist with other openmp versions in an env, and it looks like openmp=8.0.1 would have been fine. libflang=11.* pulled llvm-openmp instead of openmp, and the overwriting issue persisted, but libflang=17.* no longer pulled any openmp and was fine as well.


codecov bot commented Nov 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.47%. Comparing base (dfef849) to head (12839a7).
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2927      +/-   ##
==========================================
- Coverage   56.47%   56.47%   -0.01%     
==========================================
  Files         535      535              
  Lines       62263    62263              
  Branches     7711     7711              
==========================================
- Hits        35165    35164       -1     
+ Misses      25335    25334       -1     
- Partials     1763     1765       +2     


@k-dominik
Contributor

do you have a link to a source for "because libomp140 was no longer shipped in a VS redistributable.", or did you diff different VS redistributables yourself?

@btbest
Contributor Author

btbest commented Nov 12, 2024

Progress: After verifying that a barebones pytorch install works fine (also on github windows runners), I've been able to find the culprit package that we install and with which it breaks: sphericaltexture... At least I can finally reproduce this problem on my machine
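The "barebones install works" check can be scripted so that a DLL load failure does not take the calling process down with it. The following is only a minimal sketch, not what was run in this PR: `try_import` is a hypothetical helper that imports a module in a fresh interpreter and returns whether it succeeded, together with the captured stderr.

```python
import subprocess
import sys


def try_import(module_name, python=sys.executable):
    """Import `module_name` in a fresh interpreter.

    Returns (ok, stderr_text). Running in a subprocess keeps a failing
    native-library load (e.g. an OSError raised while loading a DLL)
    isolated from the calling process.
    """
    result = subprocess.run(
        [python, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr
```

Calling `try_import("torch")` in an environment before and after installing a suspect package (here, sphericaltexture) would show whether that package is what breaks the import; the stderr text should contain the traceback when it fails.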

@k-dominik
Contributor

do you install the patched version of pyshtools from ilastik-forge/label/patched ?

@btbest
Contributor Author

btbest commented Nov 12, 2024

Found the culprit; the patched pyshtools didn't fix it. Installing sphericaltexture pulls openmp 5.0.0. 5.0.0 is sufficient to cause the issue even with pytorch 2.5.1. Installing 8.0.1 allows pytorch 2.5.1 to work fine.

@btbest
Contributor Author

btbest commented Nov 13, 2024

(torchtest) >conda-tree whoneeds openmp
libflang

(torchtest) >conda-tree whoneeds libflang
pyshtools

(torchtest) >conda-tree depends pyshtools
astropy
cartopy
ducc0
fftw
libblas
libflang
liblapack
libpgmath
matplotlib-base
numpy
pooch
python
python_abi
requests
scipy
ucrt
vc
vc14_runtime
xarray

But libflang doesn't show up in the shtools repo
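For environments without conda-tree, the same reverse lookup can be approximated from the package records conda keeps in the environment. This is a rough sketch, not conda-tree's implementation: it assumes each `conda-meta/*.json` record has a `name` and a `depends` list of spec strings, and `who_needs` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_records(prefix):
    """Read every package record from <prefix>/conda-meta/*.json."""
    records = []
    for meta_file in sorted(Path(prefix, "conda-meta").glob("*.json")):
        with open(meta_file, encoding="utf-8") as fh:
            records.append(json.load(fh))
    return records


def who_needs(prefix, package):
    """Return sorted names of installed packages that directly depend on `package`."""
    needed_by = set()
    for record in load_records(prefix):
        for spec in record.get("depends", []):
            # A spec looks like "name", "name >=1.2" or "name 1.2 build";
            # only the first token (the package name) matters here.
            parts = spec.split()
            if parts and parts[0] == package:
                needed_by.add(record["name"])
    return sorted(needed_by)
```

Note that this only reports what the installed package builds declare, which can differ from what the upstream repository lists. That may be why libflang appears here but not in the shtools repo.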

@k-dominik
Contributor

I had a look at such an environment and I see at least two openmp libs, openmp and intel-openmp... if you dig into these packages a bit you can find that both of them come with a version of libiomp5md.dll (note the different sizes, hashes):

# intel-openmp
      {
        "_path": "Library/bin/libiomp5md.dll",
        "path_type": "hardlink",
        "sha256": "3e69ef3e52deff22ba0e44c36e1198dcae1657098e63a2d550842b717eb59175",
        "sha256_in_prefix": "3e69ef3e52deff22ba0e44c36e1198dcae1657098e63a2d550842b717eb59175",
        "size_in_bytes": 1310600
      },


# openmp:
      {
        "_path": "Library/bin/libiomp5md.dll",
        "path_type": "hardlink",
        "sha256": "cf43b5e78b8664c734366ad99bbe7a367fe50d3844f267c920f1e6280d3c56be",
        "sha256_in_prefix": "cf43b5e78b8664c734366ad99bbe7a367fe50d3844f267c920f1e6280d3c56be",
        "size_in_bytes": 612864
      },

not sure why conda doesn't issue a clobber warning.

My guess would be this causes the problem. So the pulling in of openmp per se is not the problem - but clobbering is.

versions of openmp aside, you can also "fix" this with conda install -c conda-forge intel-openmp --force-reinstall which just overwrites libiomp5md.dll and friends with an ABI-compatible version that pytorch was built with.
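Since conda did not warn about this overlap, a small script can look for it directly. The sketch below is an illustration under assumptions: it presumes the per-file entries quoted above live under `paths_data` → `paths` in each `conda-meta/*.json` record, and `find_clobbered_paths` is a hypothetical helper, not a conda API.

```python
import json
from collections import defaultdict
from pathlib import Path


def find_clobbered_paths(prefix):
    """Return {path: {package: sha256}} for files claimed by several packages
    whose recorded hashes differ (i.e. one package's file overwrote another's)."""
    claims = defaultdict(dict)  # path -> {package name: recorded sha256}
    for meta_file in sorted(Path(prefix, "conda-meta").glob("*.json")):
        with open(meta_file, encoding="utf-8") as fh:
            record = json.load(fh)
        for entry in record.get("paths_data", {}).get("paths", []):
            claims[entry["_path"]][record["name"]] = entry.get("sha256")
    return {
        path: owners
        for path, owners in claims.items()
        # More than one claimant, and they disagree about the content.
        if len(owners) > 1 and len(set(owners.values())) > 1
    }
```

In an environment like the one described, this would be expected to list `Library/bin/libiomp5md.dll` with both openmp and intel-openmp as claimants. Paths shared with identical hashes are deliberately skipped, since overwriting a file with an identical copy is harmless.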

@btbest
Contributor Author

btbest commented Nov 13, 2024

> I had a look at such an environment and I see at least two openmp libs, openmp and intel-openmp... if you dig into these packages a bit you can find that both of them come with a version of libiomp5md.dll (note the different sizes, hashes)

🤦 I hadn't noticed that. Not just with openmp btw; in an attempt with a newer libflang (18.1.1), llvm-openmp clobbers it the same way.

> not sure why conda doesn't issue a clobber warning.

but it's happy to issue dozens of them on our CI, none of them seemingly an issue :(

> My guess would be this causes the problem. So the pulling in of openmp per se is not the problem - but clobbering is.
>
> versions of openmp aside, you can also "fix" this with conda install -c conda-forge intel-openmp --force-reinstall which just overwrites libiomp5md.dll and friends with an ABI-compatible version that pytorch was built with.

This does seem to work; so I think we can choose to pin pytorch 2.3, or unpin pytorch and do this force-reinstall instead. Do you have a preference for either? Neither is really clean. The force-reinstall would extend CI duration. We could even pin for CI and force-reinstall for release, though CI durations aren't enough of a bottleneck to warrant that extra complexity I think.
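Whichever option is chosen, it may help to check which package's copy of the DLL is actually on disk. The following is a hedged sketch under the same assumption about the `conda-meta` record layout (`paths_data` → `paths`, with `sha256` and `sha256_in_prefix` fields as in the entries quoted earlier); `file_matches_record` is a hypothetical helper.

```python
import hashlib
import json
from pathlib import Path


def file_matches_record(prefix, package, rel_path):
    """Compare the file at <prefix>/<rel_path> with the hash `package` recorded for it.

    Returns True or False for a match or mismatch, and None if the package
    or the path is not found in the environment's records.
    """
    for meta_file in sorted(Path(prefix, "conda-meta").glob("*.json")):
        record = json.loads(meta_file.read_text(encoding="utf-8"))
        if record.get("name") != package:
            continue
        for entry in record.get("paths_data", {}).get("paths", []):
            if entry["_path"] == rel_path:
                # Prefer the hash of the file as installed into the prefix.
                expected = entry.get("sha256_in_prefix") or entry.get("sha256")
                actual = hashlib.sha256(Path(prefix, rel_path).read_bytes()).hexdigest()
                return actual == expected
    return None
```

For example, `file_matches_record(env_prefix, "intel-openmp", "Library/bin/libiomp5md.dll")` returning `False` would suggest that another package's copy is currently installed, and `True` after the force-reinstall would suggest the intel-openmp copy is back.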

@k-dominik
Contributor

strong preference for pinning pytorch<2.4 - this allows at least for some flexibility when creating ilastik envs - in release we pin to 2.3 currently.

The reinstall "hack" will break something else - I'm sure :)

@btbest
Contributor Author

btbest commented Nov 13, 2024

> do you have a link to a source for "because libomp140 was no longer shipped in a VS redistributable.", or did you diff different VS redistributables yourself?

For documentation, I had read that over here:

> fbgemm.dll seems to require libomp140.x86_64.dll, and that one was removed from the Windows 11 C++ redistributables, as it seems

But as it turns out, that isn't related to the issue we have here.

Clash on Win - pyshtools installs openmp, which clobbers dlls needed by pytorch's intel-openmp since pytorch 2.4

See bfab86d (similar problem) and ilastik#2927
@btbest
Contributor Author

btbest commented Nov 14, 2024

Fixes #2922

@btbest
Contributor Author

btbest commented Nov 15, 2024

I've updated the original post to summarise my understanding of what's going on now for future reference

@btbest btbest requested a review from k-dominik November 15, 2024 10:00
@btbest btbest merged commit cca71b7 into ilastik:main Nov 15, 2024
16 checks passed
@btbest btbest deleted the pytorch-libfbgemm branch November 15, 2024 12:11
k-dominik pushed a commit to k-dominik/ilastik that referenced this pull request Nov 29, 2024
Clash on Win - pyshtools installs openmp, which clobbers dlls needed by pytorch's intel-openmp since pytorch 2.4

See bfab86d (similar problem)
k-dominik pushed a commit to k-dominik/ilastik that referenced this pull request Dec 9, 2024
Clash on Win - pyshtools installs openmp, which clobbers dlls needed by pytorch's intel-openmp since pytorch 2.4

See bfab86d (similar problem)