HDBSCAN models require consistent version of sklearn to open #213

Closed
victoriapascal opened this issue Jul 25, 2022 · 16 comments
Labels: bug, model, wontfix

Comments

@victoriapascal

Currently trying to install poppunk version 2.3.0.

I used to have a conda installation with poppunk 2.3.0 and poppunk_sketch 1.7.4, and I created my own poppunk database so that I could run poppunk_assign, poppunk --fit-model and poppunk_visualise. Now I need to re-create a conda environment that includes poppunk, but when I install this same version, poppunk_sketch is not included in the installation, and I believe it is needed to run poppunk_assign. I attach a file with the exact error I get when trying to run the commands mentioned above (see poppunk_sketch_error.txt). I have also tried installing other versions in case that would solve the error, but I also get an error, which is shown in the other file (see poppunk_2.4_error_message.txt; not sure this is related at all, but just in case). I also tried installing it via pip and copying the poppunk_sketch executable, but nothing seems to work. I would really appreciate some help here.

Thanks a lot for the help in advance!

Victoria

poppunk_2.4_error_message.txt
poppunk_sketch_error.txt

@johnlees
Member

johnlees commented Aug 2, 2022

I'm not immediately sure what's going wrong here unfortunately. It seems like the version string from pp-sketchlib isn't as expected.

Can you run poppunk_sketch in your installation?
If you start a python session and run import pp_sketchlib what happens?

Copying out some relevant parts for reference:

/bin/sh: 1: poppunk_sketch: not found
Traceback (most recent call last):
  File "/opt/conda/bin/poppunk_assign", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/assign.py", line 389, in main
    dbFuncs = setupDBFuncs(args, args.min_kmer_count, qc_dict)
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/utils.py", line 58, in setupDBFuncs
    version = checkSketchlibVersion()
  File "/opt/conda/lib/python3.9/site-packages/PopPUNK/sketchlib.py", line 49, in checkSketchlibVersion
    version = line.rstrip().decode().split(" ")[1]
IndexError: list index out of range

Looks like both sketchlib 2.0.0 and poppunk 2.4.0 are installed
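
The IndexError in the traceback is consistent with the version lookup receiving an empty line: splitting an empty string on a space yields a one-element list, so index 1 does not exist. A minimal sketch of a more defensive parse is below. The function name parse_sketchlib_version and the sample output string are hypothetical and are not PopPUNK's actual code; the only thing taken from the traceback is that the version is read as the second space-separated token.

```python
def parse_sketchlib_version(raw_output: bytes):
    """Return the second whitespace-separated token of the first usable line.

    Returns None instead of raising IndexError when the command printed
    nothing to stdout (for example because the executable was not found).
    """
    for line in raw_output.splitlines():
        fields = line.decode().strip().split()
        if len(fields) >= 2:
            return fields[1]
    return None


if __name__ == "__main__":
    # Hypothetical output format, used only for illustration.
    print(parse_sketchlib_version(b"poppunk_sketch 1.7.4\n"))
    # Empty stdout, as when the shell reports "not found" on stderr.
    print(parse_sketchlib_version(b""))
```

Returning None lets the caller report a missing or renamed executable explicitly rather than failing with an unrelated-looking index error.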

@johnlees
Member

johnlees commented Aug 2, 2022

Note: a similar error appears in #210

@johnlees added the bug and package labels on Aug 2, 2022
@johnlees johnlees mentioned this issue Aug 2, 2022
37 tasks
@victoriapascal
Author

Thanks for the quick answer. poppunk_sketch does not seem to be installed. In /conda/bin/ there are several executables (poppunk, poppunk_assign, poppunk_prune....) and *.py files (poppunk_add_weights.py, poppunk_batch_mst.py...) but not poppunk_sketch. I am able to import pp_sketchlib without problems though.

@johnlees
Member

johnlees commented Aug 3, 2022

And conda list in that environment shows pp-sketchlib >=2.0.0?

@victoriapascal
Author

Yes, that's the case (see list of packages attached).

conda_list.txt

@johnlees
Member

johnlees commented Aug 3, 2022

Could you try making a fresh environment with conda create -n pp_retry poppunk==2.4.0 pp-sketchlib==2.0.0 and see if you have any luck there?

@victoriapascal
Author

A new installation still gives me an error (see attached file). In this env, poppunk is installed but poppunk_sketch is still missing.

pp_fresh_install.txt

@johnlees
Member

johnlees commented Aug 3, 2022

Ah, apologies, I now see the problem. From pp-sketchlib 2.0.0 onwards, poppunk_sketch was renamed to sketchlib. If you instead install pp-sketchlib v1.7.4 it should work ok.

From v2.5.0 this will be updated and fixed so they work together.
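
A quick way to see which of the two executable names is available in the active environment is to look both up on PATH. This is a small sketch using only the standard library; the helper name find_sketch_executable is hypothetical, and the two candidate names are the ones mentioned in this thread.

```python
import shutil


def find_sketch_executable(candidates=("sketchlib", "poppunk_sketch")):
    """Return (name, path) for the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return name, path
    return None


if __name__ == "__main__":
    found = find_sketch_executable()
    if found is None:
        print("Neither sketchlib nor poppunk_sketch is on PATH")
    else:
        print(f"Found {found[0]} at {found[1]}")
```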

The only thing I don't understand is why you aren't getting the version from the library file. Could you try running, in a python session:

import pp_sketchlib
pp_sketchlib.version
dir(pp_sketchlib)

@victoriapascal
Author

Indeed, installing pp-sketchlib v1.7.4 makes poppunk_sketch available, but I still get an error when I run poppunk_assign (see attached file). I also attach another file to show you what I get from running the import and the other commands in my python session. How do you run poppunk_assign in poppunk version 2.4 then?

pp_assign_error.txt
python_import.txt

@johnlees
Member

johnlees commented Aug 3, 2022

That's a different error now, which appears to be caused by scikit-learn changing their API. Can you try downgrading scikit-learn to v0.24? I'll need to put in a fix for this in future versions.
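
To confirm which scikit-learn version an environment actually has before and after downgrading, the installed distribution can be queried from Python. This is only a sketch: the helper names are hypothetical, and the check simply treats any major version below 1 as belonging to the older API.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_sklearn_version():
    """Return the installed scikit-learn version string, or None if absent."""
    try:
        return version("scikit-learn")
    except PackageNotFoundError:
        return None


def sklearn_predates_v1(version_string):
    """True when the major version is below 1 (for example 0.24.x)."""
    major = int(version_string.split(".")[0])
    return major < 1


if __name__ == "__main__":
    found = installed_sklearn_version()
    if found is None:
        print("scikit-learn is not installed in this environment")
    else:
        print(found, "pre-1.0" if sklearn_predates_v1(found) else "1.0 or later")
```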

@johnlees
Member

johnlees commented Aug 3, 2022

Would you be able to attach the fit.pkl file you are using here?

@victoriapascal
Author

Thanks! Downgrading to v0.24 solved the issue indeed. I attach here the pkl I'm using for this run.
vanAB_dataset_updated.dists.pkl.zip

@johnlees
Member

johnlees commented Aug 4, 2022

Ok, glad to hear this sorted the issue!

I think the pickle above is the sample labels/dists pickle. Do you also have a _fit.pkl you could share so I can look into the error?

@victoriapascal
Author

Do you mean this one?
vanAB_dataset_updated_fit.pkl.zip

@johnlees
Member

johnlees commented Aug 4, 2022

Ok, that's great, thank you. I can now replicate the error.

@johnlees changed the title from "poppunk_sketch missing after poppunk installation" to "HDBSCAN models require consistent version of sklearn to open" on Aug 4, 2022
@johnlees added the wontfix and model labels and removed the package label on Aug 4, 2022
@johnlees
Member

johnlees commented Aug 4, 2022

Just to state the problem and resolution here:
sklearn changed its API between v0.24 and v1.0, so loading an HDBSCAN model created with a different version won't work. Loading an older HDBSCAN fit with a newer sklearn installation therefore throws the error reported above:

ModuleNotFoundError: No module named 'sklearn.neighbors._dist_metrics'

Most distributed models don't use this mode, so I don't foresee this being a big problem. I will add a note to the documentation that, to use such a model, the sklearn version needs to be downgraded, or that you may generally want to run refine model to give a simpler and faster model in the first place.
We could write a script to convert HDBSCAN models to the new version of the API, but this would require some digging into the pickle, and it's not immediately clear to me from the sklearn docs what they changed and how to update the dist_metrics part. Will do this only if it keeps cropping up and downgrading sklearn is no longer viable.
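
For reference, the general shape of such a conversion is an unpickler that redirects old module paths to new ones while loading. The sketch below shows only that generic mechanism with the standard library; the class and function names are hypothetical. Any mapping for 'sklearn.neighbors._dist_metrics' is an assumption that would need checking against the installed sklearn, and a module rename alone may not be enough if class names or internal state also changed.

```python
import pickle


class RenamingUnpickler(pickle.Unpickler):
    """Unpickler that rewrites module paths before resolving classes."""

    def __init__(self, file, renames):
        super().__init__(file)
        # Maps old module path -> new module path.
        self._renames = dict(renames)

    def find_class(self, module, name):
        # Redirect the module lookup if the old path is listed.
        module = self._renames.get(module, module)
        return super().find_class(module, name)


def load_with_renames(path, renames):
    """Load a pickle file, applying the given module renames."""
    with open(path, "rb") as handle:
        return RenamingUnpickler(handle, renames).load()
```

A call would look like load_with_renames("model_fit.pkl", {"old.module": "new.module"}), where the file name and both module paths are placeholders. As with any pickle, this should only be run on files from a trusted source.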
