FEAT Allow the vector-form representation of symmetric distance matrices as input #29133
Comments
This seems like a nice idea. WDYT @jjerphan, @Micky774, @jeremiedbb?
Indeed, this might be a good contribution to introduce as a strategy after #26983 gets in. I do not think creating a structure like We could have a vector-form distance such as the one returned by scipy.spatial.distance.pdist.
Edit: I actually did not understand your request at first. I have just changed the title of this issue so that it is more consistent.
Ideally, we should not materialize this square-form representation of the distance matrix, or any other one, but have the algorithm support the boolean distance metrics (like Jaccard's) directly.
Out of curiosity, is there something blocking you from using:
from sklearn.cluster import HDBSCAN
X = ...  # Data
clusterer = HDBSCAN(metric="jaccard")
clusterer.fit(X)
?
@jjerphan at the time of writing this I was benchmarking Is the square form the redundant form or the 1d non-redundant form? I thought it was the former but the change in title makes me think otherwise.
I meant vector-form and adapted the title accordingly again. I was confused by
This sounds worth doing, however I agree that perhaps the best way forwards is through #26983, and maybe a buffer wrapper so that we can have e.g.
Would it make sense to force full distance matrices (i.e., redundant and square) into non-redundant vector form? That is, should we lower the memory footprint overall, or allow full distance matrices when they are provided, to increase performance at the cost of memory?
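For reference, converting between the two forms is already possible with scipy.spatial.distance.squareform. The snippet below is only a sketch of that round trip on made-up random data; it does not reflect anything implemented in scikit-learn.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up data, for illustration only.
rng = np.random.default_rng(0)
X = rng.random((5, 3))

# Redundant square form: shape (5, 5), symmetric with a zero diagonal.
square = squareform(pdist(X))

# Square form -> vector form. With the default checks=True, squareform
# validates that the input is symmetric and has a zero diagonal.
vector = squareform(square)

# Vector form -> square form again.
restored = squareform(vector)

print(square.shape, vector.shape)
```

The vector form holds N * (N - 1) / 2 entries, so forcing it is only possible for symmetric matrices with a zero diagonal.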
I don't think we'd want to force the vector form. Most use cases don't mind the memory footprint.
Also, most use cases do not have a symmetric distance matrix as well as generally
Describe the workflow you want to enable
I would like to calculate only the upper triangle of a distance matrix, i.e. the $0.5 * (N^2 - N)$ values returned by scipy.spatial.distance.pdist, and use this as input with metric="precomputed", instead of the redundant and memory-intensive square form from sklearn.metrics.pairwise_distances.
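To make the memory difference concrete, here is a small sketch comparing the two representations. The data, the sample count and the Euclidean metric are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics import pairwise_distances

# Arbitrary illustrative data: 1000 samples, 8 features.
rng = np.random.default_rng(0)
n_samples = 1000
X = rng.random((n_samples, 8))

# Vector (condensed) form: one entry per unordered pair of samples.
condensed = pdist(X, metric="euclidean")

# Square form: every pair is stored twice, plus the zero diagonal.
square = pairwise_distances(X, metric="euclidean")

print("vector form:", condensed.shape, condensed.nbytes, "bytes")
print("square form:", square.shape, square.nbytes, "bytes")
```

For large N the vector form needs roughly half the memory of the square form.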
Describe your proposed solution
In addition to the original functionality (or ideally replacing):
Describe alternatives you've considered, if relevant
Using the redundant square form, but this requires a lot more memory than necessary.
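As an illustration of why the square form is not strictly needed, any entry (i, j) can be looked up directly in the vector form. The helper condensed_index below is a hypothetical name introduced for this sketch and is not part of scipy or scikit-learn; it assumes the row-major upper-triangle ordering that scipy.spatial.distance.pdist uses.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def condensed_index(i, j, n):
    """Hypothetical helper: position of the pair (i, j), i != j, in the
    condensed distance vector of n points."""
    if i == j:
        raise ValueError("the diagonal is not stored in the vector form")
    if i > j:
        i, j = j, i  # distances are symmetric, so order the pair
    return n * i - i * (i + 1) // 2 + (j - i - 1)


# Made-up data, for illustration only.
rng = np.random.default_rng(0)
X = rng.random((6, 3))
n = X.shape[0]

condensed = pdist(X)
square = squareform(condensed)  # built here only to check the lookup

# Every off-diagonal entry of the square form is found in the vector form.
matches = all(
    condensed[condensed_index(i, j, n)] == square[i, j]
    for i in range(n)
    for j in range(n)
    if i != j
)
print(matches)
```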
Additional context
It may be worthwhile creating a DistanceMatrix object like https://scikit.bio/docs/latest/generated/skbio.stats.distance.DistanceMatrix.html#skbio.stats.distance.DistanceMatrix