Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT Allow the vector-form representation of symetric distance matrices as input #29133

Open
jolespin opened this issue May 29, 2024 · 8 comments

Comments

@jolespin
Copy link

jolespin commented May 29, 2024

Describe the workflow you want to enable

I would like to calculate the upper triangle of a distance from scipy.spatial.distance.pdist instead of the redundant and memory intensive version from sklearn.metrics.pairwise_distances which has $0.5 * (N^2 - N)$ values and use this as input to metric="precompute"

Describe your proposed solution

from scipy.spatial.distance import pdist
X = # Data
dist = pdist(X, metric="jaccard") # Or any other but I just jaccard a lot for my boolean datasets.  Just didn't want to use euclidean/cosine 
clusterer = HDBSCAN(metric="precomputed_triangle")
clusterer.fit(dist)

In addition to the original functionality (or ideally replacing):

from scipy.spatial.distance import pdist, squareform
X = # Data
dist = squareform(pdist(X, metric="jaccard")) # Or any other but I just jaccard a lot for my boolean datasets.  Just didn't want to use euclidean/cosine 
clusterer = HDBSCAN(metric="precomputed")
clusterer.fit(dist)

Describe alternatives you've considered, if relevant

Using the redundant square form but this requires a lot more memory that isn't necessary.

Additional context

It may be worthwhile creating a DistanceMatrix object like https://scikit.bio/docs/latest/generated/skbio.stats.distance.DistanceMatrix.html#skbio.stats.distance.DistanceMatrix

@jolespin jolespin added Needs Triage Issue requires triage New Feature labels May 29, 2024
@adrinjalali adrinjalali removed the Needs Triage Issue requires triage label Jun 3, 2024
@adrinjalali
Copy link
Member

This seems like a nice idea. WDYT @jjerphan , @Micky774 , @jeremiedbb ?

@jjerphan
Copy link
Member

jjerphan commented Jun 3, 2024

Indeed, this might be a good contribution to introduce as a strategy after #26983 gets in.

I do not think creating a structure like skbio.stats.distance.DistanceMatrix would be helpful, at least from a memory footprint perspective as it seems it stores an entire dense matrix.

We could have a vector-form distance such as the one returned by scipy.spatial.distance.squareform (i.e. which would store $\frac{n (n - 1)}{2}$ elements using a flat 1-D arrays) which would have an interface allowing accessing the distance between $x_i$ and $x_i$, $\forall (i,j) \in [0 .. n-1]^2$.


Edit: I actually did not understand your request at first. I just have changed the title of this issue so that it is more consistent. Ideally, we should not materialize this square form representation of the distance matrix or any other ones but have the algorithm support the boolean distance metrics (like Jaccard's) directly.

Out of curiosity, is there something blocking you from using:

X = # Data
clusterer = HDBSCAN(metric="jaccard")
clusterer.fit(X)

?

@jjerphan jjerphan changed the title [Feature request] Allow 1D upper triangle representations of distance matrices (e.g., output from scipy.spatial.distance.pdist) FEAT Allow 1D upper triangle representations of distance matrices (e.g., output from scipy.spatial.distance.pdist) Jun 3, 2024
@jjerphan jjerphan changed the title FEAT Allow 1D upper triangle representations of distance matrices (e.g., output from scipy.spatial.distance.pdist) FEAT Allow the squareform representation of symetric distance matrices as input Jun 3, 2024
@jolespin
Copy link
Author

jolespin commented Jun 5, 2024

@jjerphan at the time of writing this I was benchmarking HDBSCAN by comparing to some "gold-standard" labels for my dataset which includes >50k observations. It just seemed like a waste to recompute the distances each time so I wanted to use the precompute option and then I realized that it was using the redundant square form. When I took a deep dive into the other classes on sklearn, it looked like most required the precomputed distance to be in the redundant square form.

Is the square form the redundant form or the 1d non-redundant form? I thought it was the former but the change in title makes me think otherwise.

@jjerphan jjerphan changed the title FEAT Allow the squareform representation of symetric distance matrices as input FEAT Allow the vector-form representation of symetric distance matrices as input Jun 5, 2024
@jjerphan
Copy link
Member

jjerphan commented Jun 5, 2024

I meant vector-form and adapted the title accordingly again.

I was confused by scipy.spatial.distance.squareform which converts the one in the other and vice versa.

@adrinjalali adrinjalali added help wanted Hard Hard level of difficulty labels Jun 6, 2024
@Micky774
Copy link
Contributor

Micky774 commented Jun 6, 2024

This sounds worth doing, however I agree that perhaps the best way forwards is through #26983, and maybe a buffer wrapper so that we can have e.g. FullDistance to wrap the vector-form distance, allowing existing infrastructure to treat FullDistance as a redundant squareform matrix without materializing extra memory. This would come at the expense of some performance due to indirection, but IMO a slightly lower-performance, lower-memory option here sounds like a good addition. I'm curious what folks think.

@jolespin
Copy link
Author

jolespin commented Jun 12, 2024

Would it make sense to force full distance matrices (ie redundant and square) into non-redundant vector form? To lower the memory footprint overall or to allow full distances if provided to increase performance at the cost of memory?

@adrinjalali
Copy link
Member

I don't think we'd want to force the vector form. Most usecases don't mind the memory footprint.

@jjerphan
Copy link
Member

Also, most use cases do not have a symmetric distance matrix as well as generally X_train != X_test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants