Cache various statistics to improve performance #204
Order of operations was wrong before, causing incorrect outputs.
The implementation uses an AVL tree to keep track of the low and high parts of the input array, and then updates the trees accordingly in O(log n) time (instead of O(n) provided by numpy.median).
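As a rough illustration of updating a median when a single value is replaced, here is a minimal sketch. It is not the AVL-tree implementation described above: it swaps the tree for a plain sorted list and the standard library's `bisect`, so the search is O(log n) but inserting and removing still shift list elements in O(n). The class name `SortedMedian` and its methods are hypothetical.

```python
import bisect


class SortedMedian:
    """Hypothetical helper: keeps values sorted so the median is an index lookup."""

    def __init__(self, values):
        self._sorted = sorted(values)

    def replace(self, old, new):
        """Swap one occurrence of `old` for `new`, keeping the list sorted."""
        idx = bisect.bisect_left(self._sorted, old)
        if idx == len(self._sorted) or self._sorted[idx] != old:
            raise ValueError(f"{old!r} is not in the data")
        self._sorted.pop(idx)               # O(n) shift; a balanced tree avoids this
        bisect.insort(self._sorted, new)    # binary search, then O(n) insert

    @property
    def median(self):
        n = len(self._sorted)
        mid = n // 2
        if n % 2:
            return self._sorted[mid]
        return (self._sorted[mid - 1] + self._sorted[mid]) / 2
```

A balanced tree (or a pair of them for the low and high halves) does the same bookkeeping without the O(n) list shifts.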
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #204      +/-   ##
==========================================
- Coverage   98.47%   98.21%   -0.27%
==========================================
  Files          43       43
  Lines        1839     2075     +236
  Branches      114      129      +15
==========================================
+ Hits         1811     2038     +227
- Misses         25       31       +6
- Partials        3        6       +3
Regarding the median: I was not satisfied with the performance of pure Python implementations of sorted containers which have |
An implementation that resolves #201.
Describe your changes
This is a rather large-looking PR (only because I am basing this off of #198 for ease-of-merge!), and is a sort of a first attempt (which is why I am marking it as a draft) at caching the various statistics properties as described in #201.
TL;DR: since at each iteration we only move one point in the dataset (basically map $(x_i, y_i) \mapsto (x'_i, y'_i)$), why not compute the new statistics ($\langle X' \rangle$, $\text{Var}(X')$, and $r(X', Y')$) in terms of the old ones ($\langle X \rangle$, $\text{Var}(X)$, and $r(X, Y)$) plus the old ($(x_i, y_i)$) and new ($(x'_i, y'_i)$) points? This is essentially what this PR does.
Details
Mathematical derivation
The mean is shifted as:
where $\delta_x = x'_i - x_i$ (same for $Y$), the variance (or the standard deviation if you want) is shifted as:
while the correlation coefficient is shifted as:
where the only new quantity to keep track of is $\langle X Y \rangle$ (the rest can be obtained from the above 2).
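For reference, one way to write these single-point shifts, assuming the population ($1/n$) convention for the variance and $n$ points in the dataset. The normalization is an assumption; with a sample ($1/(n-1)$) variance the expressions rescale, while $r$ is unaffected.

$$
\langle X' \rangle = \langle X \rangle + \frac{\delta_x}{n}
$$

$$
\text{Var}(X') = \text{Var}(X) + \frac{\delta_x}{n}\left[2\left(x_i - \langle X \rangle\right) + \delta_x\left(1 - \frac{1}{n}\right)\right]
$$

$$
\langle X' Y' \rangle = \langle X Y \rangle + \frac{x_i \delta_y + y_i \delta_x + \delta_x \delta_y}{n},
\qquad
r(X', Y') = \frac{\langle X' Y' \rangle - \langle X' \rangle \langle Y' \rangle}{\sqrt{\text{Var}(X')\,\text{Var}(Y')}}
$$

The same updates as a small Python sketch. The function names and signatures (`moments`, `shift_moments`, `corrcoef`, `max_drift`) are hypothetical and are not the PR's API; they only mirror the formulas above.

```python
import math
import random


def moments(xs, ys):
    """Compute (mean_x, mean_y, var_x, var_y, mean_xy) from scratch (1/n convention)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return mean_x, mean_y, var_x, var_y, mean_xy


def shift_moments(old, n, xi, yi, dx, dy):
    """Update the tuple from `moments` after the point (xi, yi) moves by (dx, dy)."""
    mean_x, mean_y, var_x, var_y, mean_xy = old
    new_var_x = var_x + (dx / n) * (2 * (xi - mean_x) + dx * (1 - 1 / n))
    new_var_y = var_y + (dy / n) * (2 * (yi - mean_y) + dy * (1 - 1 / n))
    new_mean_xy = mean_xy + (xi * dy + yi * dx + dx * dy) / n
    return mean_x + dx / n, mean_y + dy / n, new_var_x, new_var_y, new_mean_xy


def corrcoef(m):
    """Pearson correlation from the cached moments."""
    mean_x, mean_y, var_x, var_y, mean_xy = m
    return (mean_xy - mean_x * mean_y) / math.sqrt(var_x * var_y)


def max_drift(steps, n=100, seed=0):
    """Apply `steps` random single-point moves and return the largest gap between
    the incrementally updated moments and a full recomputation."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [rng.random() for _ in range(n)]
    cached = moments(xs, ys)
    for _ in range(steps):
        i, dx, dy = rng.randrange(n), rng.gauss(0, 0.1), rng.gauss(0, 0.1)
        cached = shift_moments(cached, n, xs[i], ys[i], dx, dy)
        xs[i] += dx
        ys[i] += dy
    return max(abs(a - b) for a, b in zip(cached, moments(xs, ys)))
```

Each update is O(1) instead of a full pass over the data, and `max_drift` is one rough way to probe how far the cached values wander from a fresh recomputation after many moves.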
Workflow
With the math out of the way, this is how the new way of computing the statistics works:
- Take the `(x, y)` dataset, and make an instance of `Statistics` (subject to change for its somewhat ambiguous naming) from it.
- `Statistics` has a method, `perturb`, which takes in 3 arguments: the row (index) we wish to perturb, and the 2 values of the perturbations (in the `x` and `y` directions), and returns an instance of the new `SummaryStatistics` (i.e. the one from the perturbed data).
- If `_is_close_enough` returns `True`, we call the `perturb` method again with `update=True`, which actually updates the data in the `Statistics` instance.

Note that there's a bunch of new methods (`shifted_mean`, `shifted_var`, `shifted_stdev`, `shifted_corrcoef`), which we may or may not want to make private (or refactor so they accept different arguments), as I didn't put too much thought into the overall API design (again, hence the "draft" status).

Performance
Now, onto the performance results! Note that the benchmarks were done with #201 already in there, so the absolute numbers differ from the ones on `main`, but the reduction in the number of function calls is evident. I tested everything using `python -m cProfile -m data_morph --seed 42 --start-shape panda --target-shape [shape]`.

So we use 26.1M fewer function calls for all shapes in question, which results in about 35% faster performance (computed as `(t_old - t_new) / t_old * 100`).
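To make the same kind of comparison in-process rather than from the command line, a minimal sketch using the standard library's `cProfile` and `pstats` is below. The helper names `count_calls` and `percent_faster` are hypothetical and not part of data_morph; `percent_faster` is just the formula quoted above.

```python
import cProfile
import pstats


def count_calls(func, *args, **kwargs):
    """Run `func` under cProfile and return the total number of profiled calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args, **kwargs)
    finally:
        profiler.disable()
    return pstats.Stats(profiler).total_calls


def percent_faster(t_old, t_new):
    """Relative speedup in percent: (t_old - t_new) / t_old * 100."""
    return (t_old - t_new) / t_old * 100
```

Counting calls is less noisy than wall-clock time on a busy machine, which is why the call counts are a useful second metric next to the timings.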
Discussion items
Some possible points of discussion:
- `shifted_*` functions private? Could `perturb` do without an `update` argument? etc.)
- `shifted_*` functions are a bit repetitive, and could take a while to run (I was initially very skeptical of the numerical stability, but it appears to be fine, though it could just be that I haven't tried enough scenarios to hit an instability)

Feedback is welcome :)
Checklist