-
Notifications
You must be signed in to change notification settings - Fork 197
Description
Hi, I clearly lack subtelty about the various definitions of weighted quantiles (and I passed quickly over the above discussion as a result), but I thought I'd share another, more obvious example of what what the current implementation is doing:
The frequency weights seems to do what I'd expect:
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1000, 1, 1, 1]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1000, 1, 1]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1000, 1]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1, 1000]))
4.0unlike the ProbabilityWeights:
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1000, 1, 1, 1]/1003))
2.500000000000056
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1000, 1, 1]/1003))
1.501
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1000, 1]/1003))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1, 1000]/1003))
3.499which seems to:
- ignore the first weight purely and simply (isn't that a bug ?? -- it's hard to believe it comes down to definition, unless it's a definition for continuous analytics applied bluntly to discrete arrays -- I ask the author(s) to please excuse me, I mean no offense)
- takes the mean between whatever two numbers the (weighted) median falls into, which can never be exactly reached
Worse, it is numerically inaccurate:
julia> eps = 1e-50 ;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
4.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
1.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
3.5whereas setting epsilon to exactly zero yields the same (and for me, expected) result as FrequencyWeights
julia> eps = 0.;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
4.0IMO the above shows surprising results that may go beyond the difference between various definitions (especially that the first weight is ignored, and possibly the discontinuity at the limit when the weights tend toward being concentrated on one element, though OK, discontinuities are parts of mathematics -- but they often don't help when analyzing real data).
Anyway, for me the workaround will be to multiply my weights by a large number, convert to integers, and use FrequencyWeights instead of ProbabilityWeights.
Originally posted by @perrette in #435 (comment)