wmedian: the first weight is ignored when ProbabilityWeights is used

Hi, I clearly lack subtelty about the various definitions of weighted quantiles (and I passed quickly over the above discussion as a result), but I thought I'd share another, more obvious example of what what the current implementation is doing:

The frequency weights seems to do what I'd expect:
```julia
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1000, 1, 1, 1]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1000, 1, 1]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1000, 1]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1, 1000]))
4.0
```

unlike the ProbabilityWeights:
```julia
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1000, 1, 1, 1]/1003))
2.500000000000056
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1000, 1, 1]/1003))
1.501
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1000, 1]/1003))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1, 1000]/1003))
3.499
```

which seems to:
- ignore the first weight purely and simply (isn't that a bug ?? -- it's hard to believe it comes down to definition, unless it's a definition for continuous analytics applied bluntly to discrete arrays -- I ask the author(s) to please excuse me, I mean no offense)
- takes the mean between whatever two numbers the (weighted) median falls into, which can never be exactly reached

Worse, it is numerically inaccurate:
```julia
julia> eps = 1e-50 ;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
4.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
1.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
3.5
```
whereas setting epsilon to exactly zero yields the same (and for me, expected) result as FrequencyWeights
```julia
julia> eps = 0.;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
4.0
```

IMO the above shows surprising results that may go beyond the difference between various definitions (especially that the first weight is ignored, and possibly the discontinuity at the limit when the weights tend toward being concentrated on one element, though OK, discontinuities are parts of mathematics -- but they often don't help when analyzing real data).

Anyway, for me the workaround will be to multiply my weights by a large number, convert to integers, and use FrequencyWeights instead of ProbabilityWeights.

_Originally posted by @perrette in https://github.com/JuliaStats/StatsBase.jl/issues/435#issuecomment-2162412722_
            

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

wmedian: the first weight is ignored when ProbabilityWeights is used #933

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

wmedian: the first weight is ignored when ProbabilityWeights is used #933

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions