
DOC: clarify the documentation for the loss functions used in GBRT, and Absolute Error in particular. #30339

Open
AhmedThahir opened this issue Nov 23, 2024 · 13 comments

Comments

@AhmedThahir

AhmedThahir commented Nov 23, 2024

Describe the bug

From my understanding, there is currently no way to minimize the MAE (Mean Absolute Error). Quantile regression with quantile=0.5 optimizes for the Median Absolute Error, which differs from optimizing the MAE when the conditional distribution of the response variable is not symmetric.

if sample_weight is None:
    return np.median(y_true, axis=0)
else:
    return _weighted_percentile(y_true, sample_weight, 50)

What I expect

  • Using HistGradientBoostingRegressor(loss="absolute_error") should optimize for the mean of absolute errors.
  • Using HistGradientBoostingRegressor(loss="quantile", quantile=0.5) should optimize for the median of absolute errors.
        if sample_weight is None:
            return np.mean(y_true, axis=0)
        else:
            return _weighted_mean(y_true, sample_weight)

What happens
Both give the same results

  • Using HistGradientBoostingRegressor(loss="absolute_error") optimizes for the median of absolute errors
  • Using HistGradientBoostingRegressor(loss="quantile", quantile=0.5) optimizes for the median of absolute errors

Suggested Actions

If this is intended behavior:

  • Feel free to close this issue as resolved.
  • Kindly add a note in the documentation that "Absolute Error optimizes for Median Absolute Error, not Mean Absolute Error", as "absolute_error" on its own is not very clear.
  • I would appreciate more explanation on using custom loss functions (Custom Loss function? #21614). This way, we could optimize for Mean Absolute Error, Median Absolute Error, Log-Cosh, etc. as required.

Note
I have tried my best to go through the documentation prior to creating this issue. I am a fresh graduate in Computer Science, and if you believe this issue is not well-framed due to a misunderstanding of my concepts, kindly advise me and I'll work on it.

Steps/Code to Reproduce

# Imports
from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np

# Dataset Generation
x = np.linspace(start=0, stop=10, num=100)

n_repeat = 100 # number of repeated samples for each x
X = np.repeat(x, n_repeat)[:, np.newaxis]
y_true_mean = 1 * np.repeat(x, n_repeat)
noise = np.random.RandomState(0).lognormal(mean=0, sigma=1, size=y_true_mean.shape[0])
y_noisy = y_true_mean + noise

# Model Creation
mae = HistGradientBoostingRegressor(loss="absolute_error") # should be mean of absolute errors
quantile = HistGradientBoostingRegressor(loss="quantile", quantile=0.5) # should be median of absolute errors

# Fit & Prediction
y_pred_mae = mae.fit(X, y_noisy).predict(X)
y_pred_quantile = quantile.fit(X, y_noisy).predict(X)

# Prediction Comparison
print((y_pred_mae - y_pred_quantile).sum()) # both give same results

Expected Results

Median and mean of absolute errors should give different results for a log-normally distributed response. Hence, the predictions should differ from each other, and the difference of their predictions should sum to a non-zero value.

Actual Results

Predictions by both models are the same, which can be seen in the difference of their predictions summing to 0.

0.

Versions

System:
    python: 3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]
executable: /usr/bin/python3
   machine: Linux-6.1.85+-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.5.2
          pip: 24.1.2
   setuptools: 75.1.0
        numpy: 1.26.4
        scipy: 1.13.1
       Cython: 3.0.11
       pandas: 2.2.2
   matplotlib: 3.8.0
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 2
         prefix: libopenblas
       filepath: /usr/local/lib/python3.10/dist-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 2
         prefix: libopenblas
       filepath: /usr/local/lib/python3.10/dist-packages/scipy.libs/libopenblasp-r0-01191904.3.27.so
        version: 0.3.27
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 2
         prefix: libgomp
       filepath: /usr/local/lib/python3.10/dist-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
@AhmedThahir AhmedThahir added Bug Needs Triage Issue requires triage labels Nov 23, 2024
@AhmedThahir AhmedThahir changed the title How to optimize for Mean Absolute Error Absolute Error optimizing for Median Absolute Error, not Mean Absolute Error Nov 23, 2024
@glemaitre
Member

The part of the code that you are pointing out is the computation of the intercept. The estimator of a model minimizing the mean absolute error is the median. So the computation of the intercept looks logical to me.

@AhmedThahir
Author

AhmedThahir commented Nov 25, 2024

From my understanding, optimizing Mean Absolute Error and Median Absolute Error should give different results, but I believe that either I have not been able to communicate my question clearly, or I have not provided sufficient backing for it, or I have misunderstood the intercept code above.

Any resources would be much appreciated.

I would appreciate more explanation on using custom loss functions (Custom Loss function? #21614). This way, we could optimize for Mean Absolute Error, Median Absolute Error, Log-Cosh, etc. as required.

Aside from that, I will try my best to research further and formulate a better question in the future. You may close this issue. Thank you!

@ogrisel
Member

ogrisel commented Dec 2, 2024

The median is the minimizer of the mean absolute error. Let's sample some data distributed in such a way that the mean and the median are different:

import numpy as np

data = np.random.lognormal(mean=0, sigma=1, size=1000)
np.median(data).round(4), np.mean(data).round(4)
(np.float64(0.9989), np.float64(1.6216))

Let's find the minimizer of MAE using scipy.optimize:

from scipy.optimize import fmin


def mae(x, data):
    return np.mean(np.abs(data - x))

x = np.zeros(shape=1, dtype=np.float64)
fmin(mae, x, args=(data,), disp=False)
array([0.99875])
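For contrast, the same fmin approach applied to the mean squared error recovers the sample mean rather than the median (a small sketch along the same lines, not part of the original comment's code):

```python
import numpy as np
from scipy.optimize import fmin

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0, sigma=1, size=1000)

def mse(x, data):
    # Mean squared error of a constant prediction x.
    return np.mean((data - x) ** 2)

x0 = np.zeros(shape=1, dtype=np.float64)
x_opt = fmin(mse, x0, args=(data,), disp=False)
# The minimizer of the MSE is the sample mean, not the median.
print(x_opt[0], np.mean(data))
```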

@ogrisel
Member

ogrisel commented Dec 2, 2024

Using HistGradientBoostingRegressor(loss="absolute_error") optimizes for the median of absolute errors

No, HistGradientBoostingRegressor(loss="absolute_error") minimizes the sum (or equivalently the mean) of absolute errors of the conditional estimates (as does the quantile loss with q=0.5). Both choices lead to an estimate of the conditional median of Y|X as a result.

@ogrisel
Member

ogrisel commented Dec 2, 2024

Please point us to the specific part of the documentation that was not clear and suggest a way to improve it. Otherwise I think we can close.

@AhmedThahir
Author

The median is the minimizer of the mean absolute error.

Noted. You are correct; I replicated your code, thanks!

@AhmedThahir
Author

AhmedThahir commented Dec 2, 2024

No, HistGradientBoostingRegressor(loss="absolute_error") minimizes the (conditional) mean absolute error (as does the quantile loss with q=0.5) and both estimate the conditional median of Y|X as a result.

But the code for quantile = 0.5 (i.e., the median) gives me a different result from that of the mean; this is my point:

def quantileae(x, data):
    return np.quantile(np.abs(data - x), 0.5)

x = np.zeros(shape=1, dtype=np.float64)
fmin(quantileae, x, args=(data,), disp=False)
array([0.58675])

def medianae(x, data):
    return np.median(np.abs(data - x))

x = np.zeros(shape=1, dtype=np.float64)
fmin(medianae, x, args=(data,), disp=False)
array([0.58675])

@AhmedThahir
Author

AhmedThahir commented Dec 2, 2024

Please point us to the specific part of the documentation that was not clear and suggest a way to improve it. Otherwise I think we can close.

Problem
The terms "squared_error" and "absolute_error" are not very clear: i.e., is it Mean Squared Error or Median Squared Error; Mean Absolute Error or Median Absolute Error? The documentation should make it clear that these are Mean Squared Error and Mean Absolute Error.

Specific part of the documentation
https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html

loss{‘squared_error’, ‘absolute_error’, ‘gamma’, ‘poisson’, ‘quantile’}, default=’squared_error’
The loss function to use in the boosting process. Note that the “squared error”, “gamma” and “poisson” losses actually implement “half least squares loss”, “half gamma deviance” and “half poisson deviance” to simplify the computation of the gradient. Furthermore, “gamma” and “poisson” losses internally use a log-link, “gamma” requires y > 0 and “poisson” requires y >= 0. “quantile” uses the pinball loss.

Suggestion

The loss function to use in the boosting process. Note that the “squared error”, “gamma” and “poisson” losses actually implement “half least squares loss”, “half gamma deviance” and “half poisson deviance” to simplify the computation of the gradient. Furthermore, “gamma” and “poisson” losses internally use a log-link, “gamma” requires y > 0 and “poisson” requires y >= 0. “quantile” uses the pinball loss. “absolute_error” refers to Mean Absolute Error; for other variants of absolute error, you need to write your own loss function. “squared_error” refers to Mean Squared Error; for other variants of squared error, you need to write your own loss function.

@ogrisel
Member

ogrisel commented Dec 3, 2024

But the code for quantile = 0.5 (i.e., the median) gives me a different result from that of the mean; this is my point

The quantile loss of HistGradientBoostingRegressor is not equivalent to the code you wrote above. The quantile loss is also known as the pinball loss, and it is documented here:

https://scikit-learn.org/stable/modules/model_evaluation.html#pinball-loss

Here is the equivalent Python code:

def quantile_loss(x, data, q=0.5):
    return np.mean(np.maximum(q * (data - x), (q - 1) * (data - x)))

x = np.zeros(shape=1, dtype=np.float64)
fmin(quantile_loss, x, args=(data,), disp=False)
array([0.99875])
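One way to see why this shares a minimizer with loss="absolute_error": with q = 0.5, the pinball loss above is point-wise identical to half the absolute error (a quick check based on the formula in the snippet above):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0, sigma=1, size=1000)
x = 0.7  # any candidate prediction
d = data - x

# Pinball loss with q = 0.5 versus half the absolute error:
pinball = np.maximum(0.5 * d, (0.5 - 1) * d)
half_abs = 0.5 * np.abs(d)

# max(0.5*d, -0.5*d) == 0.5*|d|, so the two losses coincide point-wise
# and therefore have the same minimizer.
same = np.allclose(pinball, half_abs)
print(same)
```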

What we could do is expand the following section:

https://scikit-learn.org/stable/modules/ensemble.html#loss-functions

to link to the matching metric functions from https://scikit-learn.org/stable/modules/model_evaluation.html to make that more explicit, and to reference this section from the docstring.

@ogrisel ogrisel changed the title Absolute Error optimizing for Median Absolute Error, not Mean Absolute Error DOC: clarify the documentation for the loss functions used in GBRT, and Absolute Error in particular. Dec 3, 2024
@ogrisel ogrisel added Documentation and removed Bug Needs Triage Issue requires triage labels Dec 3, 2024
@ogrisel
Member

ogrisel commented Dec 3, 2024

“absolute_error” refers to Mean Absolute Error; for other variants of absolute error, you need to write your own loss function, “squared_error” refers to Mean Squared Error; for other variants of squared error, you need to write your own loss function

I would not write that: for all loss functions, the objective is always computed as the sum (or equivalently the mean) of the individual loss evaluated on each data point in the training set. This is the case for absolute and squared error, but also for quantile, Poisson, Gamma... Furthermore, we should not advertise custom loss functions in the docs, as there is no official public API for them in scikit-learn.

@AhmedThahir
Author

we should not advertise custom loss functions in the docs, as there is no official public API for them in scikit-learn.

Noted. That is right.

@AhmedThahir
Author

for all loss functions, the objective is always computed as the sum (or equivalently the mean) of the individual loss evaluated on each data point in the training set

This is the point I'm not able to understand. Why is that so? I'm assuming it is implied, or an unspoken rule in Machine Learning, that the mean is the standard aggregate function for losses. Is that right?

So far, I've always specified mean/median whenever mentioning a loss function, as I was not aware of this. If you could share any resource I could read, I would appreciate it.

@glemaitre
Member

This is the point I'm not able to understand. Why is that so? I'm assuming it is implied, or an unspoken rule in Machine Learning, that the mean is the standard aggregate function for losses. Is that right?

You probably want to refer to:
https://en.wikipedia.org/wiki/Empirical_risk_minimization

You are interested in minimizing the risk, which is defined as the expectation of the loss function, and thus the mean (or equivalently the sum) aggregate.
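A minimal numpy sketch of this idea (the helper names are illustrative, not scikit-learn API): the empirical risk is the mean of a per-sample loss, and minimizing it over constant predictions recovers the median for absolute error and the mean for squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0, sigma=1, size=5000)

def empirical_risk(theta, loss, data):
    # Empirical risk: the mean of the per-sample loss over the data.
    return np.mean(loss(data, theta))

absolute = lambda y, t: np.abs(y - t)
squared = lambda y, t: (y - t) ** 2

# Minimize each empirical risk over a grid of constant predictions.
grid = np.linspace(0.0, 5.0, 2001)
theta_abs = grid[np.argmin([empirical_risk(t, absolute, data) for t in grid])]
theta_sq = grid[np.argmin([empirical_risk(t, squared, data) for t in grid])]

# Absolute error -> median; squared error -> mean.
print(theta_abs, np.median(data))
print(theta_sq, np.mean(data))
```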
