DOC: clarify the documentation for the loss functions used in GBRT, and Absolute Error in particular. #30339
Comments
The part of the code that you are pointing out is the computation of the intercept. The minimizer of the mean absolute error is the median, so computing the intercept as the median looks logical to me.
From my understanding, optimizing the Mean Absolute Error and the Median Absolute Error should give different results. Either I have not communicated my question clearly, I have not provided sufficient backing for it, or I have misread the intercept code above. Any resources would be much appreciated.

Aside from that, I will try my best to research further and formulate a better question in the future. You may close this issue. Thank you!
The median is the minimizer of the mean absolute error. Let's sample some data distributed in such a way that the mean and the median are different:

```python
import numpy as np

data = np.random.lognormal(mean=0, sigma=1, size=1000)
np.median(data).round(4), np.mean(data).round(4)
```

Let's find the minimizer of the MAE using scipy.optimize.fmin:

```python
from scipy.optimize import fmin

def mae(x, data):
    return np.mean(np.abs(data - x))

x = np.zeros(shape=1, dtype=np.float64)
fmin(mae, x, args=(data,), disp=False)
```
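(A note added here for clarity: for a log-normal distribution with mean=0 and sigma=1 in log-space, the theoretical median is exp(0) = 1 while the theoretical mean is exp(0.5) ≈ 1.65. The optimizer above should therefore return a value close to 1, i.e. close to the sample median rather than the sample mean; the exact value depends on the random sample.)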
No, …
Please point us to the specific part of the documentation that was not clear and suggest a way to improve it. Otherwise I think we can close.
Noted. You are correct; I replicated your code, thanks!
But the code for quantile = 0.5 (i.e., the median) gives me a different result from that of the mean; this is my point:

```python
def quantileae(x, data):
    return np.quantile(np.abs(data - x), 0.5)

x = np.zeros(shape=1, dtype=np.float64)
fmin(quantileae, x, args=(data,), disp=False)

def medianae(x, data):
    return np.median(np.abs(data - x))

x = np.zeros(shape=1, dtype=np.float64)
fmin(medianae, x, args=(data,), disp=False)
```
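A side note, not part of the original exchange: `np.quantile(np.abs(data - x), 0.5)` and `np.median(np.abs(data - x))` compute the same objective, the median of absolute deviations. That is a different objective from the mean of absolute deviations minimized earlier (roughly speaking, its minimizer is the center of the shortest interval covering half of the data), so for a skewed sample it generally differs from the sample median. A self-contained sketch comparing the two minimizers:

```python
import numpy as np
from scipy.optimize import fmin

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0, sigma=1, size=1000)

def mean_ae(x, data):
    # Mean of absolute deviations: minimized by the sample median.
    return np.mean(np.abs(data - x))

def median_ae(x, data):
    # Median of absolute deviations: a different objective, different minimizer.
    return np.median(np.abs(data - x))

x0 = np.zeros(shape=1, dtype=np.float64)
print("sample median:           ", np.median(data).round(4))
print("argmin of mean |d - x|:  ", fmin(mean_ae, x0, args=(data,), disp=False).round(4))
print("argmin of median |d - x|:", fmin(median_ae, x0, args=(data,), disp=False).round(4))
```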
The quantile loss is the pinball loss documented at https://scikit-learn.org/stable/modules/model_evaluation.html#pinball-loss. Here is the equivalent Python code:

```python
def quantile_loss(x, data, q=0.5):
    return np.mean(np.maximum(q * (data - x), (q - 1) * (data - x)))

x = np.zeros(shape=1, dtype=np.float64)
fmin(quantile_loss, x, args=(data,), disp=False)
```
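A note added for clarity (not in the original comment): for q = 0.5, max(q(d − x), (q − 1)(d − x)) = 0.5 |d − x|, so the pinball loss at the 0.5 quantile is exactly half the mean absolute error. Both objectives therefore share the same minimizer, the median, which is why loss="absolute_error" and loss="quantile" with quantile=0.5 lead to the same solution. A quick check, assuming the quantile_loss and data defined above:

```python
# At q=0.5 the pinball loss equals half the mean absolute error (illustrative check).
print(np.allclose(quantile_loss(1.0, data, q=0.5),
                  0.5 * np.mean(np.abs(data - 1.0))))  # True
```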
What we could do is expand the following section, https://scikit-learn.org/stable/modules/ensemble.html#loss-functions, to link to the matching metric functions from https://scikit-learn.org/stable/modules/model_evaluation.html, to make that more explicit, and to reference this section from the docstring.
I would not write that: for all loss functions, the objective is always computed as the sum (or equivalently the mean) of the individual loss evaluated on each data point in the training set. This is the case for absolute and squared error, but also for quantile, Poisson, gamma... Furthermore, we should not advertise custom loss functions in the doc, as there is no official public API for them in scikit-learn.
Noted. That is right.
This is the point I'm not able to understand. Why is that so? Is it an implied or unspoken convention in machine learning that the mean is the standard aggregate function for losses? So far, I've always specified mean/median whenever I mention a loss function, as I was not aware of such a convention. If you could share any resource that I could read, I would appreciate it.
You probably want to refer to the notion of risk minimization: you are interested in minimizing the risk, which is defined as the expectation of the loss function, and is thus estimated by the sum (or mean) aggregate.
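In symbols (a standard formulation, added here for reference): the risk of a model $f$ is $R(f) = \mathbb{E}[\ell(y, f(x))]$. Since this expectation is not directly available, training minimizes the empirical risk

$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)),$$

i.e. the mean (equivalently, up to a constant factor, the sum) of the pointwise losses over the training set.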
Describe the bug
From my understanding, there is currently no way to minimize the MAE (Mean Absolute Error). Quantile regression with quantile=0.5 will optimize the Median Absolute Error, which differs from optimizing the MAE when the conditional distribution of the response variable is not symmetrically distributed.
scikit-learn/sklearn/_loss/loss.py, lines 574 to 577 at commit 46a7c9a
What I expect
- HistGradientBoostingRegressor(loss="absolute_error") should optimize for the mean of absolute errors.
- HistGradientBoostingRegressor(loss="quantile", quantile=0.5) should optimize for the median of absolute errors.

What happens
Both give the same results:
- HistGradientBoostingRegressor(loss="absolute_error") optimizes for the median of absolute errors.
- HistGradientBoostingRegressor(loss="quantile", quantile=0.5) optimizes for the median of absolute errors.

Suggested Actions
If this is intended behavior:
Note
I have tried my best to go through the documentation prior to creating this issue. I am a fresh Computer Science graduate, so if you believe this issue is poorly framed due to a misunderstanding on my part, kindly advise me and I'll work on it.
Steps/Code to Reproduce
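The code block from the original report was not preserved in this copy of the issue; below is a minimal reconstruction consistent with the description (the feature matrix X, sample size, and seeds are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

# Skewed (log-normal) target, so that its mean and median differ.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.lognormal(mean=0, sigma=1, size=1000)

model_abs = HistGradientBoostingRegressor(
    loss="absolute_error", random_state=0).fit(X, y)
model_q50 = HistGradientBoostingRegressor(
    loss="quantile", quantile=0.5, random_state=0).fit(X, y)

# Total absolute difference between the two models' predictions
# (reported in this issue to be 0).
print(np.abs(model_abs.predict(X) - model_q50.predict(X)).sum())
```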
Expected Results
The median and mean of absolute errors should give different results for a log-normally distributed response. Hence, the predictions should differ, and the total absolute difference between the two models' predictions should be non-zero.
Actual Results
Predictions from both models are the same, which can be seen in the differences between their predictions totaling 0.
Versions