Revise predict to allow all 'type' #172
base: master
Conversation
I apologize. This is my third submission. I am unable to figure out how to squash my commits in order for them to be reviewed. This is my first PR ever.
Don't worry, we can work with you to get it through. You seem to understand what needs to be done here, so I'm sure it won't take us long. Some comments here:
I will work on a new version; it should not take long. I do not understand how to submit a revised PR as opposed to creating a new PR. I am doing this via the GitHub web interface, as I am flummoxed by the git command-line interface. It is easy to put margin back in and take out the parameter-checking lines. I understand the implications of
The return shape for predict
I was able to make some modifications. They are showing above, so I guess I don't need to do anything special to revise the PR.
Held off on
Bob
Gah, I did not realize that some of the outputs were higher rank. That's kind of a big problem; now we are going to need to be really careful. I definitely don't think we can just go and do
Anyway, this otherwise looks good to me. Would you like to change this so that it does
I think it best if we leave the current 'predict' alone and rename my version; I chose 'predictbytype'. It avoids all the issues you correctly point out. Altering 'type' is only for the Shapley values, and up until now I am the only one who has asked. This would be a function you could point the occasional user to if they request Shapley; if enough ask, then we can consider rolling it into one function. If you think a separate function is a bad idea, that is okay. Studying your code, I've gotten better at writing personal functions that leverage Lib.jl directly :-) As an aside, many of the current XGBoost papers (in the medical literature) are showing Shapley data; it has come to be expected. Leveraging the libxgboost routines makes these analyses a very simple endeavor. To my eye, Shapley-based dependency and partial-dependency plots are qualitatively very similar, but the train is on the tracks: how can you argue with 'game theory'?
Following up on your reply (although I don't see it above). In 90+% of cases the return rank will be 1 and there will be no transpose or reallocation. Under 'normal' use the return rank will be 2 only for multi-class models. What came up for me, though, is that the return type is
This could be due to my inexperience. I am not familiar with MLJ in general, or with the specifics of how it implements multi-class XGBoost models, particularly with cross-validation (where predict is called repeatedly). Assuming it functions fine with
Technically, I have taken up too much of your time.
That said, it's not great that the output of
I do think perhaps we should provide functions which are type-stable even if
Suffice it to say, I see two options; I have not decided on which is preferable:
We should probably do the latter, but again, it's not like
These changes would allow a user access to a feature of libxgboost that reports feature contributions and/or interactions at the record level. This can be useful for Shapley-type analyses. The additional data is obtained by specifying the 'type' parameter in Lib.XGBoosterPredictFromDMatrix().
A 'type' parameter is added to predict(); input values are 0 through 6. The meaning associated with each parameter value is documented in the docstring. This is an optional parameter with a default of 0, which produces normal output.
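For reference, the seven 'type' codes select the prediction mode in libxgboost. The sketch below is in Python purely for illustration (the PR itself is Julia); the descriptions follow my reading of the libxgboost C API documentation for XGBoosterPredictFromDMatrix, and `describe_type` is a hypothetical helper, not part of the PR:

```python
# Prediction 'type' codes accepted by libxgboost's XGBoosterPredictFromDMatrix,
# as described in the C API documentation (illustrative sketch only).
PREDICT_TYPES = {
    0: "normal prediction",
    1: "output margin (untransformed scores)",
    2: "feature contributions (SHAP values)",
    3: "approximated feature contributions",
    4: "feature interactions (SHAP interaction values)",
    5: "approximated feature interactions",
    6: "leaf indices",
}

def describe_type(t: int) -> str:
    """Return a description for a 'type' code, rejecting values outside 0..6."""
    if t not in PREDICT_TYPES:
        raise ValueError(f"'type' must be in 0..6, got {t}")
    return PREDICT_TYPES[t]
```

Validating the code up front, as here, mirrors the parameter checking the PR adds to predict().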
I was not certain what to do with the old parameter 'margin'. I removed it, as it is redundant with the 'type' specification, although this might cause issues for others who have it in existing code.
There is significant variation in the dimensions of the returned data, depending on the 'type' and the booster objective (multi-class objectives return an extra dimension). 'type' values 2 and 3 return a 2-dimensional array; values 4 and 5 return a 3-dimensional array. transpose() fails on a 3-dimensional array, so it is replaced with permutedims(). This creates a trade-off: permutedims() reallocates memory for the array, although the Matrix type is more robust than the Transpose type. For normal prediction (i.e. 'type'=0, where the return is a vector), there is no additional allocation, so this should not impact operations that call predict many times (e.g. for the creation of learning curves or cross-validation).
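The rank logic described above can be sketched as follows. This is a Python/NumPy illustration (the PR itself is Julia), and `reshape_prediction` is a hypothetical helper: it shows how the flat buffer returned by libxgboost would acquire its rank from 'type' and the number of classes, assuming contributions carry one extra column for the bias term.

```python
import numpy as np

def reshape_prediction(buf, n_rows, n_class, n_feat, type_):
    """Reshape a flat libxgboost prediction buffer (illustrative sketch).

    Types 0/1 yield a vector (a matrix for multi-class models), types 2/3
    add a feature axis (n_feat + 1, for the bias column), and types 4/5
    add two feature axes.  A multi-class objective adds one extra dimension.
    """
    buf = np.asarray(buf)
    if type_ in (0, 1):
        shape = (n_rows,) if n_class == 1 else (n_rows, n_class)
    elif type_ in (2, 3):
        shape = ((n_rows, n_feat + 1) if n_class == 1
                 else (n_rows, n_class, n_feat + 1))
    elif type_ in (4, 5):
        shape = ((n_rows, n_feat + 1, n_feat + 1) if n_class == 1
                 else (n_rows, n_class, n_feat + 1, n_feat + 1))
    elif type_ == 6:
        # leaf indices: one entry per tree, tree count inferred from the buffer
        shape = (n_rows, buf.size // n_rows)
    else:
        raise ValueError("'type' must be in 0..6")
    return buf.reshape(shape)
```

Note that NumPy's reshape returns a view where possible, while Julia's permutedims() copies; the sketch only demonstrates the rank bookkeeping, not the allocation trade-off.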
Bob