I agree that we should improve our documentation and probably revamp it.
Our user guide currently shows how to use the imputers but does not provide any recommendations or insight into the process. We did not have enough insight when we wrote it.
I think that we should make the following improvements:
- Add a recommendation section covering categorical versus continuous imputation, which imputation strategy to use with which type of model, and HistGradientBoosting models specifically.
- Add information regarding the type of missingness (MNAR, MCAR, etc.).
- Be sure to discuss the limitations of the current model (which strategies are adequate with which type of data).
- Add information regarding transformers being lenient to missing values.
I think that it would be extremely valuable to have input from @GaelVaroquaux and @marineLM when writing the recommendations.
I really think that we should also solve the issues that we have with the IterativeImputer (#14338), but that is another story.
glemaitre changed the title from "Document how (not to) impute categorical features" to "Improve documentation of the impute module" on Nov 22, 2021
Describe the issue linked to the documentation
In #17087, it was requested to make `IterativeImputer` work with mixed types. This is useful because folks often want to impute categorical data.

In the absence of an official solution, there are a few blog posts out there as well as Stack Overflow answers doing things like creating custom `Ordinal/LabelEncoder` that preserve the missing categorical values to later apply a `KNNImputer` (see [1], [2]). Alternatively, folks resort to things like https://pypi.org/project/autoimpute/.

It seems that there are some packages that impute categorical data in R, like https://cran.r-project.org/web/packages/missMDA/index.html. However, it's not clear if there are alternatives in scikit-learn, or even in Python.
In summary, it is very difficult for a non-expert to understand if these approaches are valid, or if they lead to incoherent results for some reason.
Suggest a potential alternative/fix
Even though #17346 was closed, it would be cool to at least document some authoritative way of how (not to) impute missing categorical data.
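As a starting point for such documentation, here is a hedged sketch of two simple baselines that already exist in `SimpleImputer`. The toy column and the `"missing"` fill value are assumptions chosen for illustration, and neither option is claimed to be the right default:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column with one missing entry (assumed for illustration).
X = pd.DataFrame({"color": ["red", "blue", np.nan, "red"]})

# Baseline 1: fill with the most frequent category of the column.
most_frequent = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Baseline 2: keep missingness visible as its own category.
explicit = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(X)
```

The second option keeps the information that a value was missing, which may matter when missingness is informative. The first can distort category frequencies when many values are missing.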