Improve documentation of the `impute` module #21722

astrojuanlu · 2021-11-20T08:54:03Z

Describe the issue linked to the documentation

In #17087, it was requested to make IterativeImputer work with mixed types. This is useful because folks often want to impute categorical data.

In the absence of an official solution, there are a few blog posts and Stack Overflow answers that do things like writing a custom Ordinal/LabelEncoder that preserves the missing categorical values so that a KNNImputer can be applied afterwards (see [1], [2]). Alternatively, folks resort to packages like https://pypi.org/project/autoimpute/.

It seems that there are some packages that impute categorical data in R, like https://cran.r-project.org/web/packages/missMDA/index.html. However, it's not clear if there are alternatives in scikit-learn, or even in Python.

In summary, it is very difficult for a non-expert to understand if these approaches are valid, or if they lead to incoherent results for some reason.

Suggest a potential alternative/fix

Even though #17346 was closed, it would be cool to at least document an authoritative way to (or not to) impute missing categorical data.


glemaitre · 2021-11-20T12:31:00Z

I agree that we should improve our documentation and probably revamp it.

Our user guide currently shows how to use the imputers but does not provide any recommendations or insight into the process. We did not have enough insight when we wrote it.

I think that we should make the following improvements:

- Add a recommendation section. We should discuss: categorical/continuous imputation, which imputation to use with which type of model, and HistGradientBoosting models specifically.
- Add information regarding the type of missingness (MNAR, MCAR, etc.).
- Be sure to discuss limitations of the current model (which strategies are adequate with which type of data).
- Add information regarding transformers being lenient to missing values.

I think that it would be extremely valuable to have input from @GaelVaroquaux @marineLM when writing recommendations.

I also really think that we should solve the issues that we have with the IterativeImputer (#14338), but that is another story.

astrojuanlu added the Documentation label Nov 20, 2021

glemaitre added the Hard label Nov 20, 2021

glemaitre changed the title ~~Document how (not to) impute categorical features~~ Improve documentation of the impute module Nov 22, 2021

cmarmo added the module:impute label Sep 14, 2022
