Improve documentation of the impute module #21722

Open
astrojuanlu opened this issue Nov 20, 2021 · 1 comment

@astrojuanlu

Describe the issue linked to the documentation

In #17087, it was requested to make IterativeImputer work with mixed types. This is useful because folks often want to impute categorical data.

In the absence of an official solution, several blog posts and Stack Overflow answers suggest workarounds such as writing a custom Ordinal/LabelEncoder that preserves missing categorical values so that a KNNImputer can be applied afterwards (see [1], [2]). Alternatively, folks resort to third-party packages like https://pypi.org/project/autoimpute/.

It seems that there are some packages that impute categorical data in R, like https://cran.r-project.org/web/packages/missMDA/index.html. However, it's not clear if there are alternatives in scikit-learn, or even in Python.

In summary, it is very difficult for a non-expert to understand if these approaches are valid, or if they lead to incoherent results for some reason.

Suggest a potential alternative/fix

Even though #17346 was closed, it would be cool to at least document, in an authoritative way, how (not) to impute missing categorical data.

@glemaitre added the Hard (hard level of difficulty) label on Nov 20, 2021
@glemaitre
Member

I agree that we should improve our documentation and probably revamp it.

Our user guide currently shows how to use the imputers but does not provide any recommendations or insight into the process. We did not have enough insight at the time of writing it.

I think that we should make the following improvements:

  • Add a recommendation section covering categorical vs. continuous imputation, which imputation strategy to use with which type of model, and HistGradientBoosting models specifically.
  • Add information regarding the type of missingness (MNAR, MCAR, etc.).
  • Be sure to discuss limitations of the current model (which strategies are adequate with which type of data).
  • Add information regarding transformers being lenient to missing values.
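
To illustrate why the type of missingness matters, here is a small standard-library simulation. It is only a sketch under textbook definitions: MCAR means missingness is independent of the data, MAR means it depends only on observed values, and MNAR means it depends on the missing value itself. The variables `x` and `z`, the sigmoid missingness model, and the sample size are all made-up assumptions.

```python
import math
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 50_000

# z is always observed; x is the variable that will receive missing values.
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [zi + rng.gauss(0.0, 1.0) for zi in z]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def observed_mean(values, p_missing):
    """Mean of the values that survive a per-sample missingness probability."""
    kept = [v for v, p in zip(values, p_missing) if rng.random() >= p]
    return fmean(kept)


true_mean = fmean(x)
mcar = observed_mean(x, [0.3] * n)                      # independent of the data
mar = observed_mean(x, [sigmoid(2 * zi) for zi in z])   # depends only on observed z
mnar = observed_mean(x, [sigmoid(2 * xi) for xi in x])  # depends on x itself

print(f"true={true_mean:.3f} MCAR={mcar:.3f} MAR={mar:.3f} MNAR={mnar:.3f}")
```

In this setup the mean of the observed values stays close to the true mean under MCAR but is pulled clearly downwards under MAR and MNAR. Under MAR the missingness is explained by the observed `z`, so an imputer that conditions on `z` can in principle correct for it; under MNAR no observed variable carries that information.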

I think that it would be extremely valuable to have input from @GaelVaroquaux @marineLM when writing recommendations.

I really think that we should also solve the issues that we have with the IterativeImputer (#14338), but that is another story.

@glemaitre changed the title from "Document how (not to) impute categorical features" to "Improve documentation of the impute module" on Nov 22, 2021

3 participants