RegressorChain support for Pipelines including ColumnTransformer #20557

Open
lokijota opened this issue Jul 18, 2021 · 4 comments

@lokijota

Describe the bug

I can't get RegressorChain to work with pipelines that include a ColumnTransformer. I posted a question on StackOverflow with more details: https://stackoverflow.com/questions/68430993/sklearn-using-regressorchain-with-columntransformer-in-pipelines .

Somewhere in __init__.py / _get_column_indices(X, key) this call fails: all_columns = X.columns saying 'numpy.ndarray' object has no attribute 'columns'. Because this is a known issue with ColumnTransformer, I suspect the RegressorChain can't be used with it.

I'm not sure if this is a supported scenario, but the documentation for RegressorChain (https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.RegressorChain.html), for set_params, includes this:

"The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object."

So I was led to assume it would also work with Pipelines including the column transformer.

Steps/Code to Reproduce

Any example with a Pipeline containing a ColumnTransformer and a Regressor. The StackOverflow link I included above has my code.
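
For readers without access to the StackOverflow post, here is a minimal sketch of the kind of setup described. It is not the original code: the column names, the random data, and the choice of StandardScaler and LinearRegression are all assumptions made for illustration. The ValueError is caught so the snippet runs to completion whether or not the installed version still raises it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import RegressorChain
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Hypothetical data: three numeric features, two regression targets.
X = pd.DataFrame(rng.rand(20, 3), columns=["a", "b", "c"])
y = rng.rand(20, 2)

# Columns are selected by *name*, which requires a DataFrame input.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["a", "b"])],
    remainder="passthrough",
)
chain_pipeline = Pipeline([("prep", preprocess), ("reg", LinearRegression())])

# The base estimator is passed positionally to avoid depending on the
# keyword name.
chain_regressor = RegressorChain(chain_pipeline)

error = None
try:
    chain_regressor.fit(X, y)
except ValueError as exc:
    # On the versions listed in this report, the chain hands a NumPy array
    # to the pipeline, so string column selection fails here.
    error = exc

print(repr(error))
```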

Expected Results

Fitted pipeline.

Actual Results

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\utils\__init__.py in _get_column_indices(X, key)
    373         try:
--> 374             all_columns = X.columns
    375         except AttributeError:

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-181-24da1e03388c> in <module>
      3 
      4 chain_regressor = RegressorChain(base_estimator=chain_pipeline) #, order=[1,0,2])
----> 5 chain_regressor.fit(X, y)
      6 
      7 

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\multioutput.py in fit(self, X, Y, **fit_params)
    840         self : object
    841         """
--> 842         super().fit(X, Y, **fit_params)
    843         return self
    844 

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\multioutput.py in fit(self, X, Y, **fit_params)
    507         for chain_idx, estimator in enumerate(self.estimators_):
    508             y = Y[:, self.order_[chain_idx]]
--> 509             estimator.fit(X_aug[:, :(X.shape[1] + chain_idx)], y,
    510                           **fit_params)
    511             if self.cv is not None and chain_idx < len(self.estimators_) - 1:

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\compose\_target.py in fit(self, X, y, **fit_params)
    205             self.regressor_ = clone(self.regressor)
    206 
--> 207         self.regressor_.fit(X, y_trans, **fit_params)
    208 
    209         return self

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
    339         """
    340         fit_params_steps = self._check_fit_params(**fit_params)
--> 341         Xt = self._fit(X, y, **fit_params_steps)
    342         with _print_elapsed_time('Pipeline',
    343                                  self._log_message(len(self.steps) - 1)):

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params_steps)
    301                 cloned_transformer = clone(transformer)
    302             # Fit or load from cache the current transformer
--> 303             X, fitted_transformer = fit_transform_one_cached(
    304                 cloned_transformer, X, y, None,
    305                 message_clsname='Pipeline',

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    752     with _print_elapsed_time(message_clsname, message):
    753         if hasattr(transformer, 'fit_transform'):
--> 754             res = transformer.fit_transform(X, y, **fit_params)
    755         else:
    756             res = transformer.fit(X, y, **fit_params).transform(X)

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
    503         self._validate_transformers()
    504         self._validate_column_callables(X)
--> 505         self._validate_remainder(X)
    506 
    507         result = self._fit_transform(X, y, _fit_transform_one)

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\compose\_column_transformer.py in _validate_remainder(self, X)
    330         cols = []
    331         for columns in self._columns:
--> 332             cols.extend(_get_column_indices(X, columns))
    333 
    334         remaining_idx = sorted(set(range(self._n_features)) - set(cols))

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\utils\__init__.py in _get_column_indices(X, key)
    374             all_columns = X.columns
    375         except AttributeError:
--> 376             raise ValueError("Specifying the columns using strings is only "
    377                              "supported for pandas DataFrames")
    378         if isinstance(key, str):

ValueError: Specifying the columns using strings is only supported for pandas DataFrames
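
The error message suggests one possible workaround, sketched here under assumptions rather than taken from the report: select columns by integer position instead of by name, since positional indices also work on NumPy arrays. The data, column names and estimators are hypothetical. Note that remainder="passthrough" matters in this sketch, because the chain appends earlier targets as extra columns, and the default remainder="drop" would discard them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import RegressorChain
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Hypothetical numeric data with two targets.
X = pd.DataFrame(rng.rand(20, 3), columns=["a", "b", "c"])
y = rng.rand(20, 2)

# Integer positions instead of column names: valid for arrays and DataFrames.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), [0, 1])],
    remainder="passthrough",  # keep "c" and the appended chain targets
)
chain_pipeline = Pipeline([("prep", preprocess), ("reg", LinearRegression())])

chain_regressor = RegressorChain(chain_pipeline)
chain_regressor.fit(X, y)

predictions = chain_regressor.predict(X)
print(predictions.shape)
```

The trade-off is that positional indices silently break if the column order of the input changes, so this is a stopgap rather than a fix.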

Versions

System:
python: 3.8.10 (default, May 19 2021, 13:12:57) [MSC v.1916 64 bit (AMD64)]
executable: C:\ProgramData\Anaconda3\envs\py38aml\python.exe
machine: Windows-10-10.0.22000-SP0

Python dependencies:
pip: 21.1.3
setuptools: 52.0.0.post20210125
sklearn: 0.24.2
numpy: 1.20.2
scipy: 1.6.2
Cython: None
pandas: 1.2.5
matplotlib: 3.3.4
joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True

@thomasjpfan
Member

Currently the regressor chain needs to slice the data, which means it converts the data into NumPy arrays before passing it to the base estimator:

X, Y = self._validate_data(X, Y, multi_output=True, accept_sparse=True)

At a glance, I think the implementation can be adjusted to preserve the DataFrame and ultimately pass a pandas DataFrame through to the base estimator.
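
A small illustration of why the column names are lost. It uses the public check_array helper as a stand-in for the validation step quoted above; that the two behave alike for this purpose is an assumption, and the DataFrame is made up.

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# Hypothetical two-column DataFrame.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Input validation returns a plain NumPy array, so column labels are gone.
validated = check_array(df, accept_sparse=True)

print(type(validated).__name__)
print(hasattr(validated, "columns"))
```

Any ColumnTransformer downstream that selects columns by name therefore has nothing to look the names up in.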

@glemaitre
Member

glemaitre commented Dec 17, 2021

This is closely linked to the question of validation in meta-estimators, which we are also dealing with for feature_names.
It is one such case where we should think about how to validate the data.

@adrinjalali
Member

So can one say we can remove validation here, just use _safe_indexing and solve the issue? (putting it as an easy help wanted one, unless there's something I'm missing)
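
As a rough illustration of the suggestion, the sketch below shows that _safe_indexing keeps a DataFrame a DataFrame when selecting columns. _safe_indexing is a private scikit-learn helper, so its import path and behaviour may differ between versions; the DataFrame here is hypothetical.

```python
import pandas as pd
from sklearn.utils import _safe_indexing  # private helper, may change

# Hypothetical three-column DataFrame.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})

# Select the first two columns by position; the container type is preserved.
subset = _safe_indexing(df, [0, 1], axis=1)

print(type(subset).__name__)
print(list(subset.columns))
```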

@thomasjpfan
Member

It's a bit more involved because RegressorChain creates a new X_aug with X, which includes y:

if self.cv is None:
    Y_pred_chain = Y[:, self.order_]
    if sp.issparse(X):
        X_aug = sp.hstack((X, Y_pred_chain), format="lil")
        X_aug = X_aug.tocsr()
    else:
        X_aug = np.hstack((X, Y_pred_chain))

Which it will later slice:

estimator.fit(X_aug[:, : (X.shape[1] + chain_idx)], y, **fit_params)

To pass a DataFrame to estimator.fit, RegressorChain.fit would need:

  1. Explicitly support horizontally stacking DataFrames when generating X_aug.
  2. Give good names to the Y columns in X_aug.

With that in mind, I'm labeling this as hard.
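
To make the two requirements concrete, here is a sketch of what a DataFrame-preserving X_aug could look like outside of scikit-learn. It is not the library's implementation: the data is random, and the "chain_y{i}" naming scheme is a hypothetical choice, which is exactly the open design question in point 2.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Hypothetical inputs: two named features, three targets.
X = pd.DataFrame(rng.rand(5, 2), columns=["a", "b"])
Y = rng.rand(5, 3)
order = [0, 1, 2]  # chain order, playing the role of self.order_

# Point 2: the appended target columns need names (hypothetical scheme).
y_names = [f"chain_y{i}" for i in order]
Y_pred_chain = pd.DataFrame(Y[:, order], columns=y_names, index=X.index)

# Point 1: stack horizontally while keeping the DataFrame container.
X_aug = pd.concat([X, Y_pred_chain], axis=1)

# Each estimator in the chain sees X plus the targets that precede it.
for chain_idx in range(Y.shape[1]):
    X_step = X_aug.iloc[:, : X.shape[1] + chain_idx]
    print(chain_idx, list(X_step.columns))
```

A real implementation would also have to cover sparse input, the cv branch that replaces true targets with cross-validated predictions, and possible name clashes between the generated names and existing feature names.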

@thomasjpfan thomasjpfan added Hard Hard level of difficulty and removed Easy Well-defined and straightforward way to resolve labels Oct 9, 2022
5 participants