RegressorChain support for Pipelines including ColumnTransformer #20557

lokijota · 2021-07-18T16:46:16Z

Describe the bug

I can't get RegressorChain to work with pipelines that include a ColumnTransformer. I posted a question on StackOverflow with more details: https://stackoverflow.com/questions/68430993/sklearn-using-regressorchain-with-columntransformer-in-pipelines .

In sklearn/utils/__init__.py, _get_column_indices(X, key) fails at the call all_columns = X.columns with "'numpy.ndarray' object has no attribute 'columns'". Because ColumnTransformer is known to require a DataFrame when columns are selected by name, I suspect RegressorChain can't be used with it.
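The failure mode described above can be illustrated without RegressorChain at all. The following is a minimal sketch, assuming toy data with hypothetical column names "a" and "b": a ColumnTransformer that selects columns by string works on a DataFrame, raises a ValueError on the equivalent ndarray, and works again on the ndarray when integer positions are used instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

by_name = ColumnTransformer(
    [("scale", StandardScaler(), ["a"])], remainder="passthrough"
)

# String selectors work while the input is still a DataFrame.
out_df = by_name.fit_transform(df)

# The same transformer fails once the input has become an ndarray.
try:
    by_name.fit_transform(df.to_numpy())
    ndarray_error = None
except ValueError as exc:
    ndarray_error = str(exc)

# Integer positions do not depend on column names, so an ndarray is accepted.
by_position = ColumnTransformer(
    [("scale", StandardScaler(), [0])], remainder="passthrough"
)
out_arr = by_position.fit_transform(df.to_numpy())
```

The exact wording of the error message may differ between scikit-learn releases; the sketch only relies on the exception type.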

I'm not sure if this is a supported scenario, but the documentation for RegressorChain (https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.RegressorChain.html), for set_params, includes this:

"The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object."

So I was led to assume it would also work with Pipelines including the column transformer.
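For context, the quoted sentence describes how nested hyperparameters are addressed by name. A minimal sketch of that naming scheme on a plain Pipeline follows; the step names "scale" and "ridge" are assumptions chosen for illustration. It says nothing about which input type reaches the nested steps during fit.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical step names: "scale" and "ridge".
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))])

# Nested parameter name: step name, double underscore, parameter name.
pipe.set_params(ridge__alpha=0.5)
alpha_after = pipe.get_params()["ridge__alpha"]
```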

Steps/Code to Reproduce

Any example in which a RegressorChain wraps a Pipeline containing a ColumnTransformer and a regressor. The StackOverflow link above has my code.
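A minimal sketch of such an example is given below. It is not the code from the StackOverflow post: the data, column names, and step names are hypothetical, and the traceback in this report additionally involves a TransformedTargetRegressor, which is left out here. With the scikit-learn version listed under Versions, the fit call is expected to end in the ValueError shown under Actual Results; other releases may behave differently, so the sketch records the outcome instead of assuming it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import RegressorChain
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features, two targets.
X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [0.5, 0.1, 0.9, 0.3]})
Y = pd.DataFrame({"y1": [1.0, 2.1, 2.9, 4.2], "y2": [0.0, 1.0, 0.0, 1.0]})

preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["a", "b"])],  # columns selected by name
    remainder="passthrough",
)
chain_pipeline = Pipeline(
    [("preprocess", preprocess), ("model", LinearRegression())]
)

# Passed positionally so the sketch does not depend on the keyword name.
chain_regressor = RegressorChain(chain_pipeline)

try:
    chain_regressor.fit(X, Y)
    outcome = "fitted"
except ValueError as exc:
    outcome = f"ValueError: {exc}"
print(outcome)
```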

Expected Results

Fitted pipeline.

Actual Results

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\utils\__init__.py in _get_column_indices(X, key)
    373         try:
--> 374             all_columns = X.columns
    375         except AttributeError:

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-181-24da1e03388c> in <module>
      3 
      4 chain_regressor = RegressorChain(base_estimator=chain_pipeline) #, order=[1,0,2])
----> 5 chain_regressor.fit(X, y)
      6 
      7 

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\multioutput.py in fit(self, X, Y, **fit_params)
    840         self : object
    841         """
--> 842         super().fit(X, Y, **fit_params)
    843         return self
    844 

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\multioutput.py in fit(self, X, Y, **fit_params)
    507         for chain_idx, estimator in enumerate(self.estimators_):
    508             y = Y[:, self.order_[chain_idx]]
--> 509             estimator.fit(X_aug[:, :(X.shape[1] + chain_idx)], y,
    510                           **fit_params)
    511             if self.cv is not None and chain_idx < len(self.estimators_) - 1:

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\compose\_target.py in fit(self, X, y, **fit_params)
    205             self.regressor_ = clone(self.regressor)
    206 
--> 207         self.regressor_.fit(X, y_trans, **fit_params)
    208 
    209         return self

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
    339         """
    340         fit_params_steps = self._check_fit_params(**fit_params)
--> 341         Xt = self._fit(X, y, **fit_params_steps)
    342         with _print_elapsed_time('Pipeline',
    343                                  self._log_message(len(self.steps) - 1)):

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params_steps)
    301                 cloned_transformer = clone(transformer)
    302             # Fit or load from cache the current transformer
--> 303             X, fitted_transformer = fit_transform_one_cached(
    304                 cloned_transformer, X, y, None,
    305                 message_clsname='Pipeline',

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    752     with _print_elapsed_time(message_clsname, message):
    753         if hasattr(transformer, 'fit_transform'):
--> 754             res = transformer.fit_transform(X, y, **fit_params)
    755         else:
    756             res = transformer.fit(X, y, **fit_params).transform(X)

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
    503         self._validate_transformers()
    504         self._validate_column_callables(X)
--> 505         self._validate_remainder(X)
    506 
    507         result = self._fit_transform(X, y, _fit_transform_one)

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\compose\_column_transformer.py in _validate_remainder(self, X)
    330         cols = []
    331         for columns in self._columns:
--> 332             cols.extend(_get_column_indices(X, columns))
    333 
    334         remaining_idx = sorted(set(range(self._n_features)) - set(cols))

C:\ProgramData\Anaconda3\envs\py38aml\lib\site-packages\sklearn\utils\__init__.py in _get_column_indices(X, key)
    374             all_columns = X.columns
    375         except AttributeError:
--> 376             raise ValueError("Specifying the columns using strings is only "
    377                              "supported for pandas DataFrames")
    378         if isinstance(key, str):

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Versions

System:
python: 3.8.10 (default, May 19 2021, 13:12:57) [MSC v.1916 64 bit (AMD64)]
executable: C:\ProgramData\Anaconda3\envs\py38aml\python.exe
machine: Windows-10-10.0.22000-SP0

Python dependencies:
pip: 21.1.3
setuptools: 52.0.0.post20210125
sklearn: 0.24.2
numpy: 1.20.2
scipy: 1.6.2
Cython: None
pandas: 1.2.5
matplotlib: 3.3.4
joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True


thomasjpfan · 2021-07-23T14:54:32Z

Currently, RegressorChain needs to slice the data, so it converts the input into NumPy arrays before passing it to the base estimator:

scikit-learn/sklearn/multioutput.py

Line 521 in a5a9d17

X, Y = self._validate_data(X, Y, multi_output=True, accept_sparse=True)

At a glance, I think the implementation can be adjusted to preserve the DataFrame and ultimately pass a pandas DataFrame through to the base estimator.
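The effect of that validation step can be sketched with the public check_array helper. Using check_array as a stand-in for the private _validate_data call is an assumption made for illustration; the point is only that a DataFrame comes out as a plain ndarray without column names.

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# Hypothetical toy data.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Public validation helper, used here as a stand-in for _validate_data.
validated = check_array(df)

has_columns_before = hasattr(df, "columns")
has_columns_after = hasattr(validated, "columns")
```

After this conversion, a downstream ColumnTransformer that selects columns by name has nothing to look the names up in.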

glemaitre · 2021-12-17T18:39:21Z

This is closely linked to validation in meta-estimators, which we are also dealing with for feature_names.
It is one of the cases where we should think about how to validate the data.

adrinjalali · 2022-10-04T10:28:53Z

So can we remove validation here, use _safe_indexing instead, and solve the issue? (I'm labeling it as an easy, help-wanted issue, unless there's something I'm missing.)
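As a rough illustration of what _safe_indexing offers, the sketch below selects columns by position from both a DataFrame and an ndarray and keeps the container type in each case. Note that _safe_indexing is a private scikit-learn helper, so its import path and behavior may change between releases; the toy data is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.utils import _safe_indexing  # private helper; may change between releases

# Hypothetical toy data.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})

# Select the first two columns by position along the column axis.
df_slice = _safe_indexing(df, [0, 1], axis=1)
arr_slice = _safe_indexing(df.to_numpy(), [0, 1], axis=1)
```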

thomasjpfan · 2022-10-09T21:20:18Z

It's a bit more involved because RegressorChain builds a new X_aug from X that also includes the columns of Y:

scikit-learn/sklearn/multioutput.py

Lines 607 to 613 in 4ee3fdd

    
    if self.cv is None:
        Y_pred_chain = Y[:, self.order_]
        if sp.issparse(X):
            X_aug = sp.hstack((X, Y_pred_chain), format="lil")
            X_aug = X_aug.tocsr()
        else:
            X_aug = np.hstack((X, Y_pred_chain))

Which it will later slice:

scikit-learn/sklearn/multioutput.py

Line 633 in 4ee3fdd

estimator.fit(X_aug[:, : (X.shape[1] + chain_idx)], y, **fit_params)

To pass a DataFrame to estimator.fit, RegressorChain.fit would need to:

- Explicitly support hstacking DataFrames when generating X_aug.
- Give good column names to the Y columns in X_aug.

With that in mind, I'm labeling this as hard.
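The two requirements above could look roughly like the following sketch. It is not the scikit-learn implementation: the helper augment_with_targets and the "chain_target_" column prefix are hypothetical names introduced for illustration, and sparse input and cross-validated predictions are not handled.

```python
import numpy as np
import pandas as pd


def augment_with_targets(X, Y, order, prefix="chain_target_"):
    """Hypothetical helper: append the Y columns to X in chain order."""
    Y_chain = pd.DataFrame(
        np.asarray(Y)[:, order],
        index=X.index,  # align rows with X so concat does not introduce NaNs
        columns=[f"{prefix}{i}" for i in order],
    )
    return pd.concat([X, Y_chain], axis=1)


# Hypothetical toy data.
X = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
Y = np.array([[10.0, 20.0], [30.0, 40.0]])
X_aug = augment_with_targets(X, Y, order=[1, 0])

# The estimator at position chain_idx sees the original features
# plus the first chain_idx target columns.
chain_idx = 1
X_step = X_aug.iloc[:, : X.shape[1] + chain_idx]
```

Positional slicing with iloc mirrors the existing X_aug[:, : (X.shape[1] + chain_idx)] slice while keeping the column names, so a ColumnTransformer in the base estimator could still select the original features by name.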

lokijota added the Bug: triage label Jul 18, 2021

glemaitre added Bug and removed Bug: triage labels Dec 17, 2021

cmarmo added the module:preprocessing label Sep 13, 2022

adrinjalali added the Easy and help wanted labels Oct 4, 2022

adrinjalali mentioned this issue Oct 4, 2022

RegressionChain does not accept nans, when base_estimator does #23109

Closed

thomasjpfan added the Hard label and removed the Easy label Oct 9, 2022
