
[BUG]: Different behaviour between PyCaret and Sklearn #4096

Open
@sharouat

Description


pycaret version checks

Issue Description

Hi everyone,

I am trying to reproduce the following PyCaret code as simply as possible with sklearn, in order to make sure I understand exactly what PyCaret does. This should be straightforward: the dataset is extremely simple (House Prices: Advanced Regression Techniques, from Kaggle). So I have written the following two versions, one using PyCaret and the other using sklearn.

Reproducible Example

**Here is a very simple version with PyCaret (execution output attached):**

import pandas as pd  # needed for pd.DataFrame below
import pycaret
from pycaret.regression import *

setup(data = train, target = 'SalePrice', preprocess=True, normalize=True, session_id=3141)

print(get_config('pipeline'))

best = compare_models(sort='RMSE', include = ['lightgbm'], n_select = 1)

predict_model(best);

# finalize_model refits a given model on the entire train dataset.
final_lightgbm = finalize_model(best)

pyCaret_predicted_test_y = predict_model(final_lightgbm, data=test)
pyCaret_predicted_test_y = pd.DataFrame(pyCaret_predicted_test_y['prediction_label'])

**Here is the sklearn version, which should be exactly equivalent and produce exactly the same result:**

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from lightgbm import LGBMRegressor
import pandas as pd
import numpy as np
from scipy.stats import zscore

# Load the data
train = pd.read_csv(r'C:\Perso\Data Science\house-prices-advanced-regression-techniques\train.csv',sep=',', encoding='latin-1')

# Data preprocessing
object_feature_mask = train.dtypes == 'object'
object_cols = train.columns[object_feature_mask].tolist()

# Convert object columns to the pandas 'category' dtype
# (note: this does not fill missing values; they are left as-is)
for col in object_cols:
    train[col] = train[col].astype('category')

# Data type conversion
int_cols = train.select_dtypes(include=['int']).columns.tolist()
for col in int_cols:
    train[col] = train[col].astype(np.int32)

numeric_cols = ['OverallQual', 'OverallCond', 'BsmtFullBath', 'BsmtHalfBath', 
                'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 
                'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'MoSold']
for col in numeric_cols:
    train[col] = train[col].astype(np.int8)

float_cols = train.select_dtypes(include=['float']).columns.tolist()
for col in float_cols:
    train[col] = train[col].astype(np.float32)

# Separate the features and the target
X_train = train.drop(['SalePrice'], axis=1)
y_train = train['SalePrice']

numeric_columns = train.select_dtypes(include=['number']).columns.tolist()


# Normalize the numeric features with zscore
X_train[X_train.columns] = X_train.apply(lambda x: zscore(x) if x.name in numeric_columns else x)

# Split into training and test sets
X_train_train, X_test_train, y_train_train, y_test_train = train_test_split(X_train, y_train, train_size=0.7, random_state=3141)

# Initialize the model
lgbm = LGBMRegressor()

# Create a scorer for RMSE
rmse_scorer = make_scorer(mean_squared_error, squared=False)

# Evaluate the model with cross_validate()
cv_results = cross_validate(
    lgbm, 
    X_train_train, 
    y_train_train, 
    scoring={'RMSE': rmse_scorer, 'R2': 'r2'}, 
    cv=10, 
    return_train_score=True,
    n_jobs=-1
)

# Display the results
print(f"RMSE TRAIN: {np.mean(cv_results['test_RMSE']):.4f}")
print(f"R2 TRAIN: {np.mean(cv_results['test_R2']):.4f}")
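
One detail of the normalization step in the sklearn version may be worth checking. As far as I know, `scipy.stats.zscore` defaults to `nan_policy='propagate'`, so a numeric column containing any missing value (the Kaggle train set has some, e.g. `LotFrontage`) would come back entirely NaN. The sketch below imitates that behaviour in plain Python on made-up numbers. `zscore_propagate` is a hypothetical stand-in written for illustration, not the scipy function, and whether this accounts for the metric gap is not established here.

```python
import math


def zscore_propagate(values):
    """Plain-Python stand-in for a z-score that does not skip NaN.

    Assumption: this mirrors the default 'propagate' NaN handling of
    scipy.stats.zscore; it is not the scipy implementation.
    """
    n = len(values)
    mu = sum(values) / n  # becomes NaN as soon as one value is NaN
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]


# Made-up column with one missing value (not taken from the Kaggle data)
lot_frontage = [65.0, 80.0, math.nan, 60.0]

scaled = zscore_propagate(lot_frontage)
all_nan = all(math.isnan(v) for v in scaled)

print(scaled)
print("Whole column is NaN:", all_nan)
```

If this is what happens on the real data, the affected columns would carry no information for the model in the sklearn version, which is one thing to rule out before comparing the two runs.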




![Sklearn](https://github.com/user-attachments/assets/dc0bab30-c5bc-47bc-9aab-7557c6bbd46b)
![PyCaret](https://github.com/user-attachments/assets/c0632ac1-9c4a-496d-8cf4-c115e8913c89)

Expected Behavior

I expect both versions to produce exactly the same result, but for some reason this is not the case.

Actual Results

PyCaret produces an RMSE of 29891 and an R2 of 0.85, whereas sklearn produces an RMSE of 29521 and an R2 of 0.85 (see the attached screenshots).
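
For scale, the gap between the two RMSE values can be expressed in relative terms. This is only a quick sketch using the two numbers reported above; the variable names are mine.

```python
# RMSE values as reported in this issue
rmse_pycaret = 29891
rmse_sklearn = 29521

abs_gap = rmse_pycaret - rmse_sklearn
rel_gap = abs_gap / rmse_sklearn  # relative to the sklearn result

print(f"Absolute gap: {abs_gap}")
print(f"Relative gap: {rel_gap:.2%}")
```

The difference is a little over 1% of the RMSE, which fits with R2 being identical at two decimals in both runs.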

Installed Versions

'3.3.2'


Labels

bug (Something isn't working)
