Description
pycaret version checks
- I have checked that this issue has not already been reported here.
- I have confirmed this bug exists on the latest version of pycaret.
- I have confirmed this bug exists on the master branch of pycaret (pip install -U git+https://github.com/pycaret/pycaret.git@master).
Issue Description
Hi everyone,
I am trying to reproduce the following PyCaret code as simply as possible with sklearn, in order to make sure I understand exactly what PyCaret does. This should be straightforward: the dataset is very simple (House Prices: Advanced Regression Techniques, from Kaggle). I have written the following two versions, one using PyCaret and the other using sklearn.
Reproducible Example
**Here is a very simple script using PyCaret (execution output in attachment):**
import pandas as pd
import pycaret
from pycaret.regression import *

# `train` and `test` are assumed to be DataFrames loaded beforehand (loading not shown here).
setup(data=train, target='SalePrice', preprocess=True, normalize=True, session_id=3141)
print(get_config('pipeline'))  # print the preprocessing pipeline built by setup()
# Cross-validate LightGBM only, ranked by RMSE
best = compare_models(sort='RMSE', include=['lightgbm'], n_select=1)
predict_model(best)
# finalize_model refits the given model on the entire train dataset.
final_lightgbm = finalize_model(best)
pyCaret_predicted_test_y = predict_model(final_lightgbm, data=test)
pyCaret_predicted_test_y = pd.DataFrame(pyCaret_predicted_test_y['prediction_label'])
**Here is a script using sklearn, which should be exactly equivalent and produce exactly the same result:**
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from lightgbm import LGBMRegressor
import pandas as pd
import numpy as np
from scipy.stats import zscore
# Load the data
train = pd.read_csv(r'C:\Perso\Data Science\house-prices-advanced-regression-techniques\train.csv',sep=',', encoding='latin-1')
# Preprocess the data
object_feature_mask = train.dtypes == 'object'
object_cols = train.columns[object_feature_mask].tolist()
# Convert object columns to the pandas 'category' dtype (missing values are not imputed here)
for col in object_cols:
    train[col] = train[col].astype('category')
# Convert data types
int_cols = train.select_dtypes(include=['int']).columns.tolist()
for col in int_cols:
    train[col] = train[col].astype(np.int32)
numeric_cols = ['OverallQual', 'OverallCond', 'BsmtFullBath', 'BsmtHalfBath',
                'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
                'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'MoSold']
for col in numeric_cols:
    train[col] = train[col].astype(np.int8)
float_cols = train.select_dtypes(include=['float']).columns.tolist()
for col in float_cols:
    train[col] = train[col].astype(np.float32)
# Separate the features from the target
X_train = train.drop(['SalePrice'], axis=1)
y_train = train['SalePrice']
numeric_columns = train.select_dtypes(include=['number']).columns.tolist()
# Normalize the numeric features with zscore (applied to the full dataset, before the split)
X_train[X_train.columns] = X_train.apply(lambda x: zscore(x) if x.name in numeric_columns else x)
# Split into training and test sets
X_train_train, X_test_train, y_train_train, y_test_train = train_test_split(
    X_train, y_train, train_size=0.7, random_state=3141
)
# Initialize the model
lgbm = LGBMRegressor()
# Create a scorer for RMSE
rmse_scorer = make_scorer(mean_squared_error, squared=False)
# Evaluate the model with cross_validate()
cv_results = cross_validate(
    lgbm,
    X_train_train,
    y_train_train,
    scoring={'RMSE': rmse_scorer, 'R2': 'r2'},
    cv=10,
    return_train_score=True,
    n_jobs=-1
)
# Display the results (means over the 10 cross-validation folds)
print(f"RMSE TRAIN: {np.mean(cv_results['test_RMSE']):.4f}")
print(f"R2 TRAIN: {np.mean(cv_results['test_R2']):.4f}")
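
One detail of the sklearn script above may be worth checking before comparing scores. `scipy.stats.zscore` defaults to `nan_policy='propagate'`, so a numeric column that contains even one missing value comes back entirely NaN. The sketch below uses a made-up four-value column (not the Kaggle data) to show the effect and the `nan_policy='omit'` alternative. Whether this contributes to the gap reported here is an assumption, not something confirmed in this issue.

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical column with one missing value (illustrative, not taken from train.csv)
col = np.array([1.0, 2.0, 3.0, np.nan])

propagated = zscore(col)                   # default nan_policy='propagate'
omitted = zscore(col, nan_policy='omit')   # NaN ignored when computing mean and std

print(propagated)  # every entry is NaN
print(omitted)     # finite z-scores; NaN only where the input was NaN
```

Counting `X_train.isna().all()` after the normalization step would show whether any column was wiped out this way.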


Expected Behavior
I expect both versions to produce exactly the same result, but for some reason they do not.
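
For the two versions to match, the preprocessing would also have to match. PyCaret's `setup(preprocess=True, normalize=True)` builds a preprocessing pipeline (the one printed by `get_config('pipeline')`), whereas the sklearn script normalizes the whole dataset before splitting and passes pandas `category` columns straight to LightGBM. The following is a minimal sketch of how preprocessing can be refit inside every cross-validation fold with a sklearn `Pipeline`. The specific steps (mean imputation, standard scaling, one-hot encoding), the helper name `build_pipeline`, the synthetic data, and the `Ridge` stand-in model are all assumptions made for illustration; they are not a confirmed copy of what PyCaret does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(model):
    """Hypothetical helper: preprocessing + model, refit together in each CV fold."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),  # assumed imputation strategy
        ("scale", StandardScaler()),                 # z-score fit on the fold's training rows only
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, make_column_selector(dtype_include="number")),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_include="category")),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


# Synthetic stand-in data (not the Kaggle file)
rng = np.random.default_rng(3141)
n = 200
X = pd.DataFrame({
    "area": rng.normal(1500, 300, n),
    "quality": rng.integers(1, 10, n).astype(float),
    "zone": pd.Categorical(rng.choice(["A", "B", "C"], n)),
})
X.loc[X.index % 17 == 0, "area"] = np.nan  # inject a few missing values
y = 100 * X["area"].fillna(1500) + 5000 * X["quality"] + rng.normal(0, 1000, n)

cv_results = cross_validate(
    build_pipeline(Ridge()),  # Ridge is a stand-in; LGBMRegressor() could be passed instead
    X, y,
    scoring={"RMSE": "neg_root_mean_squared_error", "R2": "r2"},
    cv=10,
)
# The RMSE scorer is negated by sklearn, so flip the sign for display
print(f"CV RMSE: {-cv_results['test_RMSE'].mean():.1f}")
print(f"CV R2:   {cv_results['test_R2'].mean():.4f}")
```

Because the scaler and imputer live inside the pipeline, each fold's statistics come only from that fold's training rows, which is not the case when `zscore` is applied to the full DataFrame up front.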
Actual Results
PyCaret produces an RMSE of 29891 and an R2 of 0.85, whereas sklearn produces an RMSE of 29521 and an R2 of 0.85 (see picture in attachment).
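
To put the size of the discrepancy in perspective, the short calculation below uses only the two RMSE values quoted above (the variable names are illustrative):

```python
pycaret_rmse = 29891
sklearn_rmse = 29521

abs_gap = pycaret_rmse - sklearn_rmse
rel_gap = abs_gap / sklearn_rmse

print(f"Absolute gap: {abs_gap}")      # 370
print(f"Relative gap: {rel_gap:.2%}")  # 1.25%
```

The two runs agree on R2 to two decimals and differ by roughly 1% in RMSE, so the gap is small but not zero.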
Installed Versions
'3.3.2'