Description
Describe the bug
I fit two random forests, each with a single tree and otherwise identical parameters. For the second one, I disable row subsampling by setting bootstrap=False and instead fit it on the in-bag data generated with _generate_sample_indices.
I believe that the only stochastic elements of a random forest are the row and column subsampling. I disable the former by setting bootstrap=False and the latter by choosing max_features = p, where p is the number of features.
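To make the row subsampling step concrete, here is a minimal NumPy-only sketch of a bootstrap draw and the per-row counts it implies. It rests on an assumption not verified in this report: that with bootstrap=True the forest applies the draw as per-row sample weights (counts from np.bincount) rather than as physically duplicated rows. If that holds, a tree fitted on duplicated rows sees n_samples rows, while the weighted tree sees only the distinct in-bag rows, which could matter for count-based settings such as min_samples_leaf. The seed 0 and the row count 506 (the size of the Boston data) are illustrative choices.

```python
import numpy as np

# Illustrative values: 506 rows (as in the Boston data) and an arbitrary seed.
n_samples = 506
rng = np.random.RandomState(0)

# Bootstrap draw: n_samples row indices sampled with replacement.
indices = rng.randint(0, n_samples, n_samples)

# Multiplicity of each row in the draw; rows never drawn get a count of 0.
# Assumption: the forest uses counts like these as sample weights.
sample_counts = np.bincount(indices, minlength=n_samples)

# Number of distinct rows that are in-bag (non-zero count).
n_distinct = np.count_nonzero(sample_counts)

print("rows after physical duplication:", len(indices))
print("distinct in-bag rows:", n_distinct)
print("sum of counts:", sample_counts.sum())
```

The sum of the counts equals the number of duplicated rows, but the number of distinct rows is smaller, so the two representations carry the same total weight while differing in how many rows they contain.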
Steps/Code to Reproduce
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble._forest import _generate_sample_indices
import matplotlib.pyplot as plt

boston = load_boston()
xtrain, ytrain = boston.data, boston.target
n_samples, p = xtrain.shape

# Forest 1: a single tree with the default bootstrap=True, all p features per split.
rf = RandomForestRegressor(n_estimators=1, min_samples_leaf=10, max_features=p)
rf.fit(xtrain, ytrain)

# Recover the in-bag row indices of that tree from its random state.
tree = rf.estimators_[0]
sampled_indices = _generate_sample_indices(tree.random_state, n_samples, n_samples)

# Forest 2: same parameters, but bootstrap=False and fitted on the in-bag rows directly.
rf0 = RandomForestRegressor(n_estimators=1, min_samples_leaf=10, max_features=p,
                            bootstrap=False)
rf0.fit(xtrain[sampled_indices, :], ytrain[sampled_indices])

# Compare the predictions of both forests on the full training data.
pred = rf.predict(xtrain)
pred0 = rf0.predict(xtrain)
plt.scatter(pred, pred0)
plt.show()
Expected Results
I would expect the two forests (each consisting of a single tree) to be identical.
Actual Results
The predictions are quite different, as the resulting scatter plot shows.
Versions
System:
python: 3.8.3 (default, Jul 2 2020, 11:26:31) [Clang 10.0.0 ]
executable: /opt/anaconda3/bin/python
machine: macOS-10.16-x86_64-i386-64bit
Python dependencies:
pip: 20.1.1
setuptools: 49.2.0.post20200714
sklearn: 0.23.1
numpy: 1.19.5
scipy: 1.5.0
Cython: 0.29.21
pandas: 1.0.5
matplotlib: 3.2.2
joblib: 0.16.0
threadpoolctl: 2.1.0
Built with OpenMP: True