Is _generate_sample_indices working correctly? #21002

@markusloecher

Describe the bug

I fit two random forests (with just one tree each) with identical parameters. For the second one I disable the row subsampling by setting bootstrap=False and instead pass the in-bag rows generated with _generate_sample_indices.

I believe that the only stochastic elements of a random forest are the row and column subsampling. I disable the former by setting bootstrap=False and the latter by choosing max_features = p, where p is the number of features.
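A possible factor worth checking here, offered as an assumption and not a confirmed diagnosis: a bootstrap sample can be represented either as physically duplicated rows or as per-row counts used as sample weights, and a forest implementation may use the second form internally. The sketch below uses only the standard library and a hypothetical `bootstrap_indices` helper (not scikit-learn's sampler) to show that the two forms agree on weighted statistics but not on the number of rows, which could matter for count-based rules such as min_samples_leaf.

```python
import random
from collections import Counter


def bootstrap_indices(seed, n_samples):
    """Draw n_samples row indices with replacement.

    Hypothetical stand-in for a forest's sampler; not scikit-learn code.
    """
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]


y = [float(i) for i in range(20)]  # toy targets, made up for illustration
idx = bootstrap_indices(seed=0, n_samples=len(y))

# Form 1: materialise the duplicated rows.
y_dup = [y[i] for i in idx]
mean_dup = sum(y_dup) / len(y_dup)

# Form 2: keep each distinct row once and use its multiplicity as a weight.
counts = Counter(idx)
mean_weighted = sum(c * y[i] for i, c in counts.items()) / sum(counts.values())

# The weighted mean matches the mean over duplicated rows, but the number
# of rows a count-based rule would see differs between the two forms.
n_rows_dup = len(y_dup)
n_rows_weighted = len(counts)
```

If the library does use the weighted form, a leaf holding 10 duplicated rows in the second forest might correspond to fewer than 10 distinct rows in the first, so the two trees would not necessarily choose the same splits.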

Steps/Code to Reproduce

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble._forest import _generate_sample_indices  # private helper
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt

boston = load_boston()
xtrain, ytrain = boston.data, boston.target
n_samples, p = xtrain.shape  # p = number of features

# Forest 1: a single tree with the default bootstrap=True.
rf = RandomForestRegressor(n_estimators=1, min_samples_leaf=10, max_features=p)
rf.fit(xtrain, ytrain)

# Regenerate the in-bag row indices from the fitted tree's seed.
tree = rf.estimators_[0]
sampled_indices = _generate_sample_indices(tree.random_state, n_samples, n_samples)

# Forest 2: no bootstrap, fitted on the explicitly resampled rows.
rf0 = RandomForestRegressor(n_estimators=1, min_samples_leaf=10, max_features=p, bootstrap=False)
rf0.fit(xtrain[sampled_indices, :], ytrain[sampled_indices])

# Use separate names so p (the feature count) is not overwritten.
pred = rf.predict(xtrain)
pred0 = rf0.predict(xtrain)

plt.scatter(pred, pred0)

Expected Results

I would expect the two forests (each consisting of a single tree) to produce identical predictions.

Actual Results

The predictions are quite different, as the scatter plot shows.
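To put a number on the disagreement instead of reading it off a plot, the two prediction vectors could be compared element-wise. This is a minimal sketch using plain Python lists; the helper names `max_abs_diff` and `all_close` are hypothetical, and the demo values are made-up placeholders, not output from the forests above.

```python
def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def all_close(a, b, tol=1e-9):
    """True if every pair of elements differs by at most tol."""
    return max_abs_diff(a, b) <= tol


# Placeholder values for illustration only.
p_demo = [21.5, 30.1, 18.0]
p0_demo = [21.5, 29.4, 18.0]

diff = max_abs_diff(p_demo, p0_demo)
same = all_close(p_demo, p0_demo)
```

Applied to the real prediction arrays (for example after converting them with `.tolist()`), `same` being False together with a large `diff` would confirm that the two trees differ beyond floating-point noise.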

Versions

System:
python: 3.8.3 (default, Jul 2 2020, 11:26:31) [Clang 10.0.0 ]
executable: /opt/anaconda3/bin/python
machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
pip: 20.1.1
setuptools: 49.2.0.post20200714
sklearn: 0.23.1
numpy: 1.19.5
scipy: 1.5.0
Cython: 0.29.21
pandas: 1.0.5
matplotlib: 3.2.2
joblib: 0.16.0
threadpoolctl: 2.1.0

Built with OpenMP: True
