-
-
Notifications
You must be signed in to change notification settings - Fork 26.5k
Description
Summary
There exists other criterions, where additional arguments would need to be passed in. The current Criterion class is not actually a generic Criterion class since it explicitly assumes that the Criterion is supervised-learning based and does NOT require the data X itself. The init(...) function explicitly passes in y.
I am trying to implement and subclass Criterion to implement Unsupervised Random Forests (https://arxiv.org/abs/1907.02844), which use the data within the Criterion. I would like to subclass the Criterion class for my own UnsupervisedCriterion. However, the init function assumes access to a y label, but everything else about it is fine and correct.
Instead I would like to define my own init function, but keep the rest of the Criterion API. This way, I can define an unsupervised criterion with the init function:
cdef int init(self, const DTYPE_T[:, ::1] X, DOUBLE_t* sample_weight,
double weighted_n_samples, SIZE_t* samples, SIZE_t start,
SIZE_t end) nogil except -1and I could then define the rest of the functions, while making sure I am in strict compliance with sklearn's Tree submodule.
Proposed Solution
Add a class called BaseCriterion and migrate all functions/properties of the current Criterion class, EXCEPT for init function to BaseCriterion.
This is completely backwards-compatable and ensures 3rd party API can leverage the existing API contract for Criterions using the BaseCriterion class.
Alternatives
Alternatively, I can do what scikit-survival does and pass in this additional data to __cinit__, but this is weird because you basically need to pass in data to a Criterion class before it's used with Tree/Splitter. Rn init is called within the Tree/Splitter, which ensures the y that Tree/Splitter have access to is the same as Criterion. If you are forced to pass things through __cinit__ it is quite possible to introduce runtime bugs because your data changes after you have already instantiated a criterion class.
Alternatives2
Another possibility that I'm not 100% sure on is that Criterion.init(...) is passed y and just resets it to the passed in y data. However, it never modifies y. Instead, it should only init the 'y' variable once.... Perhaps something like Criterion.init_data(y), which separates initialization of the data compared to initialization of different variables.