I'm seeing very slow unpickling times for multilabel random forests. I'm opening this issue as a follow-up to my discussion with @ogrisel on Stack Overflow.
trainX is my training feature matrix. It is a sparse matrix with shape (926, 1236). validX is the feature matrix I want to categorize; by coincidence, it has the same shape as the training matrix. Y is a numpy array of Python lists (since this is a multilabel problem). Each sample has at most 4 labels, and most have only one. There are 52 unique labels.
I can include the rest, but the bulk of the time is spent in `__init__.py:93(__RandomState_ctor)`.
Note that this is a slightly smaller example than the one I discussed on Stack Overflow (for simplicity); it still illustrates the problem.
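The reproduction script itself isn't included above, so here is a minimal sketch of a setup matching the description. The `OneVsRestClassifier` wrapper and `n_estimators=15` are assumptions inferred from the "52*15 trees" figure mentioned in the comments below, and `trainX`/`Y` are synthetic stand-ins for the data described:

```python
import cProfile
import pickle

import numpy as np
import scipy.sparse as sp
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

rng = np.random.RandomState(0)

# Stand-ins for the data described above: a sparse (926, 1236) feature
# matrix and 1-4 labels per sample drawn from 52 unique labels.
trainX = sp.random(926, 1236, density=0.01, format="csr", random_state=rng)
Y = [rng.choice(52, size=rng.randint(1, 5), replace=False) for _ in range(926)]
Y = MultiLabelBinarizer().fit_transform(Y)

# One binary forest per label, 15 trees each -- assumptions matching the
# "52*15 trees" figure from the discussion.
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=15, random_state=0))
clf.fit(trainX, Y)

blob = pickle.dumps(clf)

# Profile the unpickling step; with the scikit-learn version used here,
# __RandomState_ctor dominated the cumulative time.
cProfile.run("pickle.loads(blob)", sort="cumulative")
```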
Can you include the full trace, please? To me this looks more like an issue with pickle itself, but I may be wrong. (Note that the object you are serializing is quite complex anyway, since it includes 52*15 trees.)
`__RandomState_ctor` is the culprit. We should probably see whether things can be sped up by deleting the random_state attribute of the trees before pickling (if it is the same instance as the ensemble's) and then restoring it at unpickling time by making all the trees share the ensemble model's instance.
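A numpy RandomState pickles to a sizeable state tuple (the full Mersenne Twister state), so serializing one per tree multiplies both the payload and the reconstruction calls. As a rough sketch of the idea above (not the actual patch that landed in master), one could strip the shared RandomState before pickling and re-attach it after unpickling; `dump_forest` and `load_forest` are hypothetical helpers:

```python
import pickle

# Hypothetical helpers sketching the idea described above -- not the
# actual fix that was merged into master.

def dump_forest(forest):
    """Pickle a fitted ensemble, dropping each tree's random_state when
    it is the very same RandomState instance as the ensemble's."""
    shared = forest.random_state
    stripped = []
    for tree in forest.estimators_:
        if tree.random_state is shared:
            tree.random_state = None  # placeholder; restored below
            stripped.append(tree)
    try:
        return pickle.dumps(forest)
    finally:
        for tree in stripped:  # leave the live object unchanged
            tree.random_state = shared

def load_forest(blob):
    """Unpickle an ensemble and make the trees share its RandomState."""
    forest = pickle.loads(blob)
    for tree in forest.estimators_:
        if tree.random_state is None:
            tree.random_state = forest.random_state
    return forest
```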
This should now be fixed in master. @AWinterman, can you please check that this indeed solves your performance problem? If not, please feel free to open a new issue with the new profiler output.