Crossvalidation #6
|
Thanks @wleftwich. This is a great post to read, along with the posts it links to. It looks like a key to reducing the variance will be repeated cross-validation runs. This will definitely mean that, for each run, we will need to permute the data. |
|
Thanks Forest. Good link, now bookmarked. It looks like 5 x 2 CV is a good place to start. Is the learner sensitive to the ratio of match to distinct in the training set? My project involves a fairly clean contacts table, so the generated training file is usually 25 match / 75 distinct. |
scorePredictions() - if divide by zero, return 0 not nan.
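A minimal sketch of the zero-division guard this commit describes. The function name, its arguments, and the choice of an F1-style score are assumptions for illustration; the actual `scorePredictions()` in rlr may compute its score differently.

```python
def score_predictions(true_labels, predicted_labels):
    """Hypothetical F1-style score that returns 0.0 instead of nan.

    Labels are assumed to be 1 for "match" and 0 for "distinct".
    """
    true_pos = sum(1 for t, p in zip(true_labels, predicted_labels)
                   if t == 1 and p == 1)
    actual_pos = sum(1 for t in true_labels if t == 1)
    predicted_pos = sum(1 for p in predicted_labels if p == 1)

    # Any of these being zero would make precision, recall, or their
    # harmonic mean undefined, so report the worst score instead.
    if actual_pos == 0 or predicted_pos == 0 or true_pos == 0:
        return 0.0

    recall = true_pos / actual_pos
    precision = true_pos / predicted_pos
    return 2 * precision * recall / (precision + recall)
```

Returning 0 rather than nan matters when scores are compared across candidate alpha values, because comparisons against nan are always false and can silently skew which candidate is selected.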
|
I added repetitions (with shuffling) to k-fold cross-validation, defaulting to 5x3. Unfortunately the alpha values returned by gridSearch() are still quite noisy. I tested with csv_example; when the training file contains 20 match and 30 distinct, gridSearch() results are just as variable with reps=5 as with reps=1. |
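For readers following along, repeated k-fold with shuffling can be sketched roughly as below. This is not the code in the PR: the function name and parameters are hypothetical, and reading "5x3" as five repetitions of 3-fold is an assumption.

```python
import random


def repeated_k_fold(n_examples, k=3, reps=5, seed=None):
    """Yield (train_indices, test_indices) pairs for reps * k splits.

    The example indices are reshuffled before each repetition, so every
    repetition partitions the data into k folds differently.
    """
    rng = random.Random(seed)
    indices = list(range(n_examples))
    for _ in range(reps):
        rng.shuffle(indices)
        # Slicing with a stride of k copies the indices into k folds
        # whose sizes differ by at most one.
        folds = [indices[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
            yield train, test
```

A grid search would then score each candidate alpha on all `reps * k` splits and average the results. With 50 examples and k=3, each test fold holds only 16 or 17 pairs, which may be part of why repetition alone does not steady the chosen alpha.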
|
Sorry, what's the status of this, @wleftwich? Do you think this still improves the library? |
|
Hi Forest -
|
Forest -
Following up on our conversation at https://groups.google.com/forum/#!topic/open-source-deduplication/2ZZlVjNtp6I, I started working with rlr/crossvalidation.py.
This PR makes two small changes to gridSearch():
If it was your intention for randomize to apply to whether each k-fold should be shuffled, let me know and I will fix that.
I tried using sklearn.cross_validation.StratifiedKFold() but did not see much effect with a training set of 36 match and 72 distinct, even with k=10.
I'm a beginner at machine learning, but from what I understand, noisy cross-validation results are a sign of too few training examples. Is that more or less right?
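To make the stratification idea concrete, here is a plain-Python sketch of what a stratified split does. It is not sklearn's `StratifiedKFold` implementation; the function name and parameters are hypothetical, and labels are assumed to be 1 for match and 0 for distinct.

```python
import random


def stratified_folds(labels, k=10, seed=None):
    """Split example indices into k folds with similar class ratios.

    Each class is shuffled on its own and dealt round-robin across the
    folds, so per-fold class counts differ by at most one.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]

    # Group example indices by their label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Deal each class's shuffled indices out to the folds in turn.
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


# The training set described above: 36 match and 72 distinct.
labels = [1] * 36 + [0] * 72
folds = stratified_folds(labels, k=10, seed=0)
matches_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

With 36 matches and k=10, every test fold holds only 3 or 4 matches, so a single misclassified match moves that fold's recall by 25 to 33 percentage points. Stratification evens out the class ratio but cannot remove that granularity, which is consistent with the guess that the noise comes from having too few training examples.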
Let me know if my efforts here are helpful at all and, if so, whether I am following the right protocol for contributing to the project.
Thanks again for putting this library out there. It's a big help with my current project and I think I'm learning too.
Wade Leftwich
Ithaca, NY