
Conversation

@wleftwich

Forest -

Following up on our conversation at https://groups.google.com/forum/#!topic/open-source-deduplication/2ZZlVjNtp6I I started working with rlr/crossvalidation.py.

This PR makes two small changes to gridSearch():

  • default to alpha = 0.01 if len(labels) < k
  • only permute examples and labels when the randomize argument is True

If you intended randomize to control whether each k-fold split is shuffled, let me know and I will fix that.
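
For illustration, here is a minimal sketch of what those two changes might look like; the signature, the learner factory, and the scoring loop are assumptions for the sake of a runnable example, not rlr's actual API:

```python
import numpy as np

def gridSearch(examples, labels, learner, alphas=(0.001, 0.01, 0.1, 1.0),
               k=5, randomize=True):
    """Pick the alpha whose mean k-fold score is highest (sketch only)."""
    examples, labels = np.asarray(examples), np.asarray(labels)

    # change 1: too few labels to form k folds -> fall back to a default
    if len(labels) < k:
        return 0.01

    # change 2: shuffle examples and labels together only when asked
    if randomize:
        order = np.random.permutation(len(labels))
        examples, labels = examples[order], labels[order]

    folds = np.array_split(np.arange(len(labels)), k)
    best_alpha, best_score = None, -np.inf
    for alpha in alphas:
        scores = []
        for fold in folds:
            train = np.setdiff1d(np.arange(len(labels)), fold)
            model = learner(alpha).fit(examples[train], labels[train])
            scores.append(model.score(examples[fold], labels[fold]))
        if np.mean(scores) > best_score:
            best_alpha, best_score = alpha, np.mean(scores)
    return best_alpha
```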

I tried using sklearn.cross_validation.StratifiedKFold() but did not see much effect with a training set of 36 match and 72 distinct, even with k=10.
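
For reference, stratified folds look roughly like this (in current scikit-learn the class has moved from sklearn.cross_validation to sklearn.model_selection; the feature matrix here is a stand-in):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(108, 4)            # stand-in features
y = np.array([1] * 36 + [0] * 72)     # 36 match / 72 distinct, as above

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # each fold keeps approximately the 1:2 match/distinct ratio
    print(y[test_idx].sum(), len(test_idx))
```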

I'm a beginner at machine learning, but from what I understand, noisy cross-validation results are a sign of too few training examples. Is that more or less right?

Let me know if my efforts here are helpful at all, and if so, whether I'm following the right protocol for contributing to the project.

Thanks again for putting this library out there. It's a big help with my current project and I think I'm learning too.

Wade Leftwich
Ithaca, NY

@fgregg
Copy link
Contributor

fgregg commented Mar 29, 2016

Thanks @wleftwich.

This is a great post to read, along with the posts it links to.

http://stats.stackexchange.com/questions/103459/how-do-i-know-which-method-of-cross-validation-is-best?lq=1

It looks like a key to reducing the variance will be to do repeated cross-validation runs. This will definitely mean that for each run we will need to permute the data.
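
A rough sketch of what repeated, permuted k-fold might look like (the evaluate callable is hypothetical: it trains on the first pair of arrays and returns a score on the second):

```python
import numpy as np

def repeated_cv_scores(examples, labels, evaluate, k=5, reps=5, seed=0):
    """Run k-fold CV `reps` times, permuting the data before each run,
    and return all fold scores so their spread can be inspected."""
    rng = np.random.RandomState(seed)
    n = len(labels)
    scores = []
    for _ in range(reps):
        order = rng.permutation(n)              # fresh shuffle per run
        X, y = examples[order], labels[order]
        for fold in np.array_split(np.arange(n), k):
            train = np.setdiff1d(np.arange(n), fold)
            scores.append(evaluate(X[train], y[train], X[fold], y[fold]))
    return scores                               # reps * k fold scores
```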

@wleftwich
Author

Thanks Forest. Good link, now bookmarked. It looks like 5 x 2 CV is a good place to start.

Is the learner sensitive to the ratio of match to distinct in the training set? My project involves a fairly clean contacts table, so the generated training file is usually 25 match / 75 distinct.
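
For what it's worth, a 5x2 scheme with stratified folds would also speak to the ratio question, since stratification preserves the match/distinct proportion in each fold. A sketch using current scikit-learn (the data here is a stand-in):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

X = np.random.rand(100, 4)
y = np.array([1] * 25 + [0] * 75)   # 25 match / 75 distinct, as above

# 5 repetitions of 2-fold CV: the classic 5x2 scheme
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # stratification keeps roughly the 1:3 match/distinct ratio per fold
    print(round(y[test_idx].mean(), 2))         # ~0.25 for all 10 folds
```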

scorePredictions() - on divide by zero, return 0 instead of nan.
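
A guard along these lines (sketch only; the real scorePredictions in rlr may compute a different metric):

```python
import numpy as np

def scorePredictions(predictions, labels):
    """Precision-style score with a divide-by-zero guard (illustrative)."""
    true_positives = np.sum((predictions == 1) & (labels == 1))
    predicted_positives = np.sum(predictions == 1)
    if predicted_positives == 0:
        return 0.0          # return 0 instead of nan on divide by zero
    return true_positives / predicted_positives
```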
@wleftwich
Author

I added repetitions (with shuffling) to k-fold cross-validation, defaulting to 5x3.

Unfortunately the alpha values returned by gridSearch() are still quite noisy. I tested with csv_example; when the training file contains 20 match and 30 distinct, gridSearch() results are just as variable with reps=5 as with reps=1.
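
One way to see the noise (reusing the gridSearch sketch from the first comment, with synthetic data and a stand-in learner; alpha is mapped to LogisticRegression's inverse regularization strength):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(1, 1, (20, 4)),     # 20 "match" examples
               rng.normal(0, 1, (30, 4))])    # 30 "distinct" examples
y = np.array([1] * 20 + [0] * 30)

learner = lambda alpha: LogisticRegression(C=1.0 / alpha)

# rerun the gridSearch sketch ten times; a wide spread of selected
# alphas is exactly the variability described in this comment
picked = [gridSearch(X, y, learner, k=3) for _ in range(10)]
print(sorted(picked))
```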

@fgregg
Copy link
Contributor

fgregg commented Mar 12, 2017

Sorry, what's the status of this, @wleftwich? Do you think this still improves the library?

@wleftwich
Author

Hi Forest -
It's been a while since I did any deduping, so I don't remember exactly. But with my data I could not get a consistent value for alpha, in spite of adding repetitions to the cross-validation. I ended up not using gridSearch and specified alpha manually, based on multiple training sessions.

  • Wade
