Tag: Record Linkage
-
How Machine Learning Products are Different (Part 2, Entity Resolution Checklists)
Last time I talked a bit about the context of my experience with Machine Learning products and the high level issues we had getting customers to switch to what was clearly a better product. This time I get into some technical examples from these checklists and try to demonstrate the conflict. The first and most…
-
How Machine Learning Products are Different (Part 1, High Level)
First some context. My direct experience over the last decade+ has been mostly B2B selling enterprise grade software to large corporates and so my perspective is skewed this way. I have had many conversations with folks working in the small business and consumer spaces and know those can be very different worlds. When you sell…
-
Developing an Algorithm in F#: Fast Rotational Alignments with Gosper’s Hack
This post is for the 7th day of the 2014 F# Advent Calendar. It has been said that functional languages can’t be as fast as their imperative cousins because of all of the allocation and garbage collection, this is patently false (as far as F# is concerned at least) largely because the CLR has value types.…
-
Levenshtein Distance and the Triangle Inequality
Levenshtein distance is one of my favorite algorithms. On the surface it seems so very simple, but when you spend some time thinking hard on it deep insights are waiting to be had. The first and most important thing about Levenshtein distance is it’s actually a metric distance. That is, it obeys the triangle inequality. For…
-
Record Linkage Algorithms in F# – Extensions to Jaro-Winkler Distance (Part 3)
While writing the previous article on tokenized matching I realized I left out some important background information on Jaro-Winkler distance. First, there’s something important to know about the Jaro-Winkler distance: it’s not a metric distance and so does not obey the triangle inequality. That is, if you found the JW distance between strings A and B, and then…
-
Imperative Pseudocode to Pure Functional Algorithm with Gale-Shapely and F#
Continuing from last time, let’s look at how one goes from imperative pseudocode to pure functional using Gale-Shapely as an example. Overall, to convert an algorithm from imperative to functional is a fairly simple process once you understand how to convert from a while loop to recursion with accumulators. This post is just a more advanced…
-
Record Linkage in F# – Token Matching, Stable Marriages and the Gale-Shapley algorithm
This post was originally published September 13th, 2011. Initially, one of the biggest problems I found when trying to marry records was the god awful quality of much of data I often have to work with. It’s mostly old mainframe and database data with truncated fields, limited character sets and fields with nonsensical contents. Even…
-
Record Linkage Algorithms in F# – Jaro-Winkler Distance (Part 2)
Last time we dove into the Jaro distance algorithm and picked apart how each of its components are calculated. However, from a modern perspective Jaro alone is a rather weak method of string matching. It was Winkler’s extension that brought this algorithm into widespread modern use. Matthew Jaro’s insight when inventing the Jaro distance algorithm was that…
-
Record Linkage Algorithms in F# – Jaro-Winkler Distance (Part 1)
When first approaching the task of record linkage I was initially overwhelmed by the huge number of different algorithms available for comparing strings. Now I know that the secret to finding your way in this sea of algorithms is two fold. First, know that many are outdated and have newer and better implementations, so they can be…