Skip to content Skip to sidebar Skip to footer

Recommendation For Mailing Address Matching Scenario?

My SQL server contains 2 tables containing a similar set of fields for a mailing (physical) address. NB these tables are populated before the data gets to my database (can't change

Solution 1:

Ideally I'd like to a simple way to call a "function" which returns either a boolean or a confidence level of match (0.0 - 1.0).

A similarity metric is what you're looking for. You can use Distance Metrics to calculate similarity. The Levenshtein Distance, Damerau-Levenshtein Distance and Hamming Distance are examples of Distance Metrics.

Given the shortest of the two: M the shorter of the two, N the longest, and your distance metric (D) you can measure string Similarity using (M-D)/N. You can also use the Longest Common subsequence or Longest Common Substring (LCS) to measure similarity by dividing LCS/N.

If you can use CLRs I HIGHLY recommend mdq.similarity which you can get from here. It will give a similarity metric using these algorithms:

  1. The Damarau-Levenshtein distance (the documentation only says, "Levenshtein" but they are mistaken)
  2. The Jaccard Similarity coefficient algorithm.
  3. a form of the Jaro-Winkler distance algorithm. 4 a longest common subsequence algorithm (which grows by one when transpositions are involved)

If performance is important (these metrics can be quite slow depending on what you're feeding them) then I would get familiar with my Bernie function. It's designed to help measure similarity using any of the aforementioned algorithms much, much faster. Bernie is 100% open source and can be easily re-created in any language (Python, C#, etc.) Ditto my N-Grams function.

You can easily create your own metric using NGrams8K.

For pure T-SQL versions of Levenshtein or the Longest Common Subsequence you can check Phil Factor's blog. (Note these cannot compete with the CLR I mentioned).

I'll stop for now. The best advice can be given after we better understand what is making the strings different (note my question under your comment).

Post a Comment for "Recommendation For Mailing Address Matching Scenario?"