Modify Levenshtein-distance To Ignore Order
I'm looking to compute the Levenshtein distance between sequences containing up to 6 values. The order of these values should not affect the distance. How would I implement this?
Solution 1:
You don't need the Levenshtein machinery for this.
import collections

def distance(s1, s2):
    cnt = collections.Counter()
    for c in s1:
        cnt[c] += 1
    for c in s2:
        cnt[c] -= 1
    return sum(abs(diff) for diff in cnt.values()) // 2 + \
           (abs(sum(cnt.values())) + 1) // 2  # can be omitted if len(s1) == len(s2)
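One way to gain confidence in the counting formula above is to compare it against a brute-force search. The following is a minimal sketch, not part of the original answer: it assumes "ignoring order" means taking the smallest ordinary Levenshtein distance over every reordering of the first sequence, which is feasible here because 6 values give at most 720 permutations. The names `levenshtein`, `unordered_distance` and `brute_force` are made up for illustration.

```python
from collections import Counter
from itertools import permutations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def unordered_distance(s1, s2):
    """Same counting formula as the answer above, written with Counter.subtract."""
    cnt = Counter(s1)
    cnt.subtract(s2)  # keeps negative counts, unlike the `-` operator
    return (sum(abs(d) for d in cnt.values()) // 2
            + (abs(sum(cnt.values())) + 1) // 2)


def brute_force(s1, s2):
    """Smallest Levenshtein distance over every reordering of s1 (illustrative only)."""
    return min(levenshtein(p, s2) for p in permutations(s1))


# Spot checks: the two approaches should agree on each pair.
for a, b in [("abcdef", "fedcba"), ("aab", "bc"), ("", "abc"), ("abca", "cxya")]:
    assert unordered_distance(a, b) == brute_force(a, b)
```

The brute-force version is only there as a cross-check; for real use the counting version is linear in the input length and needs no permutations.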
Solution 2:
Why not just count how many letters are in common, and find an answer from this? For each character calculate its frequency in each string, then for each string calculate how many "extra" characters it has compared with the other, and take the maximum of these two "extra" counts.
Pseudocode:
for c in s1:
    cnt1[c]++
for c in s2:
    cnt2[c]++
extra1 = 0
extra2 = 0
for c in all_chars:
    if cnt1[c] > cnt2[c]
        extra1 += cnt1[c] - cnt2[c]
    else
        extra2 += cnt2[c] - cnt1[c]
return max(extra1, extra2)
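The pseudocode above could be written in Python roughly as follows. This is a sketch rather than the answerer's own code: the function name `extra_count_distance` is hypothetical, and it relies on the fact that subtracting two `collections.Counter` objects with `-` keeps only the positive counts, which is exactly the "extra" amount on each side.

```python
from collections import Counter


def extra_count_distance(s1, s2):
    """Order-insensitive distance: the larger of the two 'extra' counts."""
    cnt1, cnt2 = Counter(s1), Counter(s2)
    # Counter subtraction drops zero and negative results, so each
    # difference holds only the surplus items of one side.
    extra1 = sum((cnt1 - cnt2).values())
    extra2 = sum((cnt2 - cnt1).values())
    return max(extra1, extra2)


print(extra_count_distance([1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]))  # 0
print(extra_count_distance([1, 2, 3], [1, 2, 4]))                    # 1
print(extra_count_distance([1, 1, 2], [2]))                          # 2
```

Taking the maximum works because one substitution can fix a surplus item on each side at once, while whatever remains on the longer side needs one insertion or deletion per item.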
Solution 3:
This might be late, but I think it can help someone, and I am also still looking for an improvement. The challenge I had was:
match_function('kigali rwanda', 'rwanda kigali'): probable match percentage should be 100%
match_function('kigali', 'ligaki'): probable match percentage should be over 50%
I wrote a funny function in T-SQL using a cross join and Levenshtein. It helped at some point, but I still need an improvement:
CREATE FUNCTION [dbo].[GetPercentageMatch](@left VARCHAR(100), @right VARCHAR(100))
RETURNS DECIMAL
AS
BEGIN
    DECLARE @returnvalue DECIMAL(5, 2);
    DECLARE @list1 TABLE (value VARCHAR(50));
    DECLARE @count1 int, @count2 int, @matchPerc int;

    INSERT INTO @list1 (value) SELECT value FROM STRING_SPLIT(@left, ' ');

    DECLARE @list2 TABLE (value VARCHAR(50));
    INSERT INTO @list2 (value) SELECT * FROM STRING_SPLIT(@right, ' ');

    SELECT @count1 = count(*) FROM @list1
    SELECT @count2 = count(*) FROM @list2

    SELECT @matchPerc = (r3.percSum / CASE WHEN @count1 > @count2 THEN @count1 ELSE @count2 END)
    FROM (
        SELECT count(r2.l1) rCount, sum(r2.perc) percSum
        FROM (
            SELECT r.t1, r.t2, r.distance,
                   (100 - ((r.distance * 100) / (CASE WHEN len(r.t1) > len(r.t2) THEN len(r.t1) ELSE len(r.t2) END))) perc,
                   len(r.t1) l1, len(r.t2) l2
            FROM (
                SELECT isnull(t1.value, '') t1, isnull(t2.value, '') t2,
                       [dbo].[LEVENSHTEIN](isnull(t1.value, ''), isnull(t2.value, '')) distance
                FROM @list1 t1 CROSS JOIN @list2 t2
            ) AS r
        ) r2
    ) r3

    RETURN CASE WHEN @matchPerc > 100 THEN 100 ELSE @matchPerc END
END;
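For readers who want to experiment outside SQL Server, here is a rough Python port of the scoring logic in the T-SQL function above. It is a sketch under assumptions: `levenshtein` stands in for `[dbo].[LEVENSHTEIN]`, whose definition is not shown in the post, `percentage_match` is a made-up name, integer division `//` is used to mimic the integer arithmetic in T-SQL, and pairs of empty tokens are skipped where the SQL version would hit a divide-by-zero.

```python
def levenshtein(a, b):
    """Plain edit distance; assumed stand-in for [dbo].[LEVENSHTEIN]."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def percentage_match(left, right):
    """Sum per-pair similarity over all token pairs, then normalise and cap."""
    tokens1 = left.split(" ")   # split(" ") keeps empty tokens
    tokens2 = right.split(" ")
    perc_sum = 0
    for t1 in tokens1:          # nested loops play the role of CROSS JOIN
        for t2 in tokens2:
            longest = max(len(t1), len(t2))
            if longest == 0:
                continue        # assumption: skip empty/empty pairs
            perc_sum += 100 - (levenshtein(t1, t2) * 100) // longest
    match = perc_sum // max(len(tokens1), len(tokens2))
    return min(match, 100)      # cap at 100, as the RETURN CASE does


print(percentage_match("kigali rwanda", "rwanda kigali"))  # 100
print(percentage_match("kigali", "ligaki"))                # 67
```

Because every pair in the cross join contributes to the sum, tokens that are only partly similar also raise the score, which is why the raw value can exceed 100 before the cap. That may be one place to look for the improvement the author asks about.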