Combine Words With Nearest Index

June 21, 2023

I have a file of terms, each with its index in context, in two languages, in this format: 1. (2- human rights, 10- workers rights)>> (3- droits de l'homme, 7- droit des travailleu

Solution 1:

Not complete, but this should get you started:

from bisect import bisect
import re

with open("test.txt") as f:
    r = re.compile(r"\d+")  # matches the numeric index in front of each term
    for line in f:
        # drop the leading "1. " line number, then split the two languages
        a, b = line.lstrip("0123456789. ").split(">> ")
        a_keys = [int(m.group()) for m in r.finditer(a)]
        b_keys = [int(m.group()) for m in r.finditer(b)]
        a = a.strip("()\n").split(",")
        b = b.strip("()\n").split(",")
        for ele, k in zip(a, a_keys):
            # insertion point of k in b_keys, capped at the last valid index
            ind = bisect(b_keys, k, hi=len(b) - 1)
            print("{} -> {}".format(ele, b[ind]))

Output:

2- human rights -> 3- droits de l'homme
10- workers rights -> 7- droit des travailleurs
2- human rights -> 5- droits de l'homme
10- workers rights -> 15- les droits des femmes
19- women rights -> 15- les droits des femmes

You still need to tidy the formatting and add one more check: compare the absolute differences for the elements at ind and ind - 1, and keep whichever is smaller.

To catch the cases where the previous element, at ind - 1, has the smaller absolute difference:

from bisect import bisect
import re

with open("test.txt") as f:
    r = re.compile(r"\d+")  # matches the numeric index in front of each term
    for line in f:
        a, b = line.lstrip("0123456789. ").split(">> ")
        a_keys = [int(m.group()) for m in r.finditer(a)]
        b_keys = [int(m.group()) for m in r.finditer(b)]
        a = a.strip("()\n").split(",")
        b = b.strip("()\n").split(",")
        for ele, k in zip(a, a_keys):
            ind = bisect(b_keys, k, hi=len(b) - 1)
            # step back if the previous key is closer; the ind > 0 guard
            # stops ind - 1 from wrapping around to the last element
            if ind > 0 and k - b_keys[ind - 1] < b_keys[ind] - k:
                ind -= 1
            print("{} -> {}".format(ele, b[ind]))

So for:

1. (2- human rights, 10- workers rights)>> (3- droits de l'homme, 7- droit des travailleurs)
2. (2- human rights, 10- workers rights, 19- women rights)>> (1- droits de l'homme ,4- foobar, 15- les droits des femmes)

We get:

2- human rights -> 3- droits de l'homme
10- workers rights -> 7- droit des travailleurs
2- human rights -> 1- droits de l'homme
10- workers rights -> 15- les droits des femmes
19- women rights -> 15- les droits des femmes

For the second line, the original code would output 2- human rights -> 4- foobar, because it did not check whether the previous element had the smaller absolute difference.
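To try this without a file on disk, the same idea can be wrapped in a function that takes one line of text. This is only a sketch: pair_terms, parse_side and TERM are made-up names, and it assumes every term is written as "index- text", terms are separated by commas, the right-hand side is non-empty, and its indices are in ascending order.

```python
from bisect import bisect
import re

# Hypothetical pattern: an index, a hyphen, then the term text up to the
# next comma or parenthesis. The leading "1." line number does not match
# because it is followed by a dot, not a hyphen.
TERM = re.compile(r"(\d+)-\s*([^,()]+)")

def parse_side(side):
    """Return [(index, text), ...] for one side of the '>>' separator."""
    return [(int(num), text.strip()) for num, text in TERM.findall(side)]

def pair_terms(line):
    """Pair each left-hand term with the right-hand term of nearest index."""
    left, right = line.split(">>")
    a = parse_side(left)
    b = parse_side(right)
    b_keys = [k for k, _ in b]  # assumed sorted ascending
    pairs = []
    for k, text in a:
        ind = bisect(b_keys, k, hi=len(b_keys) - 1)
        # step back when the previous key is strictly closer
        if ind > 0 and k - b_keys[ind - 1] < b_keys[ind] - k:
            ind -= 1
        pairs.append(((k, text), b[ind]))
    return pairs

line = ("2. (2- human rights, 10- workers rights, 19- women rights)>> "
        "(1- droits de l'homme ,4- foobar, 15- les droits des femmes)")
for (k, text), (bk, btext) in pair_terms(line):
    print("{}- {} -> {}- {}".format(k, text, bk, btext))
```

Returning tuples rather than printing keeps the index and the text separate, which makes the formatting easy to change later.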

Using the data in your comment shows the difference:

from bisect import bisect

l1 = [10, 33, 50, 67]
l2 = [7, 16, 29, 55]

for s in l1:
    ind = bisect(l2, s, hi=len(l2) - 1)
    print("{} -> {}".format(s, l2[ind]))

Output:

10 -> 16
33 -> 55
50 -> 55
67 -> 55

Now with checking the previous element:

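from bisect import bisect

l1 = [10, 33, 50, 67]
l2 = [7, 16, 29, 55]

for s in l1:
    ind = bisect(l2, s, hi=len(l2) - 1)
    # step back if the previous element is closer (guard against ind == 0)
    if ind > 0 and s - l2[ind - 1] < l2[ind] - s:
        ind -= 1
    print("{} -> {}".format(s, l2[ind]))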

Output:

10 -> 7
33 -> 29
50 -> 55
67 -> 55

bisect.bisect

Similar to bisect_left(), but returns an insertion point which comes after (to the right of) any existing entries of x in a. The returned insertion point i partitions the array a into two halves so that all(val <= x for val in a[lo:i]) for the left side and all(val > x for val in a[i:hi]) for the right side.
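A quick way to see that partition is to slice a list at the returned insertion point. The snippet below is just an illustration using the l2 values from above; split_at is a made-up helper name.

```python
from bisect import bisect, bisect_left

keys = [7, 16, 29, 55]

def split_at(x):
    """Split keys at the insertion point that bisect returns for x."""
    i = bisect(keys, x)
    return keys[:i], keys[i:]

# 33 is not in the list: everything <= 33 goes left, everything > 33 goes right.
missing = split_at(33)
# 16 is in the list: bisect lands after it, so 16 ends up in the left half.
present = split_at(16)
# bisect_left lands on the existing 16 instead of after it.
left_of_present = bisect_left(keys, 16)

print(missing)          # ([7, 16, 29], [55])
print(present)          # ([7, 16], [29, 55])
print(left_of_present)  # 1
```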

So bisecting tells you where the element would land in your ordered list of numbers: everything to the left of that position is less than or equal to the element, and everything to the right is greater. The closest value is therefore either the one at that position or the one just before it, so we also need to check the previous element, as its absolute difference may be smaller.
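That reasoning can also be written as a small standalone helper. This is a sketch rather than the code used above: nearest is a made-up name, it assumes keys is sorted and non-empty, and it handles both ends of the list explicitly instead of capping with hi. Ties go to the larger value, matching the strict < comparison used earlier.

```python
from bisect import bisect

def nearest(keys, x):
    """Return the value in sorted, non-empty keys that is closest to x."""
    i = bisect(keys, x)
    if i == 0:
        # x is smaller than every key, so there is no previous element
        return keys[0]
    if i == len(keys):
        # x is at least as large as every key, so there is no next element
        return keys[-1]
    before, after = keys[i - 1], keys[i]
    # before <= x < after here, so both differences are non-negative
    return before if x - before < after - x else after

l2 = [7, 16, 29, 55]
for s in [10, 33, 50, 67]:
    print("{} -> {}".format(s, nearest(l2, s)))
```

Checking i == 0 and i == len(keys) up front avoids the negative-index wraparound that keys[i - 1] would otherwise cause at the start of the list.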
