srakamin.blogg.se - String similarity

#String similarity generator#

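On an i7 (8 virtual cores, 4 physical), parallelizing the pairwise comparison could probably give you a ~4-8x speedup, but that depends on a lot of factors. The least intrusive way to do this is to use multiprocessing.Pool.map or .imap_unordered:

    import csv
    import itertools
    import multiprocessing as mp

    from fuzzywuzzy import fuzz

    def ratio(strings):
        # Each task receives one (s1, s2) pair and returns it with its score,
        # so results can be consumed in whatever order the workers finish.
        s1, s2 = strings
        return s1, s2, fuzz.token_sort_ratio(s1, s2)

    # f1 is the already-open output file; Dishes is the list of dish names.
    writer = csv.writer(f1, delimiter='\t', lineterminator='\n')
    with mp.Pool() as pool:
        for s1, s2, r in pool.imap_unordered(ratio, itertools.combinations(Dishes, 2)):
            writer.writerow([s1, s2, r])  # filter on r here if only close matches are wanted

All told, I'd expect these changes to give you a 5-10x speedup, depending heavily on the number of cores you have available.

For reference, a generator comprehension version of this might look something like:

    with mp.Pool() as pool:
        writer.writerows((s1, s2, r) for s1, s2, r in pool.imap_unordered(ratio, itertools.combinations(Dishes, 2)))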
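The pool-based approach can also be tried with the standard library alone. What follows is a minimal sketch, not the original answer's code: it swaps fuzz.token_sort_ratio for a difflib-based stand-in, so scores may differ from fuzzywuzzy's. The normalize helper, the similar_pairs function, the default threshold of 85, the chunksize value, and the sample dishes are all assumptions made for illustration.

```python
import itertools
import multiprocessing as mp
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and sort the words.

    This is only a rough stand-in for fuzzywuzzy's token-sort preprocessing.
    """
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return " ".join(sorted(cleaned.split()))


def ratio(pair):
    """Score one pair on a 0-100 scale and return the pair with its score."""
    s1, s2 = pair
    score = SequenceMatcher(None, normalize(s1), normalize(s2)).ratio()
    return s1, s2, int(round(100 * score))


def similar_pairs(names, threshold=85, processes=None):
    """Compare every unordered pair in a worker pool and keep close matches."""
    pairs = itertools.combinations(names, 2)
    with mp.Pool(processes) as pool:
        # chunksize batches several pairs per task to reduce
        # inter-process overhead; 256 is an arbitrary starting point.
        results = pool.imap_unordered(ratio, pairs, chunksize=256)
        # Consume the results before the pool is torn down.
        return sorted(r for r in results if r[2] >= threshold)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the original post.
    dishes = ["pizza with bacon", "bacon with pizza", "vegetarian pizza"]
    for s1, s2, r in similar_pairs(dishes):
        print(s1, s2, r, sep="\t")
```

The worker function lives at module level so that it can be pickled and sent to the worker processes, and the pool is only created under the main guard, which matters on platforms that start workers by re-importing the script.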

#String similarity code#

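The first algorithmic recommendation is to use itertools.combinations instead of itertools.permutations, since you don't care about order. This assumes fuzz.token_sort_ratio(str_1, str_2) == fuzz.token_sort_ratio(str_2, str_1). There are half as many combinations as there are permutations, so that gives you a free 2x speedup. The pairwise comparison also lends itself easily to parallelization.

To avoid processing each dish so many times, you can use this to process them only once:

    import itertools
    from fuzzywuzzy import fuzz

    # dishes = [...]  (the list of dish names is elided in the source)

    processedDishes = []
    for dish in dishes:
        processedDishes.append(fuzz._process_and_sort(dish, True, True))

    for dish1, dish2 in itertools.combinations(enumerate(processedDishes), 2):
        # dish1 and dish2 are (index, processed string) tuples
        if fuzz.ratio(dish1[1], dish2[1]) >= 85:
            print(dishes[dish1[0]], dishes[dish2[0]])

You can then combine this with the multiprocessing solution.

If you know your data is all the same type, you can further optimize it by building one matcher per dish and reusing it:

    processedDishes = []
    matchers = []
    for dish in dishes:
        processedDish = fuzz._process_and_sort(dish, True, True)
        processedDishes.append(processedDish)
        matchers.append(fuzz.SequenceMatcher(None, processedDish))

    for dish1, dish2 in itertools.combinations(enumerate(processedDishes), 2):
        matcher = matchers[dish1[0]]  # matcher already holds the first string
        matcher.set_seq2(dish2[1])
        ratio = int(round(100 * matcher.ratio()))
        if ratio >= 85:
            print(dishes[dish1[0]], dishes[dish2[0]])

Update: I checked how these ratios are calculated; here is a more efficient answer that avoids a lot of checks between pairs:

    processedDishes = []
    for dish in dishes:
        processedDish = fuzz._process_and_sort(dish, True, True)
        if processedDish:  # skip empty strings so the bound below never divides by zero
            processedDishes.append(processedDish)
    processedDishes.sort(key=len)

    for idx, dish in enumerate(processedDishes):
        length = len(dish)
        matcher = fuzz.SequenceMatcher(None, dish)
        for idx2 in range(idx + 1, len(processedDishes)):
            dish2 = processedDishes[idx2]
            if 2 * length / (length + len(dish2)) < 0.85:  # upper bound on the ratio
                break  # every remaining string is at least as long
            matcher.set_seq2(dish2)
            if matcher.quick_ratio() >= 0.85 and matcher.ratio() >= 0.85:  # should also try without quick_ratio() check
                print(dish, dish2)  # prints the processed strings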
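The length cutoff 2 * length / (length + len(dish2)) works because a SequenceMatcher ratio is 2 * M / T, where M is the number of matched characters and T is the combined length of both strings. M can never exceed the length of the shorter string, so once the strings are sorted by length the expression is an upper bound on the ratio, and the inner loop can stop as soon as the bound drops below the threshold. The sketch below illustrates this with difflib from the standard library as a stand-in for fuzzywuzzy; the normalize helper, the similar_pairs_pruned function, and the sample data are assumptions for illustration, and scores may differ from fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Rough stand-in for fuzzywuzzy's token-sort preprocessing (an assumption)."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return " ".join(sorted(cleaned.split()))


def similar_pairs_pruned(names, threshold=0.85):
    """Return (name_a, name_b) pairs whose ratio is at least `threshold`."""
    # Normalize each name once, drop names that normalize to nothing,
    # and sort by normalized length (shortest first).
    items = [(normalize(name), name) for name in names]
    items = sorted((it for it in items if it[0]), key=lambda it: len(it[0]))

    found = []
    for i, (text_a, name_a) in enumerate(items):
        matcher = SequenceMatcher(None, text_a, autojunk=False)
        for text_b, name_b in items[i + 1:]:
            # ratio = 2 * M / T with M <= len(text_a), so this is an upper bound.
            bound = 2 * len(text_a) / (len(text_a) + len(text_b))
            if bound < threshold:
                break  # all later candidates are at least as long
            matcher.set_seq2(text_b)
            if matcher.ratio() >= threshold:
                found.append((name_a, name_b))
    return found


# Hypothetical sample data, not taken from the original post.
sample = [
    "pizza with bacon",
    "bacon with pizza",
    "vegetarian pizza",
    "pizza with extra cheese",
]
matches = similar_pairs_pruned(sample)
```

With the sample above, "pizza with extra cheese" normalizes to 23 characters against 16 for the others, so its bound is 32/39 (about 0.82) and it is skipped without ever running the full comparison.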