Hi, I am working with Python 3 and I've been facing this issue for a while now, and I can't seem to figure it out.
array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
If you notice, array_one is actually an array containing the 1-grams, 2-grams, 3-grams, 4-grams and 5-grams of the sentence "alice in a wonder land".
Now I have another NumPy array that contains some locations and names.
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])
Now what I want to do is get all the elements in array_one that exist in array_two.
If I take the intersection of the two arrays using np.intersect1d, I don't get any matches, since wonderland is two separate words in array_one while in array_two it's a single word.
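To illustrate, here is a minimal check of that behaviour. It uses only a shortened subset of array_one for brevity and assumes NumPy is imported as np; the variable name common is just for this example.

```python
import numpy as np

# A shortened subset of the n-grams, enough to show the mismatch
array_one = np.array(['wonder', 'land', 'wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# Exact string comparison: 'wonder land' != 'wonderland'
common = np.intersect1d(array_one, array_two)
print(common.size)  # 0
```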
Is there any way to do this? I've tried solutions from Stack Overflow (this), but they don't seem to work with Python 3.
I've used a very naive approach since I wasn't able to find a solution until now: I removed the whitespace from both arrays and then used the resulting boolean array (e.g. [True, False, True]) to filter the original array. Below is the code:
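import numpy as np
# Strip spaces so that 'wonder land' and 'wonderland' compare equal
array_two_wr = np.char.replace(array_two, ' ', '')
array_one_wr = np.char.replace(array_one, ' ', '')
# Keep the entries of array_two whose stripped form also occurs in array_one
intersections = array_two[np.isin(array_two_wr, array_one_wr)]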
Sorry to post two answers, but after adding the locality-sensitive-hashing technique above, I realized you could exploit the class separation in your data (query vectors and potential matching vectors) by using a bloom filter.
A Bloom filter is a beautiful object that lets you add some objects, then query whether a given object has been added. It can report false positives (at a rate you choose) but never false negatives. Here's an awesome visual demo of a Bloom filter.
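To make the idea concrete, here is a toy sketch of how such a filter can work internally. ToyBloomFilter, num_bits and num_hashes are hypothetical names for illustration only; this is not how the bloom-filter package is implemented, just the general technique using hashlib from the standard library.

```python
import hashlib


class ToyBloomFilter:
    """Illustrative Bloom filter (hypothetical class, not the bloom-filter package)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 hash with an index
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True means "possibly added"; False means "definitely not added"
        return all(self.bits[pos] for pos in self._positions(item))


toy = ToyBloomFilter()
for place in ["newyork", "lasvegas", "wonderland", "florida"]:
    toy.add(place)

print("wonderland" in toy)  # True: added items are always reported as present
```

An item that was never added usually yields False, but it can yield True if its bit positions happen to collide with those of added items. That is the false-positive case.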
In your case we can add each member of array_two to the bloom filter, then query each member of array_one to see whether it's in the bloom filter. Using pip install bloom-filter:
from bloom_filter import BloomFilter  # pip install bloom-filter
import numpy as np
import re
def clean(s):
    '''Remove all whitespace from a string'''
    return re.sub(r'\s+', '', s)
array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])
# initialize bloom filter with a capacity and a false-positive rate
bloom = BloomFilter(max_elements=10000, error_rate=0.1)
# add each member of array_two to bloom filter
for i in array_two:
    bloom.add(clean(i))
# find the members of array_one that are (probably) in array_two
matches = [i for i in array_one if clean(i) in bloom]
Depending on your requirements, this could be a very efficient (and highly-scalable) solution.
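If false positives are a concern and array_two fits comfortably in memory, an exact check with a plain Python set is a simple alternative. This is a sketch of a different technique, not a Bloom filter; the names lookup and matches are just for this example, and plain lists stand in for the NumPy arrays.

```python
import re


def clean(s):
    """Remove all whitespace from a string."""
    return re.sub(r"\s+", "", s)


array_one = ['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a',
             'a wonder', 'wonder land', 'alice in a', 'in a wonder',
             'a wonder land', 'alice in a wonder', 'in a wonder land',
             'alice in a wonder land']
array_two = ['new york', 'las vegas', 'wonderland', 'florida']

# Exact membership test: no false positives, at the cost of storing every key
lookup = {clean(name) for name in array_two}
matches = [gram for gram in array_one if clean(gram) in lookup]
print(matches)  # ['wonder land']
```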