Question
Hi, I am working with Python 3 and I've been facing this issue for a while now, and I can't seem to figure it out.
I have two numpy arrays containing strings:
array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
If you notice, array_one is actually an array containing the 1-grams, 2-grams, 3-grams, 4-grams and the 5-gram of the sentence "alice in a wonder land".
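For readers who want to reproduce array_one, here is a minimal sketch of how such an array could be built from the sentence. The helper name word_ngrams is hypothetical and not part of the original question; it only assumes that n-grams are formed from whitespace-separated words.

```python
import numpy as np

def word_ngrams(sentence, max_n=None):
    """Return all word n-grams of a sentence, shortest first (hypothetical helper)."""
    words = sentence.split()
    max_n = max_n or len(words)
    grams = []
    for n in range(1, max_n + 1):
        # slide a window of n words across the sentence
        for start in range(len(words) - n + 1):
            grams.append(' '.join(words[start:start + n]))
    return grams

array_one = np.array(word_ngrams('alice in a wonder land'))
print(len(array_one))  # 15
```

A five-word sentence yields 5 + 4 + 3 + 2 + 1 = 15 n-grams, which matches the length of array_one above.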
Now I have another numpy array that contains some locations and names.
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])
Now what I want to do is get all the elements in array_one that exist in array_two.
If I take the intersection of the two arrays using np.intersect1d, I don't get any matches, since wonderland is two separate words in array_one while in array_two it's a single word.
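To make the problem concrete, this short snippet runs np.intersect1d on the two arrays from the question. It only illustrates the mismatch described above; nothing in it goes beyond the original data.

```python
import numpy as np

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a',
                      'a wonder', 'wonder land', 'alice in a', 'in a wonder',
                      'a wonder land', 'alice in a wonder', 'in a wonder land',
                      'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# intersect1d compares strings exactly, so 'wonder land' != 'wonderland'
common = np.intersect1d(array_one, array_two)
print(common.size)  # 0
```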
Is there any way to do this? I've tried solutions from Stack Overflow (this) but they don't seem to work with Python 3.
Edit
Since I wasn't able to find a solution until now, I've used a very naive approach: I removed the white space from both arrays and then used the resulting boolean array ([True, False, True]) to filter the original array. Below is the code:
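import numpy as np
# np.char.replace is the public name for numpy.core.defchararray.replace
# strip spaces so that 'wonder land' and 'wonderland' compare equal
array_two_wr = np.char.replace(array_two, ' ', '')
array_one_wr = np.char.replace(array_one, ' ', '')
# keep the members of array_two whose space-free form occurs in array_one
# (np.isin is the current replacement for np.in1d)
intersections = array_two[np.isin(array_two_wr, array_one_wr)]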
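Note that the snippet above returns members of array_two, whereas the question asks for members of array_one. A self-contained sketch of both directions follows; the variable names one_nospace, two_nospace, from_one and from_two are made up for illustration, and it assumes that removing plain spaces is the only normalization needed.

```python
import numpy as np

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a',
                      'a wonder', 'wonder land', 'alice in a', 'in a wonder',
                      'a wonder land', 'alice in a wonder', 'in a wonder land',
                      'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# space-free copies used only for comparison
one_nospace = np.char.replace(array_one, ' ', '')
two_nospace = np.char.replace(array_two, ' ', '')

# mask array_two to get the matching locations/names
from_two = array_two[np.isin(two_nospace, one_nospace)]
# mask array_one to get the matching n-grams in their original spelling
from_one = array_one[np.isin(one_nospace, two_nospace)]

print(from_two.tolist())  # ['wonderland']
print(from_one.tolist())  # ['wonder land']
```

Swapping the arguments of np.isin and indexing the other array is all it takes to switch direction.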
Recommended answer
Sorry to post two answers, but after adding the locality-sensitive-hashing technique above, I realized you could exploit the class separation in your data (query vectors and potential matching vectors) by using a Bloom filter.
A Bloom filter is a beautiful object that lets you pass in some objects, then query to see whether a given object has been added to the Bloom filter. Here's an awesome visual demo of a Bloom filter.
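To give a feel for what happens inside, here is a toy Bloom filter written from scratch. It is only an illustrative sketch: the class name TinyBloom, its parameters, and the SHA-256 based hashing are my own assumptions and are not how the bloom-filter package is implemented. The key property is that a lookup can return a false positive but never a false negative.

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter for illustration (hypothetical, not the bloom-filter package)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # derive num_hashes positions by salting SHA-256 with the hash index
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f'{seed}:{item}'.encode('utf-8')).digest()
            yield int.from_bytes(digest[:8], 'big') % self.num_bits

    def add(self, item):
        # set every position the item hashes to
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly added"; False means "definitely not added"
        return all(self.bits[pos] for pos in self._positions(item))

bloom = TinyBloom()
bloom.add('wonderland')
print('wonderland' in bloom)  # True
```

Because only a fixed-size bit array is stored, memory use does not grow with the length of the strings added, which is what makes the structure attractive for large vocabularies.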
In your case we can add each member of array_two to the Bloom filter, then query each member of array_one to see whether it's in the Bloom filter. Using pip install bloom-filter:
from bloom_filter import BloomFilter  # pip install bloom-filter
import numpy as np
import re

def clean(s):
    '''Remove all whitespace from a string'''
    return re.sub(r'\s+', '', s)

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])
# initialize bloom filter with particular size
bloom = BloomFilter(max_elements=10000, error_rate=0.1)
# add each member of array_two to bloom filter
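for i in array_two:
    bloom.add(clean(i))
# find the members of array_one whose cleaned form is (probably) in array_two;
# str() keeps the printed result as plain Python strings across NumPy versions
matches = [str(i) for i in array_one if clean(i) in bloom]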
print(matches)
Result: ['wonder land']
Depending on your requirements, this could be a very efficient (and highly scalable) solution.
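One caveat: a Bloom filter can report false positives (the error_rate argument above controls how often). If exact answers are required and array_two fits comfortably in memory, a plain Python set gives the same lookup pattern without false positives. This is a swapped-in alternative rather than part of the original answer, and the name lookup is made up for the example.

```python
import re
import numpy as np

def clean(s):
    """Remove all whitespace from a string."""
    return re.sub(r'\s+', '', s)

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a',
                      'a wonder', 'wonder land', 'alice in a', 'in a wonder',
                      'a wonder land', 'alice in a wonder', 'in a wonder land',
                      'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# exact membership structure: stores every cleaned string of array_two
lookup = {clean(s) for s in array_two}

# keep the members of array_one whose cleaned form is in the set
matches = [str(s) for s in array_one if clean(s) in lookup]
print(matches)  # ['wonder land']
```

The trade-off is memory: the set keeps every string, while the Bloom filter keeps only a fixed-size bit array.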