Matching strings from one numpy array to another


Question

Hi, I am working with Python 3 and have been facing this issue for a while now, and I can't seem to figure it out.

I have two arrays containing strings. The first one is:

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])

As you may notice, array_one contains the 1-grams, 2-grams, 3-grams, 4-grams and the 5-gram of the sentence alice in a wonder land.

I also have another numpy array that contains some locations and names:

array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

What I want to do is get all the elements of array_one that exist in array_two.

If I take the intersection of the two arrays using np.intersect1d, I don't get any matches, because wonderland is two separate words in array_one while in array_two it is a single word.

Is there any way to do this? I've tried solutions from Stack Overflow, but they don't seem to work with Python 3.

Edit

Since I haven't been able to find a solution so far, I've used a very naive approach: I remove the whitespace from both arrays, then use the resulting boolean array ([True, False, True]) to filter the original array. Below is the code:

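import numpy as np

# Strip spaces so that 'wonder land' and 'wonderland' compare equal.
# np.char exposes the same functions as numpy.core.defchararray.
array_two_wr = np.char.replace(array_two, ' ', '')
array_one_wr = np.char.replace(array_one, ' ', '')
# np.isin is the recommended replacement for np.in1d.
# This keeps the entries of array_two whose stripped form occurs in array_one.
intersections = array_two[np.isin(array_two_wr, array_one_wr)]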
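For reference, here is a hedged sketch of the same whitespace-stripping idea that returns the elements of array_one rather than array_two, which is what the question asks for. It assumes np.char.replace and np.isin are available (NumPy 1.13 or later) and uses a shortened array_one for brevity.

```python
import numpy as np

# Shortened sample data (an assumption for illustration; the full arrays work the same way).
array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'wonder land', 'a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# Strip spaces so that 'wonder land' and 'wonderland' compare equal.
array_one_wr = np.char.replace(array_one, ' ', '')
array_two_wr = np.char.replace(array_two, ' ', '')

# Boolean mask over array_one: True where the stripped form occurs in array_two.
mask = np.isin(array_one_wr, array_two_wr)
matches = array_one[mask]
print(matches)  # ['wonder land']
```

Because the mask is computed on the stripped strings but applied to the original array, the result keeps the original spelling with its spaces.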

Recommended answer

Sorry to post two answers, but after adding the locality-sensitive hashing technique in my other answer, I realized you could exploit the class separation in your data (query vectors and potential matching vectors) by using a bloom filter.

A bloom filter is a beautiful object that lets you pass in some objects, then query whether a given object has been added to the filter. A query can return a false positive, at a rate you choose when creating the filter, but never a false negative.
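To make the idea concrete, here is a toy bloom filter in plain Python. It is only a sketch of the general technique, not the implementation used by any particular package; the class name TinyBloom, the bit-array size and the number of hashes are all assumptions chosen for illustration.

```python
import hashlib


class TinyBloom:
    """Toy bloom filter: a bit array plus a few hash positions per item."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive one position per hash by salting the item with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        # Set every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True means "possibly added"; False means "definitely not added".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = TinyBloom()
bloom.add("wonderland")
print("wonderland" in bloom)  # an added item is always reported as present
```

The filter never stores the strings themselves, only the bits they hash to, which is why memory use stays fixed and why unrelated items can occasionally collide and be reported as present.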

In your case we can add each member of array_two to the bloom filter, then query each member of array_one to see whether it's in the bloom filter. Using the bloom-filter package (pip install bloom-filter):

from bloom_filter import BloomFilter  # pip install bloom-filter
import numpy as np
import re

def clean(s):
  '''Clean a string'''
  return re.sub(r'\s+', '', s)

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# initialize the bloom filter for up to 10000 elements with a 10% false-positive rate
bloom = BloomFilter(max_elements=10000, error_rate=0.1)
# add each whitespace-stripped member of array_two to the bloom filter
for i in array_two:
  bloom.add(clean(i))
# find the members of array_one whose stripped form is (probably) in array_two
matches = [i for i in array_one if clean(i) in bloom]
print(matches)

Result: ['wonder land']

Depending on your requirements, this could be a very efficient (and highly scalable) solution.
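If false positives are not acceptable and array_two fits comfortably in memory, an exact lookup with a plain Python set is a simple alternative. This is a sketch of a swapped-in technique, not part of the bloom filter approach; it uses plain lists and a shortened array_one for brevity, and reuses the same whitespace-stripping helper.

```python
import re


def clean(s):
    """Remove all whitespace from a string."""
    return re.sub(r'\s+', '', s)


# Shortened sample data (an assumption for illustration).
array_one = ['alice', 'in', 'a', 'wonder', 'land', 'wonder land', 'a wonder land']
array_two = ['new york', 'las vegas', 'wonderland', 'florida']

# Exact membership test: a set of the stripped forms of array_two.
targets = {clean(s) for s in array_two}
matches = [s for s in array_one if clean(s) in targets]
print(matches)  # ['wonder land']
```

The trade-off is memory: the set stores every stripped string, whereas a bloom filter keeps only a fixed-size bit array.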
