Problem description
I would like to match a string (an n-gram) in a text and get the offsets of the match:
string_to_match = "many workers are very underpaid"
text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
As a result, I want to get a tuple like ("matched", 44, 75), where 44 is the start offset and 75 is the end offset of the match.
Here is the code I have built, but it only works for unigrams.
def extract_offsets(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append(("matched", word_offset, running_offset - 1))
    return offsets

def get_entities(offsets):
    entities = []
    for elm in offsets:
        if elm[0] == "string_to_match": # here string_to_match is only one word
            entities.append(elm)
    return entities

offsets = extract_offsets(text)
entities = get_entities(offsets) # [("matched", start, end)]
Any tips to make this work for a sequence of strings or n-grams?
Recommended answer
You can use re.finditer() and call the span() method on each match object to get the start and end indices of the matched substring:
import re

def m():
    string_to_match = "many workers are very underpaid"
    text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
    # re.escape() makes any regex metacharacters in the n-gram match literally
    matches = re.finditer(re.escape(string_to_match), text)
    for x in matches:
        # x.span() returns the (start, end) indices of the matched substring as a tuple
        print(x.group(0), x.span())
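To get the ("matched", start, end) tuples the question asks for, the same idea can be wrapped in a small helper. This is only a sketch: the function name find_ngram_offsets and its label parameter are made up for illustration, and it assumes the tokens of the n-gram may be separated by any run of whitespace in the text. Note that the end offset is exclusive, as with span() and Python slicing, whereas the code in the question stores an inclusive end (running_offset - 1).

```python
import re


def find_ngram_offsets(ngram, text, label="matched"):
    """Return a (label, start, end) tuple for every occurrence of ngram in text.

    Hypothetical helper, not part of the original answer. The end offset is
    exclusive, so text[start:end] is the matched substring.
    """
    tokens = ngram.split()
    if not tokens:
        # An empty pattern would match at every position, so return nothing.
        return []
    # Escape each token so regex metacharacters match literally, and allow
    # any run of whitespace between consecutive tokens.
    pattern = r"\s+".join(re.escape(tok) for tok in tokens)
    return [(label, m.start(), m.end()) for m in re.finditer(pattern, text)]


text = ("The new york times claimed in a report that many workers "
        "are very underpaid in some africans countries.")
print(find_ngram_offsets("many workers are very underpaid", text))
# [('matched', 44, 75)]
```

Because every token is passed through re.escape(), an n-gram containing characters such as "." or "(" is matched literally rather than being interpreted as a regular expression.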