How to get the offsets of a matched n-gram in a text

Problem description

I would like to match a string (an n-gram) in a text and get its offsets along with the match:

string_to_match = "many workers are very underpaid"
text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."

As a result, I want to get a tuple like ("matched", 44, 75), where 44 is the start offset and 75 is the end offset of the occurrence.

Here is the code I have built, but it only works for unigrams.

def extract_offsets(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append(("matched", word_offset, running_offset - 1))
    return offsets

def get_entities(offsets):
    entities = []
    for elm in offsets:
        if elm[0] == "string_to_match": # here string_to_match is only one word
            entities.append(elm)
    return entities

offsets = extract_offsets(text)
entities = get_entities(offsets) # [("matched", start, end)]

Any tips to make this work for a sequence of strings or n-grams?

Recommended answer

You can use re.finditer() and call the span() method on each match object to get the start and end indices of the matched substring:

import re

def m():
    string_to_match = "many workers are very underpaid"
    text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
    # re.escape treats the n-gram as literal text rather than as a regex pattern
    matches = re.finditer(re.escape(string_to_match), text)
    for x in matches:
        # x.span() returns the start and end indices of the matched substring as a tuple
        print(x.group(0), x.span())
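
To get the ("matched", start, end) tuples the question asks for, rather than printed output, the same finditer()/span() idea can be wrapped in a small helper. This is only a sketch: the function name find_ngram_offsets and its label parameter are made up for illustration and are not part of the original answer. Note that span() returns an exclusive end index, so text[start:end] is exactly the matched n-gram.

```python
import re

def find_ngram_offsets(text, ngrams, label="matched"):
    """Return (label, start, end) tuples for every occurrence of each n-gram.

    Hypothetical helper for illustration; `end` is exclusive, as in slicing.
    """
    results = []
    for ngram in ngrams:
        # re.escape keeps characters such as "." or "(" in the n-gram literal.
        pattern = re.compile(re.escape(ngram))
        for match in pattern.finditer(text):
            start, end = match.span()
            results.append((label, start, end))
    # Order the tuples by where they start in the text.
    return sorted(results, key=lambda item: item[1])

text = ("The new york times claimed in a report that many workers "
        "are very underpaid in some africans countries.")
offsets = find_ngram_offsets(
    text, ["many workers are very underpaid", "new york times"]
)
print(offsets)  # [('matched', 4, 18), ('matched', 44, 75)]
```

Because the n-grams are matched as literal substrings, this approach does not depend on how the text is tokenized, so it works the same for unigrams and longer n-grams. If an inclusive end offset is preferred, as in the question's extract_offsets, subtract 1 from end.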
