I have a question about this splitting function. The function takes a string, e.g. word = 'optimization', picks split points based on generated random numbers, and then splits the word into bigrams. The '0' token marks the end of the word. Consider the word below; the left-hand side is the input, and the function should return any one of the possible outputs with equal probability, with the pieces always reassembling into the same word:

'optimization' = [['op', 'ti'], ['ti', 'mizati'], ['mizati', 'on'], ['on', '0']]


Problem: when I profile all of my functions, this split function consumes the most runtime (processing 100,000 words). I have been trying to optimize it, but I now need some help. There may well be a better approach, but I am stuck in my own way of thinking.

from numpy import mod
import nltk

def random_Bigramsplitter(word):
    spw = []
    length = len(word)
    rand = random_int(word)  # produce a random number with respect to len(word)

    if rand == length:  # probability of not dividing at all
        return [(word, '0')]
    else:
        div = mod(rand, (length + 1))  # define a division point via the mod operation
        bound = length - div
        spw.append(div)  # spw collects the lengths of the pieces
        while div != 0:
            rand = random_int(word)
            div = mod(rand, (bound + 1))
            bound = bound - div
            spw.append(div)
        result = spw

    b = 0
    points = []
    for x in range(len(result) - 1):  # turn the segment lengths into absolute split points
        b += result[x]
        points.append(b)

    xy = 0
    t = []
    for i in points:
        t.append(word[xy:i])
        xy = i

    if word[xy:] != '':
        t.append(word[xy:])

    t.append('0')  # append the end-of-word marker
    c = [b for b in nltk.bigrams(t)]

    return c
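
Note: random_int is not shown in the question. For anyone who wants to run the snippet, a minimal hypothetical stand-in (assuming it returns a uniform random integer between 0 and len(word)) could be:

import random

def random_int(word):
    # Hypothetical stand-in -- the question never shows random_int.
    # Assumed behavior: uniform integer in [0, len(word)], so that
    # rand == len(word) occasionally triggers the "no split" branch.
    return random.randint(0, len(word))

print(random_Bigramsplitter('optimization'))
# e.g. [('op', 'ti'), ('ti', 'mizati'), ('mizati', 'on'), ('on', '0')]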

Best Answer

You can replace

c = [b for b in nltk.bigrams(t)]

with:
def get_ngram(tokens, n):
    # zip the sequence with its own shifted copies to form n-grams
    return zip(*[tokens[i:] for i in range(n)])

c = list(get_ngram(t, 2))


This appears to be faster. I am not claiming it is the fastest solution.
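
To see why it works: for n = 2, zip(*[t[i:] for i in range(2)]) simply zips the list with its own one-element shift, pairing every token with its successor:

t = ['op', 'ti', 'mizati', 'on', '0']
print(list(zip(t, t[1:])))
# [('op', 'ti'), ('ti', 'mizati'), ('mizati', 'on'), ('on', '0')]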

There are more answers out there for speeding up bigram generation. This looks like a good starting point: Fast n-gram calculation. My snippet comes from: http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/
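
If you want to check the speedup on your own data, a minimal timing sketch might look like the following (the token list and repeat count here are arbitrary):

import timeit
import nltk

t = ['op', 'ti', 'mizati', 'on', '0']

def get_ngram(tokens, n):
    return zip(*[tokens[i:] for i in range(n)])

# compare nltk.bigrams against the zip-based version
print(timeit.timeit(lambda: list(nltk.bigrams(t)), number=100000))
print(timeit.timeit(lambda: list(get_ngram(t, 2)), number=100000))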

Regarding python - Optimization of a splitting algorithm with bigram output in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32906936/
