I have a question about this splitting function. It basically takes a string, e.g. word = 'optimization', determines split points from generated random numbers, and then splits the word into bigrams. The '0' token marks the end of the word. Consider the word below; the left side is the input, and the function should produce each of the possible outputs with equal probability, every one of which reassembles into the same word:
'optimization' = [['op', 'ti'], ['ti', 'mizati'], ['mizati', 'on'], ['on', '0']]
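For concreteness, the bigram-pairing step for the sample split above can be reproduced directly (a minimal sketch; only the split points themselves are random):
import nltk

parts = ['op', 'ti', 'mizati', 'on', '0']  # one possible random split of 'optimization'
print(list(nltk.bigrams(parts)))
# [('op', 'ti'), ('ti', 'mizati'), ('mizati', 'on'), ('on', '0')]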
The problem: when I profile all my functions (processing 100,000 words), this splitting function consumes the largest share of the runtime, and I have not managed to optimize it any further, so I could use some help now (see the profiling sketch after the code below). There may also be a better approach altogether, but I am stuck inside my own point of view.
from numpy import mod
import nltk
def random_Bigramsplitter(word):
    # random_int(word) is assumed to be defined elsewhere in the project;
    # it produces a random integer with respect to len(word)
    spw = []
    length = len(word)
    rand = random_int(word)
    if rand == length:  # probability of not dividing at all
        return [tuple([word, '0'])]
    else:
        div = mod(rand, (length + 1))  # define division points by a mod operation
        bound = length - div
        spw.append(div)
        while div != 0:  # keep drawing segment lengths; a zero draw ends the splitting
            rand = random_int(word)
            div = mod(rand, (bound + 1))
            bound = bound - div
            spw.append(div)
        result = spw
        b = 0
        points = []
        for x in range(len(result) - 1):  # convert segment lengths to absolute split points
            b += result[x]
            points.append(b)
        xy = 0
        t = []
        for i in points:  # cut the word at the split points
            t.append(word[xy:i])
            xy = i
        if word[xy:len(word)] != '':  # keep the trailing segment, if any
            t.append(word[xy:len(word)])
        t.extend('0')  # end-of-word marker ('0' is a 1-char string, so extend appends it)
        c = [b for b in nltk.bigrams(t)]  # pair consecutive segments into bigrams
        return c
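To reproduce the measurement, a minimal profiling sketch using the standard-library cProfile might look like this, assuming random_Bigramsplitter (and its random_int helper) are defined at module level; the word list here is a hypothetical stand-in for the real 100,000-word input:
import cProfile

words = ['optimization', 'profiling', 'bigram'] * 1000  # hypothetical stand-in data
cProfile.run('for w in words: random_Bigramsplitter(w)', sort='cumulative')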
Best answer
You can replace
c = [b for b in nltk.bigrams(t)]
with
def get_ngram(word, n):
    return zip(*[word[i:] for i in xrange(n)])

c = [b for b in get_ngram(t, 2)]
This seems faster. I am not claiming it is the fastest solution.
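For example, applied to the split list from the question (note that xrange is Python 2; on Python 3 use range, and zip already returns an iterator):
t = ['op', 'ti', 'mizati', 'on', '0']
print(list(get_ngram(t, 2)))
# [('op', 'ti'), ('ti', 'mizati'), ('mizati', 'on'), ('on', '0')]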
There are more answers out there for speeding up your bigram step. This looks like a good starting point: Fast n-gram calculation. My snippet comes from http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/
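If you want to measure the difference on your own machine, a minimal timeit sketch comparing the two variants could look like this (timings vary by machine, so no numbers are claimed here):
import timeit

setup = "t = ['op', 'ti', 'mizati', 'on', '0']"
print(timeit.timeit("list(nltk.bigrams(t))", "import nltk; " + setup, number=100000))
print(timeit.timeit("list(zip(t, t[1:]))", setup, number=100000))  # zip-based bigrams (n=2)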
Related question on Stack Overflow: python - Optimization of a splitting algorithm with bigram output in Python: https://stackoverflow.com/questions/32906936/