Problem description
There are masses of text information in people's comments, which are parsed from HTML, but there are no delimiting characters in them. For example: thumbgreenappleactiveassignmentweeklymetaphor. Apparently, there are 'thumb', 'green', 'apple', and so on in the string. I also have a large dictionary to query whether a word is reasonable. So, what is the fastest way to extract these words?
As eumiro points out, I'm not really sure a naive algorithm would serve your purpose well, so I'll describe a slightly more complex one.
The idea
The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.
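To get a feel for the cost scale this model produces, here is a tiny illustration (the dictionary size of 125k matches the word list used in the Examples section below; the numbers are only approximate):
import math
N = 125000                               # assumed dictionary size, as in the example below
for rank in (1, 1000, 125000):
    cost = math.log(rank * math.log(N))  # cost = -log(1 / (rank * log N))
    print(rank, round(cost, 2))
# prints roughly: 1 2.46, 1000 9.37, 125000 14.2
Every word costs at least about log(log N) ≈ 2.5, so each extra split adds a noticeable penalty, while substrings missing from the dictionary are effectively forbidden by the 9e999 fallback used in the code below.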
Once you have fixed the model, you can use dynamic programming to infer the positions of the spaces. The most likely sentence is the one that maximizes the product of the probabilities of the individual words, and it is easy to compute with dynamic programming. Instead of using the probability directly, we use a cost defined as the logarithm of the inverse of the probability, to avoid overflows.
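Concretely, if cost[i] denotes the cost of the best segmentation of the first i characters (with cost[0] = 0), the recurrence solved below is, roughly:
cost[i] = min over 1 <= k <= maxword of ( cost[i-k] + wordcost[s[i-k:i]] )
where substrings not in the dictionary get a prohibitively large cost, and the k chosen at each position is what the backtracking step later recovers.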
The code
import math
# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k,math.log((i+1)*math.log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
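# Note: "words-by-frequency.txt" is assumed to be a plain whitespace-separated
# word list, sorted from most frequent to least frequent, along the lines of:
#   the
#   of
#   and
#   ...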
def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i > 0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))
which you can use with:
s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))
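With a reasonable frequency-ordered word list (such as the one mentioned in the Examples section below), this should print something close to:
thumb green apple active assignment weekly metaphor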
Examples
I am using this quick-and-dirty 125k-word dictionary that I put together from a small subset of Wikipedia.
As you can see, it is essentially flawless. The most important part is to make sure your word list was trained on a corpus similar to what you will actually encounter, otherwise the results will be very bad.
Optimization
The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the candidate set.
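No code for that optimization is given in the answer; as a rough illustration of the idea, the sketch below uses a plain dict-based trie over the reversed words (a simpler relative of the suggested suffix tree) so that, when scanning backwards from position i, only substrings that are actually dictionary words are considered. build_reversed_trie and candidate_lengths are hypothetical helpers, not part of the original code.
def build_reversed_trie(words):
    """Build a nested-dict trie over the reversed words."""
    root = {}
    for w in words:
        node = root
        for ch in reversed(w):
            node = node.setdefault(ch, {})
        node[None] = True            # marks the end of a (reversed) word
    return root

def candidate_lengths(trie, s, i):
    """Yield every k such that s[i-k:i] is a dictionary word."""
    node = trie
    for k in range(1, i + 1):
        ch = s[i - k]
        if ch not in node:
            return
        node = node[ch]
        if None in node:
            yield k
best_match could then iterate over candidate_lengths(trie, s, i), looking up each word's cost in wordcost, instead of trying every length up to maxword, which shrinks the candidate set considerably when few substrings ending at i are real words.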
If you need to process a very large consecutive string, it would be reasonable to split the string to avoid excessive memory usage. For example, you could process the text in blocks of 10,000 characters plus a margin of 1,000 characters on either side to avoid boundary effects. This keeps memory usage to a minimum and will almost certainly have no effect on quality.
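As a sketch of that chunking strategy (the block and margin sizes come from the paragraph above, but infer_spaces_chunked is a hypothetical helper; it keeps a lookahead margin only on the right, since the left edge of each block always restarts at a known word boundary):
def infer_spaces_chunked(s, block=10000, margin=1000):
    """Segment a very long string block by block, with lookahead context."""
    out = []
    start = 0
    while start < len(s):
        window = s[start:start + block + margin]
        window_words = infer_spaces(window).split()
        if start + block + margin >= len(s):
            out.extend(window_words)     # last window: keep everything
            break
        kept = 0
        for w in window_words:
            if kept + len(w) > block:
                break                    # this word may depend on unseen context
            out.append(w)
            kept += len(w)
        if kept == 0:                    # degenerate case: one "word" fills the block
            out.append(window_words[0])
            kept = len(window_words[0])
        start += kept
    return " ".join(out)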