Problem description
I accidentally answered a question where the original problem involved splitting a sentence into separate words.
The author suggested using BreakIterator to tokenize input strings, and some people liked this idea.
I just don't get that madness: how can 25 lines of complicated code be better than a simple one-liner with a regexp?
Please explain the pros of using BreakIterator and the real cases where it should be used.
If it's really so cool and proper, then I wonder: do you really use the BreakIterator approach in your projects?
Recommended answer
From looking at the code posted in that answer, it looks like BreakIterator takes into consideration the language and locale of the text. Getting that level of support via regex would surely be a considerable pain. Perhaps that is the main reason it is preferred over a simple regex?
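To make the comparison concrete, here is a minimal sketch of both approaches side by side. The class and method names (WordSplitDemo, wordsWithBreakIterator, wordsWithRegex) are made up for illustration and are not taken from the original answer. The regex variant assumes that splitting on "\\W+" is the "simple one-liner" the question has in mind.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordSplitDemo {

    // Locale-aware tokenization: walk the word boundaries reported by BreakIterator.
    static List<String> wordsWithBreakIterator(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // The iterator also yields whitespace and punctuation segments;
            // keep only segments that contain at least one letter or digit.
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(token);
            }
        }
        return words;
    }

    // The regex approach: split on runs of non-word characters.
    // By default \W is ASCII-only, so this is only an approximation for non-English text.
    static List<String> wordsWithRegex(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty()) { // split() can produce a leading empty string
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "Hello, world! Isn't this fine?";
        System.out.println(wordsWithBreakIterator(text, Locale.ENGLISH));
        System.out.println(wordsWithRegex(text));
    }
}
```

For plain ASCII text the two usually agree. The differences tend to show up with apostrophes, accented letters, and scripts that do not separate words with spaces, which is where the extra lines of BreakIterator code start to pay off.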