Question
I need a very efficient way to parse large amounts of text (GBs) on
word boundaries. Words will then be added to an array as long as they
haven't already been added. Splitting on a space is a bit too basic
since punctuation will remain. Maybe regex?
Thanks for any insights.
Jim
Recommended answer
You've got a few choices. Regex.Split can do what you want; just
split on [ ,.!?;:]. You could also define a regex for your words and
use Regex.Matches().
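To make the "define a regex for your words" idea concrete, here is a minimal sketch. It is in Python rather than .NET, so re.finditer stands in for Regex.Matches(). The word pattern, the lowercasing, and the name unique_words are my own assumptions for illustration, not something from the question. It reads the input one line at a time so that a multi-gigabyte file never has to sit in memory, and it uses a set for the "already added" check so each lookup is a hash probe rather than a scan of the array.

```python
import re
from typing import Iterable, List

# Assumed definition of a "word": ASCII letters/digits, with optional
# inner apostrophes so that "don't" stays in one piece.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")


def unique_words(lines: Iterable[str]) -> List[str]:
    """Return each distinct word once, in order of first appearance."""
    seen = set()   # fast membership test for words already added
    ordered = []   # the resulting "array" of unique words
    for line in lines:
        for match in WORD_RE.finditer(line):
            word = match.group(0).lower()  # assumption: case-insensitive
            if word not in seen:
                seen.add(word)
                ordered.append(word)
    return ordered
```

Any iterable of strings works as input, so an open file object can be passed directly and will be consumed line by line. Whether this is fast enough for your data is something to measure, not assume.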
The other option is to write a lexical analyzer (lexer). There might
be some .NET equivalents of the old reliable Lex and Flex. I'm not sure
they'd be faster in this case, and they seem like massive overkill to me.
Or, if you're really insane, you can hand-write a lexical analyzer. :-)
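For what it's worth, a hand-written scanner for this particular job can be tiny. The sketch below (again Python, with a hypothetical function name scan_words) walks the text one character at a time and treats any run of alphanumeric characters as a word. That definition is an assumption on my part; note that it splits "don't" into "don" and "t", so adjust the character test if that matters to you.

```python
from typing import Iterator


def scan_words(text: str) -> Iterator[str]:
    """Yield maximal runs of alphanumeric characters from text."""
    start = None  # index where the current word began, None between words
    for i, ch in enumerate(text):
        if ch.isalnum():
            if start is None:
                start = i            # entering a word
        elif start is not None:
            yield text[start:i]      # leaving a word: emit it
            start = None
    if start is not None:            # flush a word that runs to the end
        yield text[start:]
```

I haven't benchmarked this against the regex version, so treat any speed difference as an open question until you measure it on your own data.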