Problem description
Hi,
I have a large text string and a bunch of regular expression patterns I need to find within it; in other words, I want to find all the "tokens" inside it. Of course I could use the regex engine, but the thing is, if I have 500 different patterns, that means 500 searches over the buffer. Does anybody have any general idea of how this could be sped up? In other words, I am looking to do what the old "LEX" tool on Unix used to do: one pass over the string, finding all the patterns and turning them into tokens... What I need is not THAT complex, so I am wondering if the .NET regex engine can search for many patterns in a single pass.
And if the answer is no, is there any lex implementation in .NET? :)
Thanks
Jonathan
Recommended answer
I don't know how lex was implemented, and I don't know whether a state machine is the best way to solve the problem. But I do know that it's a reasonable way to solve it, and I wrote a simple implementation a while ago and posted it here. You can see it in this post:
http://groups.google.com/group/micro...06f696d4500b77
I see some things in that implementation that I'd probably do differently if I were doing it again today, but it ought to work, or at least something like it should.
For what it's worth, just how large is this "large text string"? And how frequently do you need to do this? If this is something your code needs to do over and over on a frequent basis, optimizing the implementation would be useful. But I'd guess that 500 searches on even a 100K string or so wouldn't take that long if you only need to do it once.
There's some value in using the brute-force method, as it keeps the code a _lot_ simpler. I wouldn't worry about the performance unless you have a good reason to believe it will be a problem.
Hope that helps.
Pete
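For what the brute-force approach looks like: a minimal sketch (in Python rather than .NET, and with a hypothetical three-pattern table standing in for the 500 patterns the question mentions) that runs each pattern over the buffer separately and then sorts the hits by position:

```python
import re

# Hypothetical token patterns; the original post mentions ~500 of these.
PATTERNS = {
    "NUMBER": r"\d+",
    "WORD":   r"[A-Za-z]+",
    "PUNCT":  r"[.,;!?]",
}

def brute_force_tokens(text):
    """One full scan of the buffer per pattern (the 500-searches approach)."""
    hits = []
    for name, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            hits.append((m.start(), name, m.group()))
    hits.sort()  # order tokens by where they occur in the buffer
    return hits

print(brute_force_tokens("abc 42, def"))
```

The same loop translates directly to .NET's `Regex.Matches`; the point is simply that the code stays trivial, at the cost of scanning the buffer once per pattern.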
In addition, how long can a Regex pattern string be? Assuming there's no practical limit -- that is, you can put any arbitrary string there -- then since Regex supports a boolean "or" in the search pattern, you could just build a single pattern string with all of the tokens in it.
So rather than looping with multiple searches using Regex, just loop over the tokens when creating the search string, then let Regex do all the hard work.
Would this be as fast as or faster than the state graph? I don't know... it depends on whether the Regex authors put some effort into optimizing that case. I don't know enough about Regex (implementation _or_ API :) ) to have an answer to that. But even if they didn't, obviously Sam and I agree that the simpler code is better as long as there's no direct evidence that performance is actually going to be an issue.
Pete
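The single-pattern idea above can be sketched as follows (again in Python with the same hypothetical pattern table; .NET's Regex supports the equivalent `(?<name>...)` named groups and alternation). Each pattern becomes a named alternative, so one `finditer` pass over the buffer tags every match with the pattern that produced it:

```python
import re

# Hypothetical token patterns, joined with "|" into one alternation so the
# buffer is scanned only once.
PATTERNS = {
    "NUMBER": r"\d+",
    "WORD":   r"[A-Za-z]+",
    "PUNCT":  r"[.,;!?]",
}
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS.items())
)

def tokenize(text):
    """Single pass; m.lastgroup names whichever alternative matched."""
    return [(m.lastgroup, m.group()) for m in COMBINED.finditer(text)]

print(tokenize("abc 42, def"))
```

One caveat worth knowing: when alternatives overlap, the order of the alternation matters (regex engines try alternatives left to right at each position), so longer or more specific token patterns should generally come first.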