问题描述
我试图找出在字典(或哈希,地图或任何你最喜欢的语言调用它)中执行查找是否合理有效的方式,其中键是正则表达式和字符串被查看反对一组键。例如(在Python语法中): >>> regex_dict = {re.compile(r'foo。'):12,re.compile(r'^ FileN。* $'):35}
>>> regex_dict ['food']
12
>>> regex_dict ['foot in my mouth']
12
>>> regex_dict ['FileNotFoundException:file.x不存在']
35
(显然,上面的例子将不会像Python这样写,但这是我想要做的事情。)
我可以想到一个天真的方式实现这一点,我在其中迭代字典中的所有键,并尝试将传入的字符串与它们匹配,但是丢失了散列映射的O(1)查找时间,而是丢弃O(n)其中n是我的字典中的键数。这可能是一个很大的事情,因为我希望这个字典变得非常大,我将需要一遍又一遍的搜索(实际上我需要迭代它在文本文件中读取的每一行,而文件可以是数百兆字节的大小)
有没有办法完成这个,而不诉诸O(n)效率?
或者,如果你知道在数据库中完成这种查找的方法,那也是很棒的。
(Any编程语言很好 - 我使用Python,但我对这里的数据结构和算法更感兴趣。)
有人指出,是可能的,这是绝对正确的。理想情况下,我想返回一个包含所有匹配的列表或元组。不过,我会解决第一场比赛。
我在这种情况下看不到O(1)可能;不过,我会为O(n)提供任何东西。另外,底层的数据结构可能是任何东西,但是我想要的基本行为就是我上面写的:查找一个字符串,并返回与正则表达式键匹配的值。
你想做的是非常类似于xrdb支持的。他们只支持一个相当微乎其微的概念,但是。
在内部你可以通过将正则表达式存储为字符特技来实现比他们更大的常规语言系列。
- 单个字符只是成为trie节点。
- 。成为通配符插入,覆盖当前线索节点的所有子节点。
- *在上一个项目开始时成为返回链接到节点的链接。
- [a-z]范围在范围内的每个字符下重复插入相同的后续子节点。小心,插入/更新可能有些昂贵,搜索可以是字符串大小的线性。有些占位符的东西可以控制常见的组合爆炸案例。
- (foo)|(bar)节点成为多个插入
在字符串中的任意点发生的正则表达式,但是可以通过在.b边上包装正则表达式来建模。
Perl有几个Text :: Trie你可以通过这个模块来搜索想法。 (哎哟,我想我甚至写了一个回来的时候)
I'm trying to figure out if there's a reasonably efficient way to perform a lookup in a dictionary (or a hash, or a map, or whatever your favorite language calls it) where the keys are regular expressions and strings are looked up against the set of keys. For example (in Python syntax):
>>> regex_dict = { re.compile(r'foo.') : 12, re.compile(r'^FileN.*$') : 35 }
>>> regex_dict['food']
12
>>> regex_dict['foot in my mouth']
12
>>> regex_dict['FileNotFoundException: file.x does not exist']
35
(Obviously the above example won't work as written in Python, but that's the sort of thing I'd like to be able to do.)
I can think of a naive way to implement this, in which I iterate over all of the keys in the dictionary and try to match the passed in string against them, but then I lose the O(1) lookup time of a hash map and instead have O(n), where n is the number of keys in my dictionary. This is potentially a big deal, as I expect this dictionary to grow very large, and I will need to search it over and over again (actually I'll need to iterate over it for every line I read in a text file, and the files can be hundreds of megabytes in size).
Is there a way to accomplish this, without resorting to O(n) efficiency?
Alternatively, if you know of a way to accomplish this sort of a lookup in a database, that would be great, too.
(Any programming language is fine -- I'm using Python, but I'm more interested in the data structures and algorithms here.)
Someone pointed out that more than one match is possible, and that's absolutely correct. Ideally in this situation I'd like to return a list or tuple containing all of the matches. I'd settle for the first match, though.
I can't see O(1) being possible in that scenario; I'd settle for anything less than O(n), though. Also, the underlying data structure could be anything, but the basic behavior I'd like is what I've written above: lookup a string, and return the value(s) that match the regular expression keys.
What you want to do is very similar to what is supported by xrdb. They only support a fairly minimal notion of globbing however.
Internally you can implement a larger family of regular languages than theirs by storing your regular expressions as a character trie.
- single characters just become trie nodes.
- .'s become wildcard insertions covering all children of the current trie node.
- *'s become back links in the trie to node at the start of the previous item.
- [a-z] ranges insert the same subsequent child nodes repeatedly under each of the characters in the range. With care, while inserts/updates may be somewhat expensive the search can be linear in the size of the string. With some placeholder stuff the common combinatorial explosion cases can be kept under control.
- (foo)|(bar) nodes become multiple insertions
This doesn't handle regexes that occur at arbitrary points in the string, but that can be modeled by wrapping your regex with .* on either side.
Perl has a couple of Text::Trie -like modules you can raid for ideas. (Heck I think I even wrote one of them way back when)
这篇关于Hashtable /字典/地图查找与正则表达式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!