问题描述
Suppose you have a list of acronym's that define a value (ex. AB1,DE2,CC3) and you need to check a string value (ex. "Happy:DE2|234") to see if an acronym is found in the string. For a short list of acronym's I would usually create a simple RegEx that used a separator (ex. (AB1|DE2|CC3) ) and just look for a match.
But how would I tackle this if there are over 30 acronym's to match against? Would it make sense to use the same technique (ugly) or is there a more effecient and elegant way to accomplish this task?
Keep in mind the example acronym list and example string is not the actual data format that I am working with, rather just a way to express my challenge.
BTW, I read a SO related question but didn't think it applied to what I was trying to accomplish.
EDIT: I forgot to include my need to capture the matched value, hence the choice to use Regular Expressions...
Personally I don't think 30 is particularly large for a regex so I wouldn't be too quick to rule it out. You can create the regex with a single line of code:
var acronyms = new[] { "AB", "BC", "CD", "ZZAB" };
var regex = new Regex(string.Join("|", acronyms), RegexOptions.Compiled);
for (var match = regex.Match("ZZZABCDZZZ"); match.Success; match = match.NextMatch())
Console.WriteLine(match.Value);
// returns AB and CD
So the code is relatively elegant and maintainable. If you know the upper bound for the number of acronyms I would to some testing, who knows what kind of optimizations there are already built into the regex engine. You'll also be able to benefit for free from future regex engine optimizations. Unless you have reason to believe performance will be an issue keep it simple.
On the other hand regex may have other limitations e.g. by default if you have acronyms AB, BC and CD then it'll only return two of these as a match in "ABCD". So its good at telling you there is an acronym but you need to be careful about catching multiple matches.
When performance became an issue for me (> 10,000 items) I put the 'acronyms' in a HashSet and then searched each substring of the text (from min acronym length to max acronym length). This was ok for me because the source text was very short. I'd not heard of it before, but at first look the Aho-Corasick algorithm, referred to in the question you reference, seems like a better general solution to this problem.
这篇关于针对大量可比对象测试现有字符串的最佳方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!