Problem description
I made a comment yesterday on an answer where someone had used [0123456789] in a regex rather than [0-9] or \d. I said it was probably more efficient to use a range or digit specifier than a character set.
I decided to test that out today and found, to my surprise, that (in the C# regex engine at least) \d appears to be less efficient than either of the other two, which don't seem to differ much from each other. Here is my test output over 10000 random strings of 1000 random characters, with 5077 actually containing a digit:
Regex \d took 00:00:00.2141226 result: 5077/10000
Regex [0-9] took 00:00:00.1357972 result: 5077/10000 63.42 % of first
Regex [0123456789] took 00:00:00.1388997 result: 5077/10000 64.87 % of first
This surprises me for two reasons, and I would be interested if anyone can shed some light on them:
- I would have thought the range would be implemented much more efficiently than the set.
- I can't understand why \d is worse than [0-9]. Is there more to \d than simply shorthand for [0-9]?
Here is the test code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            var rand = new Random(1234);
            var strings = new List<string>();
            //10K random strings
            for (var i = 0; i < 10000; i++)
            {
                //generate random string
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                {
                    //add a-z randomly
                    sb.Append((char)('a' + rand.Next(26)));
                }
                //in roughly 50% of them, put a digit
                if (rand.Next(2) == 0)
                {
                    //replace 1 char with a digit 0-9
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                }
                strings.Add(sb.ToString());
            }

            var baseTime = testPerfomance(strings, @"\d");
            Console.WriteLine();
            var testTime = testPerfomance(strings, "[0-9]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            testTime = testPerfomance(strings, "[0123456789]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
        }

        private static TimeSpan testPerfomance(List<string> strings, string regex)
        {
            var sw = new Stopwatch();

            int successes = 0;

            var rex = new Regex(regex);

            sw.Start();
            foreach (var str in strings)
            {
                if (rex.Match(str).Success)
                {
                    successes++;
                }
            }
            sw.Stop();

            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

            return sw.Elapsed;
        }
    }
}
Recommended answer
\d checks all Unicode digits, while [0-9] is limited to these 10 characters. For example, Persian digits, ۱۲۳۴۵۶۷۸۹, are Unicode digits that are matched by \d but not by [0-9].
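As a minimal sketch of that difference (the class name UnicodeDigitDemo and the helper Matches are made up for illustration), the snippet below checks a single Persian digit against both patterns. It also tries RegexOptions.ECMAScript, which the .NET documentation describes as making \d equivalent to [0-9]. Whether that option changes the timings in the question is not something this sketch measures.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnicodeDigitDemo
{
    // Hypothetical helper: true if the pattern matches anywhere in the input.
    public static bool Matches(string input, string pattern, RegexOptions options = RegexOptions.None)
    {
        return Regex.IsMatch(input, pattern, options);
    }

    static void Main()
    {
        // U+06F4, EXTENDED ARABIC-INDIC DIGIT FOUR (the Persian digit 4).
        string persianFour = "\u06F4";

        Console.WriteLine(Matches(persianFour, @"\d"));                          // True
        Console.WriteLine(Matches(persianFour, "[0-9]"));                        // False
        Console.WriteLine(Matches(persianFour, @"\d", RegexOptions.ECMAScript)); // False
        Console.WriteLine(Matches("7", @"\d", RegexOptions.ECMAScript));         // True
    }
}
```

If only ASCII digits are wanted, writing [0-9] explicitly states that intent regardless of which options the regex is constructed with.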
You can generate a list of all such characters using the following code:
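using System;
using System.Text;
using System.Text.RegularExpressions;

var sb = new StringBuilder();
// Walk every UTF-16 code unit; an int counter can reach 0xFFFF without overflowing.
for (int i = 0; i <= char.MaxValue; i++)
{
    string str = ((char)i).ToString();
    if (Regex.IsMatch(str, @"\d"))
        sb.Append(str);
}
Console.WriteLine(sb.ToString());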
Which generates: