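Yesterday I commented on an answer in which someone had used [0123456789] in a regular expression rather than [0-9] or \d. I said that a range or the digit specifier was probably more efficient than an explicit character set.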

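I decided to test that today and found, to my surprise, that (in the C# regex engine at least) \d appears to be less efficient than either of the other two, which do not seem to differ much from each other. Here is my test output over 10000 random strings of 1000 random characters each, 5077 of which actually contain a digit: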

Regex \d           took 00:00:00.2141226 result: 5077/10000
Regex [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
Regex [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first

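This surprises me for two reasons:
  • I would have thought the range would be implemented much more efficiently than the set.
  • I can't understand why \d is worse than [0-9]. Is there more to \d than simply being shorthand for [0-9]?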

Here is the test code:
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Diagnostics;
    using System.Text.RegularExpressions;
    
    namespace SO_RegexPerformance
    {
        class Program
        {
            static void Main(string[] args)
            {
                var rand = new Random(1234);
                var strings = new List<string>();
                //10K random strings
                for (var i = 0; i < 10000; i++)
                {
                    //Generate random string
                    var sb = new StringBuilder();
                    for (var c = 0; c < 1000; c++)
                    {
                        //Add a-z randomly
                        sb.Append((char)('a' + rand.Next(26)));
                    }
                    //In roughly 50% of them, put a digit
                    if (rand.Next(2) == 0)
                    {
                        //Replace one character with a digit, 0-9
                        sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                    }
                    strings.Add(sb.ToString());
                }
    
                var baseTime = testPerfomance(strings, @"\d");
                Console.WriteLine();
                var testTime = testPerfomance(strings, "[0-9]");
                Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
                testTime = testPerfomance(strings, "[0123456789]");
                Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            }
    
            private static TimeSpan testPerfomance(List<string> strings, string regex)
            {
                var sw = new Stopwatch();
    
                int successes = 0;
    
                var rex = new Regex(regex);
    
                sw.Start();
                foreach (var str in strings)
                {
                    if (rex.Match(str).Success)
                    {
                        successes++;
                    }
                }
                sw.Stop();
    
                Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);
    
                return sw.Elapsed;
            }
        }
    }
    

Best answer

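\d checks all Unicode digits, while [0-9] is limited to these 10 characters. For example, the Persian digits ۱۲۳۴۵۶۷۸۹ are Unicode digits that are matched by \d but not by [0-9]. Because \d has to test against this much larger set, it is not simply shorthand for [0-9].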
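The difference can be seen directly with a single Persian digit. The sketch below is illustrative only: the class and method names (DigitClassDemo, Matches) are made up for this example. It also shows RegexOptions.ECMAScript, under which .NET documents \d as equivalent to [0-9]; whether that option also changes the timings measured in the question is not something this sketch claims.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original question or answer.
public static class DigitClassDemo
{
    // True when the pattern finds a match anywhere in the input.
    public static bool Matches(string input, string pattern,
                               RegexOptions options = RegexOptions.None)
        => Regex.IsMatch(input, pattern, options);

    public static void Main()
    {
        const string persianOne = "\u06F1"; // EXTENDED ARABIC-INDIC DIGIT ONE

        // Default behavior: \d covers every Unicode decimal digit.
        Console.WriteLine(Matches(persianOne, @"\d"));                          // True
        // The explicit range only covers the ten ASCII digits.
        Console.WriteLine(Matches(persianOne, "[0-9]"));                        // False
        // With ECMAScript-compliant behavior, \d is restricted to [0-9].
        Console.WriteLine(Matches(persianOne, @"\d", RegexOptions.ECMAScript)); // False
        Console.WriteLine(Matches("7", @"\d", RegexOptions.ECMAScript));        // True
    }
}
```

If only ASCII digits should ever match, writing [0-9] explicitly states that intent in the pattern itself, without relying on an option set elsewhere.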

You can generate a list of all such characters using the following code:

    var sb = new StringBuilder();
    for(UInt16 i = 0; i < UInt16.MaxValue; i++)
    {
        string str = Convert.ToChar(i).ToString();
        if (Regex.IsMatch(str, @"\d"))
            sb.Append(str);
    }
    Console.WriteLine(sb.ToString());
    

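Which generates a list of every decimal digit in the Basic Multilingual Plane: the ASCII digits 0-9 followed by digits from other scripts, such as Arabic-Indic (٠١٢٣٤٥٦٧٨٩) and Persian (۰۱۲۳۴۵۶۷۸۹).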

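The same list can be cross-checked against the Unicode "decimal digit number" (Nd) category, which is what \d corresponds to in .NET. The following is a minimal sketch; UnicodeDigitList and Collect are hypothetical names introduced for illustration, and the exact characters and counts depend on the Unicode data shipped with the runtime, so no output is shown.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class UnicodeDigitList
{
    // Collects every UTF-16 code unit (U+0000..U+FFFF) accepted by the predicate.
    public static string Collect(Func<char, bool> isDigit)
    {
        return new string(Enumerable.Range(0, 0x10000)
            .Select(i => (char)i)
            .Where(isDigit)
            .ToArray());
    }

    public static void Main()
    {
        var digitRegex = new Regex(@"\d");

        // Characters matched by the regex digit class.
        string viaRegex = Collect(c => digitRegex.IsMatch(c.ToString()));
        // Characters whose Unicode general category is Nd.
        string viaCategory = Collect(
            c => char.GetUnicodeCategory(c) == UnicodeCategory.DecimalDigitNumber);

        Console.WriteLine("\\d matches {0} characters", viaRegex.Length);
        Console.WriteLine("Nd category has {0} characters", viaCategory.Length);
        Console.WriteLine("Same set: {0}", viaRegex == viaCategory);
    }
}
```

Unlike the loop above, this version also visits U+FFFF, which is not a digit, so the resulting list should be unaffected.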