


(I'm just learning how to write a compiler, so please correct me if I make any incorrect claims)


Why would anyone still implement DFAs in code (goto statements, table-driven implementations) when they can simply use regular expressions? As far as I understand, lexical analyzers take in a string of characters and churn out a list of tokens which, in the language's grammar definition, are terminals, so each of them can be described by a regular expression. Wouldn't it be easier to just loop over a bunch of regexes, breaking out of the loop as soon as one matches?



You're absolutely right that it's easier to write regular expressions than DFAs. However, a good question to think about is how those regex matchers work internally.


Most very fast implementations of regex matchers work by compiling down to some type of automaton (either an NFA or a minimum-state DFA) internally. If you wanted to build a scanner that worked by using regexes to describe which tokens to match and then looping through all of them, you could absolutely do so, but internally they'd probably compile to DFAs.
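To make the "loop over a bunch of regexes" idea concrete, here's a minimal sketch in Python. The token names and patterns (`TOKEN_RULES`, `NUMBER`, `IDENT`, and so on) are made up for illustration and aren't from any particular language. Note also that Python's `re` module is a backtracking matcher rather than a DFA-based one, so this shows the shape of the approach, not the speed you'd get from a generated scanner.

```python
import re

# Hypothetical token rules, tried in order; the first rule that matches wins.
TOKEN_RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OP", re.compile(r"[+\-*/=()]")),
    ("SKIP", re.compile(r"\s+")),
]

def tokenize(text):
    """Scan text left to right, trying every regex at the current position."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_RULES:
            m = pattern.match(text, pos)
            if m:
                if name != "SKIP":  # whitespace is consumed but not reported
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # No rule matched at this position.
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens
```

Calling `tokenize("x = 42 + y1")` yields the tokens `x`, `=`, `42`, `+`, `y1` with their kinds. Each position may be tested against every pattern in turn, which is the repeated work a single combined automaton avoids.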


It's extremely rare to see anyone actually code up a DFA for doing scanning or parsing because it's just so complicated. This is why there are tools like lex or flex, which let you specify the regexes to match and then automatically compile down to DFAs behind the scenes. That way, you get the best of both worlds - you describe what to match using the nicer framework for regexes, but you get the speed and efficiency of DFAs behind the scenes.
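For a sense of what coding a DFA by hand involves, here's a small table-driven sketch in Python for the single regex `[0-9]+(\.[0-9]+)?`. The state names, the `char_class` helper, and the `longest_match` function are all hypothetical names chosen for this example. This is the kind of table a tool like lex or flex would generate for you.

```python
def char_class(ch):
    """Collapse characters into the few classes this DFA cares about."""
    if "0" <= ch <= "9":
        return "digit"
    if ch == ".":
        return "dot"
    return "other"

# Transition table for [0-9]+(\.[0-9]+)?; missing entries mean "stuck".
TRANSITIONS = {
    ("start", "digit"): "int",
    ("int", "digit"): "int",
    ("int", "dot"): "dot",
    ("dot", "digit"): "frac",
    ("frac", "digit"): "frac",
}
ACCEPTING = {"int", "frac"}

def longest_match(text, pos=0):
    """Return the end index of the longest match starting at pos, or None."""
    state = "start"
    last_accept = None
    i = pos
    while i < len(text):
        state = TRANSITIONS.get((state, char_class(text[i])))
        if state is None:
            break
        i += 1
        if state in ACCEPTING:
            last_accept = i  # remember the most recent accepting position
    return last_accept
```

Even for one tiny pattern you need a character classifier, a transition table, and bookkeeping for the last accepting position. Scaling this by hand to a full set of tokens is where it gets painful.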


One more important detail about building a giant DFA is that it is possible to build a single DFA that tries matching multiple different regular expressions in parallel. This increases efficiency, since it's possible to run the matching DFA over the string in a way that will concurrently search for all possible regex matches.
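Here's a rough Python sketch of that idea: one hand-built DFA that tracks three hypothetical token patterns at once (the keyword `if`, identifiers, and integers) and reports which one matched when it stops. The state names, the `step` and `scan` functions, and the token kinds are assumptions made up for this example. A real generator would derive the combined automaton from the regexes automatically.

```python
def step(state, ch):
    """Transition function of the combined DFA; returns None when stuck."""
    is_letter = ch.isalpha() or ch == "_"
    is_digit = "0" <= ch <= "9"
    if state == "start":
        if ch == "i":
            return "i"        # could still become the keyword or an identifier
        if is_letter:
            return "ident"
        if is_digit:
            return "num"
    elif state == "i":
        if ch == "f":
            return "if"
        if is_letter or is_digit:
            return "ident"
    elif state in ("if", "ident"):
        if is_letter or is_digit:
            return "ident"    # "iffy" stops being a keyword here
    elif state == "num":
        if is_digit:
            return "num"
    return None

# Each accepting state records which pattern it corresponds to.
ACCEPT = {"i": "IDENT", "if": "IF", "ident": "IDENT", "num": "NUMBER"}

def scan(text):
    """Tokenize text in one left-to-right pass using longest match."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        state, i, last = "start", pos, None
        while i < len(text):
            state = step(state, text[i])
            if state is None:
                break
            i += 1
            if state in ACCEPT:
                last = (ACCEPT[state], i)
        if last is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        kind, end = last
        tokens.append((kind, text[pos:end]))
        pos = end
    return tokens
```

With this, `scan("if iffy 42 i")` classifies `if` as the keyword and `iffy` and `i` as identifiers, and each character is examined once per token instead of once per pattern.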



08-03 18:01