问题描述
词法分析器和解析器在理论上真的有很大不同吗?
Are lexers and parsers really that different in theory?
It seems fashionable to hate regular expressions: coding horror, another blog post.
但是,流行的基于词法的工具:色素, geshi 或美化 ,都使用正则表达式.他们似乎什么都没法...
However, popular lexing based tools: pygments, geshi, or prettify, all use regular expressions. They seem to lex anything...
什么时候足够词汇,何时需要EBNF?
When is lexing enough, when do you need EBNF?
有没有人将这些词法分析器产生的令牌与bison或antlr解析器生成器一起使用?
Has anyone used the tokens produced by these lexers with bison or antlr parser generators?
推荐答案
解析器和词法分析器的共同点是
What parsers and lexers have in common:
-
他们从输入中读取某些字母的符号.
- 提示:字母不一定必须是字母.但它必须是该语言的原子符号解析器/词法分析器可以理解.
- 词法分析器的符号:ASCII字符.
- 解析器的符号:特定的标记,它们是其语法的末端符号.
They read symbols of some alphabet from their input.
- Hint: The alphabet doesn't necessarily have to be of letters. But ithas to be of symbols which are atomic for the languageunderstood by parser/lexer.
- Symbols for the lexer: ASCII characters.
- Symbols for the parser: the particular tokens, which are terminal symbols of their grammar.
- 真正的区别通常就在这里.有关更多信息,请参见下文.
- 词法学家所理解的语法:常规语法(Chomsky的第3级).
- 解析器理解的语法:无上下文语法(Chomsky的2级).
- 词法分析器通过将词法(来自输入的符号字符串)分类为特定的令牌来附加含义.例如.所有这些词素:
*
,==
,<=
,^
将被C/C ++词法分析器分类为运算符"令牌. - 解析器通过将来自输入(句子)的令牌字符串分类为特定的 nonterminals 并构建 parse树来附加含义.例如.所有这些令牌字符串:
[number][operator][number]
,[id][operator][id]
,[id][operator][number][operator][number]
将被C/C ++解析器非终止地分类为表达式".
- Lexers attach meaning by classifying lexemes (strings of symbols from the input) as the particular tokens. E.g. All these lexemes:
*
,==
,<=
,^
will be classified as "operator" token by the C/C++ lexer. - Parsers attach meaning by classifying strings of tokens from the input (sentences) as the particular nonterminals and building the parse tree. E.g. all these token strings:
[number][operator][number]
,[id][operator][id]
,[id][operator][number][operator][number]
will be classified as "expression" nonterminal by the C/C++ parser.
- 当词法分析器识别出组成适当数字的字符序列时,可以将其转换为其二进制值,并使用数字"令牌进行存储.
- 类似地,当解析器识别出一个表达式时,它可以计算其值并将其存储在语法树的"expression"节点中.
- Lexers产生令牌,这是他们识别的常规语言的句子.每个令牌可以具有内部语法(尽管是3级,而不是2级),但这对于输出数据和读取它们的令牌都没有关系.
- 解析器产生语法树,它们是他们识别的无上下文语言的句子的表示.通常,对于整个文档/源文件来说,它只是一棵大树,因为整个文档/源文件对他们来说都是正确的句子.但是,没有任何理由使解析器无法在其输出中生成一系列语法树.例如.它可能是一个识别粘贴到纯文本中的SGML标签的解析器.因此,它将把SGML文档标记成一系列标记:
[TXT][TAG][TAG][TXT][TAG][TXT]...
.
- Lexers produce tokens, which are sentences of the regular language they recognize. Each token can have an inner syntax (though level 3, not level 2), but that doesn't matter for the output data and for the one which reads them.
- Parsers produce syntax trees, which are representations of sentences of the context-free language they recognize. Usually it's only one big tree for the whole document/source file, because the whole document/source file is a proper sentence for them. But there aren't any reasons why parser couldn't produce a series of syntax trees on its output. E.g. it could be a parser which recognizes SGML tags sticked into plain-text. So it'll tokenize the SGML document into a series of tokens:
[TXT][TAG][TAG][TXT][TAG][TXT]...
.
如您所见,解析器和令牌生成器有很多共同点.一个解析器可以是其他解析器的标记器,该解析器从其自己的字母表中将输入标记作为符号读取(标记只是某些字母的符号),就像一种语言的句子可以是其他更高级别的字母符号一样语言.例如,如果*
和-
是字母M
的符号(作为摩尔斯电码符号"),则可以构建一个解析器,将这些点和线的字符串识别为摩尔斯电码中编码的字母.对于其他解析器,摩尔斯代码"语言中的句子可能是代币,对于这些解析器来说,这些 tokens 是其语言的原子符号(例如英语单词"语言).这些英语单词"可能是理解英语句子"语言的某些高级解析器的标记(字母的符号).并且所有这些语言的区别仅在于语法的复杂性.没什么.
As you can see, parsers and tokenizers have much in common. One parser can be a tokenizer for other parser, which reads its input tokens as symbols from its own alphabet (tokens are simply symbols of some alphabet) in the same way as sentences from one language can be alphabetic symbols of some other, higher-level language. For example, if *
and -
are the symbols of the alphabet M
(as "Morse code symbols"), then you can build a parser which recognizes strings of these dots and lines as letters encoded in the Morse code. The sentences in the language "Morse Code" could be tokens for some other parser, for which these tokens are atomic symbols of its language (e.g. "English Words" language). And these "English Words" could be tokens (symbols of the alphabet) for some higher-level parser which understands "English Sentences" language. And all these languages differ only in the complexity of the grammar. Nothing more.
那么这些乔姆斯基语法水平"到底是什么呢?好吧,Noam Chomsky根据语法的复杂程度将其分为四个级别:
So what's all about these "Chomsky's grammar levels"? Well, Noam Chomsky classified grammars into four levels depending on their complexity:
-
级别3:常规语法
它们使用正则表达式,也就是说,它们只能由字母(a
,b
),它们的级联(ab
,aba
,bbb
等)或替代符号(例如a|b
).
它们可以实现为有限状态自动机(FSA),例如NFA(非确定有限自动机)或更好的DFA(确定性有限自动机).
常规语法不能用处理嵌套语法,例如正确地嵌套/匹配的括号(()()(()()))
,嵌套的HTML/BBcode标签,嵌套的块等.这是因为要处理它的状态自动机必须具有无限多个状态才能处理无限多个嵌套级别. -
级别2:无上下文语法
它们可以在语法树中具有嵌套的,递归的,自相似的分支,因此可以很好地处理嵌套结构.
它们可以实现为带有堆栈的状态自动机.该堆栈用于表示语法的嵌套级别.实际上,它们通常被实现为自上而下的递归下降解析器,该解析器使用机器的过程调用堆栈来跟踪嵌套级别,并对其语法中的每个非终端符号使用递归调用的过程/函数. >但是它们不能使用上下文敏感语法处理.例如.当您具有表达式x+3
时,在一个上下文中,该x
可以是变量的名称,在其他上下文中,它可以是函数的名称,等等. -
级别1:上下文相关语法
-
第0级:无限制的语法
也称为递归枚举语法.
Level 3: Regular grammars
They use regular expressions, that is, they can consist only of the symbols of alphabet (a
,b
), their concatenations (ab
,aba
,bbb
etd.), or alternatives (e.g.a|b
).
They can be implemented as finite state automata (FSA), like NFA (Nondeterministic Finite Automaton) or better DFA (Deterministic Finite Automaton).
Regular grammars can't handle with nested syntax, e.g. properly nested/matched parentheses(()()(()()))
, nested HTML/BBcode tags, nested blocks etc. It's because state automata to deal with it should have to have infinitely many states to handle infinitely many nesting levels.Level 2: Context-free grammars
They can have nested, recursive, self-similar branches in their syntax trees, so they can handle with nested structures well.
They can be implemented as state automaton with stack. This stack is used to represent the nesting level of the syntax. In practice, they're usually implemented as a top-down, recursive-descent parser which uses machine's procedure call stack to track the nesting level, and use recursively called procedures/functions for every non-terminal symbol in their syntax.
But they can't handle with a context-sensitive syntax. E.g. when you have an expressionx+3
and in one context thisx
could be a name of a variable, and in other context it could be a name of a function etc.Level 1: Context-sensitive grammars
Level 0: Unrestricted grammars
Also called recursively enumerable grammars.
这篇关于lexers vs解析器的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!