问题描述
我想从令牌列表中构造一个AST。我正在编写脚本语言,我已经完成了词法分析部分,但我不知道如何创建AST。所以问题是,我该怎么做这样的事情:
I want to construct an AST from a list of tokens. I'm making a scripting language and I've already done the lexical analysis part, but I have no idea how to create an AST. So the question is, how do I take something like this:
WORD, int
WORD, x
SYMBOL, =
NUMBER, 5
SYMBOL, ;
并将其转换为抽象语法树?我喜欢这样做没有像ANTLR这样的库,或者其他什么,我宁愿自己尝试从头开始。但是,如果这是一个非常复杂的任务,我不介意使用库:)谢谢
and convert it into an Abstract Syntax Tree? Preferably, I'd like to do so without a library like ANTLR or whatever, I'd rather try and do it from scratch myself. However, if it's a really complex task, I don't mind using a library :) Thanks
推荐答案
根本的诀窍是认识到解析,无论如何完成,都是以增量步骤进行的,包括逐个读取令牌。
The fundamental trick is to recognize that parsing, however accomplished, happens in incremental steps, including the reading of the tokens one by one.
在每个增量步骤中,都有机会构建部分AST通过组合由其他增量步骤构建的AST片段。这是一个递归的想法,它在扫描时为标记构建AST叶节点。这个基本思想出现在几乎所有AST构建解析器中。
At each incremental step, there is an opportunity to build part of the AST by combining AST fragments built by other incremental steps. This is a recursive idea, and it bottoms out in building AST leaf nodes for tokens as they are scanned. This basic idea occurs in pretty much all AST-building parsers.
如果构建一个递归下降解析器,实际上构建了一个递归过程的协作系统,每个都是它识别正在实现的语法中的非终结符。对于纯解析,每个过程只返回一个非终结(未)识别的布尔值。
If one builds a recursive descent parser, one in effect builds a cooperating system of recursive procedures, each one of which recognizes a nonterminal in whatever grammar is being implemented. For pure parsing, each procedure simply returns a boolean for "nonterminal (not) recognized".
要使用递归下降解析器构建AST,可以设计这些过程以返回两个值:布尔识别,如果被识别,则为非终结符构造(以某种方式)AST。 (常见的黑客是返回一个指针,对于未识别是空的,或者如果识别则指向构造的AST)。构建单个过程的结果AST的方法是组合它调用的子过程中的AST。对于叶子过程来说,这是非常简单的,它读取输入令牌并可以立即构建树。
To build an AST with a recursive descent parser, one designs these procedures to return two values: the boolean "recognized", and, if recognized, an AST constructed (somehow) for the nonterminal. (A common hack is return a pointer, which is void for "not recognized", or points to the constructed AST if "recognized"). The way the resulting AST for a single procedure is built, is by combining the ASTs from the sub-procedures that it invokes. This is pretty trivial to do for leaf procedures, which read an input token and can immediately build a tree.
所有这一切的缺点是必须手动编码递归下降,并使用树构建步骤对其进行扩充。在宏观方案中,这对于小型语法来说实际上很容易编码。
The downside to all this is one must manually code the recursive descent, and augment it with the tree building steps. In the grand scheme of things, this is actually pretty easy to code for small grammars.
对于OP的例子,假设我们有这个语法:
For OP's example, assume we have this grammar:
GOAL = ASSIGNMENT
ASSIGNMENT = LHS '=' RHS ';'
LHS = IDENTIFIER
RHS = IDENTIFIER | NUMBER
好的,我们的递归下降解析器:
OK, our recursive descent parser:
boolean parse_Goal()
{ if parse_Assignement()
then return true
else return false
}
boolean parse_Assignment()
{ if not Parse_LHS()
then return false
if not Parse_equalsign()
then throw SyntaxError // because there are no viable alternatives from here
if not Parse_RHS()
then throw SyntaxError
if not Parse_semicolon()
the throw SyntaxError
return true
}
boolean parse_LHS()
{ if parse_IDENTIFIER()
then return true
else return false
}
boolean parse_RHS()
{ if parse_IDENTIFIER()
then return true
if parse_NUMBER()
then return true
else return false
}
boolean parse_equalsign()
{ if TestInputAndAdvance("=") // this can check for token instead
then return true
else return false
}
boolean parse_semicolon()
{ if TestInputAndAdvance(";")
then return true
else return false
}
boolean parse_IDENTIFIER()
{ if TestInputForIdentifier()
then return true
else return false
}
boolean parse_NUMBER()
{ if TestInputForNumber()
then return true
else return false
}
现在,让我们修改它构建一个抽象语法树:
Now, let's revise it build a abstract syntax tree:
AST* parse_Goal() // note: we choose to return a null pointer for "false"
{ node = parse_Assignment()
if node != NULL
then return node
else return NULL
}
AST* parse_Assignment()
{ LHSnode = Parse_LHS()
if LHSnode == NULL
then return NULL
EqualNode = Parse_equalsign()
if EqualNode == NULL
then throw SyntaxError // because there are no viable alternatives from here
RHSnode = Parse_RHS()
if RHSnode == NULL
then throw SyntaxError
SemicolonNode = Parse_semicolon()
if SemicolonNode == NULL
the throw SyntaxError
return makeASTNode(ASSIGNMENT,LHSNode,RHSNode)
}
AST* parse_LHS()
{ IdentifierNode = parse_IDENTIFIER()
if node != NULL
then return IdentifierNode
else return NULL
}
AST* parse_RHS()
{ RHSnode = parse_IDENTIFIER()
if RHSnode != null
then return RHSnode
RHSnode = parse_NUMBER()
if RHSnode != null
then return RHSnode
else return NULL
}
AST* parse_equalsign()
{ if TestInputAndAdvance("=") // this can check for token instead
then return makeASTNode("=")
else return NULL
}
AST* parse_semicolon()
{ if TestInputAndAdvance(";")
then return makeASTNode(";")
else return NULL
}
AST* parse_IDENTIFIER()
{ text = TestInputForIdentifier()
if text != NULL
then return makeASTNode("IDENTIFIER",text)
else return NULL
}
AST* parse_NUMBER()
{ text = TestInputForNumber()
if text != NULL
then return makeASTNode("NUMBER",text)
else return NULL
}
我显然已经掩饰了一些细节,但我认为读者可以毫不费力地填写它们。
I've obviously glossed over some details, but I assume the reader will have no trouble filling them in.
像JavaCC和ANTLR这样的解析器生成器工具基本上生成递归下降解析器,并且具有构建非常类似的树的工具。
Parser generator tools like JavaCC and ANTLR basically generate recursive descent parsers, and have facilities for constructing trees that work very much like this.
构建自下而上解析器(YACC,Bison,GLR,...)的解析器生成器工具也以相同的样式构建AST节点。但是,没有一组递归函数;相反,这些工具管理一堆令牌和减少到非终结的令牌。 AST节点构建在并行堆栈上;当减少发生时,由缩减覆盖的堆栈部分上的AST节点被组合以产生非终结AST节点以替换它们。这种情况发生在语法规则的零大小堆栈段中,它们也是空的,导致AST节点(通常用于空列表或缺少选项)看似无处不在。
Parser generator tools that build bottom-up parsers (YACC, Bison, GLR, ...) also build AST nodes in the same style. However, there is no set of recursive functions; instead, a stack of tokens seen and reduced-to nonterminals is managed by these tools. The AST nodes are constructed on a parallel stack; when a reduction occurs, the AST nodes on the part of the stack covered by the reduction are combined to produce a nonterminal AST node to replace them. This happens with "zero-size" stack segments for grammar rules which are empty too causing AST nodes (typically for 'empty list' or 'missing option') to seemingly appear from nowhere.
使用多种语言,编写构建树的递归下降解析器是非常实用的。
With bitty languages, writing recursive-descent parsers that build trees is pretty practical.
真实语言的问题(无论是古老的还是古老的,如COBOL或hot和像Scala一样闪亮的是语法规则的数量非常大,并且由于语言的复杂性以及对语言委员会负责的任何语言委员会的坚持而不断添加其他语言提供的新东西(语言嫉妒,见Java,C#和C ++之间的进化竞争。现在编写一个递归下降解析器已经失控,人们倾向于使用解析器生成器。但即使使用解析器生成器,编写所有自定义代码来构建AST节点也是一场大战(而且我们还没有讨论设计一个好的抽象语法与想到的第一件事情所需要的东西)。维持语法规则和AST构建goo随着规模和持续演进而变得越来越难。 (如果您的语言成功,一年之内您就会想要改变它)。因此,即使编写AST构建规则也会变得笨拙。
A problem with real languages (whether old and hoary like COBOL or hot and shiny like Scala) is that the number of grammar rules is pretty large, complicated by the sophistication of the language and the insistence on whatever language committee is in charge of it to perpetually add new goodies offered by other languages ("language envy", see the evolutionary race between Java, C# and C++). Now writing a recursive descent parser gets way out of hand and one tends to use parser generators. But even with a parser generator, writing all the custom code to build AST nodes is also a big battle (and we haven't discussed what it takes to design a good "abstract" syntax vs. the first thing that comes to mind). Maintaining grammar rules and AST building goo gets progressively harder with scale and ongoing evolution. (If your language is successful, within a year you'll want to change it). So even writing the AST building rules gets awkward.
理想情况下,人们只想编写语法,并获得解析器和树。你,对语法工程师没有任何作用;自1995年以来,它一直在这样做.ANTLR的人终于在2014年想出了这个,而ANTLR4现在提供了这样的选项。
Ideally, one would just like to write a grammar, and get a parser and tree. You can do this with some recent parser generators: Our DMS Software Reengineering Toolkit accepts full context free grammars, and automatically constructs an AST, no work on the grammar engineer's part; its been doing this since 1995. The ANTLR guys finally figured this out in 2014, and ANTLR4 now offers an option like this.
最后一点:有一个解析器(即使是一个AST)几乎不是你要解决的实际问题的解决方案,无论它是什么。它只是一个基础部分,对于大多数解析器 - 新手而言,它的震撼很大,它是操作代码的工具的最小部分。谷歌我的解析后的生活文章(或检查我的生物)了解更多细节。
Last point: having a parser (even with an AST) is hardly a solution to the actual problem you set out to solve, whatever it was. Its just a foundation piece, and much to the shock for most parser-newbies, it is the smallest part to a tool that manipulates code. Google my essay on Life After Parsing (or check my bio) for more detail.
这篇关于使用标记列表构造抽象语法树的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!