Question
My ultimate goal is to parse a structured file as a tree of in-memory objects that I can then manipulate. The file format that I'm using is fairly sophisticated with about 200 keywords/tags, and this seemed like a good reason to learn about parser/lexer frameworks.
Unfortunately, there are so many concepts (and hundreds of tutorials and guides) that the learning process so far feels like trying to drink from a fire hose. So I'm taking some very meager baby steps, starting with this example.
I modified the grammar to create the following test, Nano.g4:
grammar Nano;
r : root ;
root : START ROOT ID END ROOT;
START : 'StartBlock' ;
END : 'EndBlock' ;
ROOT : 'RootItem' ;
ID : [a-z]+ ; // match lower-case identifiers
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Next, I created a simple input file, nano.txt:
StartBlock RootItem
foo
EndBlock RootItem
I then loaded the code using the following commands:
del *.class
del *.java
java org.antlr.v4.Tool Nano.g4
javac Nano*.java
java org.antlr.v4.runtime.misc.TestRig Nano r -gui < nano.txt
This gave me the following result:
The tree above is my first conceptual hangup about what to expect from a lexer and parser. The "StartBlock RootItem" and "EndBlock RootItem" tokens are necessary in terms of making the input file legal, but conceptually I don't need them after I've proven that the file is properly formatted. The only thing that I care about from that point on is that there's a RootItem that contains "foo", as shown here:
Again, I'm painfully new to parser/lexer concepts. Should I (or, is it even possible to) write the grammar so the output tree matches the image above? Or should I take care of that in some subsequent step that traverses the AST and only extracts the relevant data fields?
Recommended answer
ANTLR 4 produces parse trees, not ASTs. This is an important difference from the behavior of ANTLR 3, and was chosen to help with long-term maintenance of grammars. In particular, situations may arise where users do want access to the tokens, e.g. as part of a semantic highlighting component in an IDE. Rather than force users to write application-specific modified grammars in such a scenario, we chose to always include all tokens in the parse tree.
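Since the parse tree always carries every token, one way to get the condensed "RootItem containing foo" shape from the question is a separate pass that walks the tree and keeps only the fields of interest. The sketch below illustrates that idea in plain Java. `PNode`, `RootItem`, and `NanoCondenser` are hypothetical stand-ins invented for this illustration and are not ANTLR classes; with ANTLR-generated code the same extraction would typically live in a listener or visitor over the generated parse tree.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a parse-tree node (NOT ANTLR's ParseTree API).
final class PNode {
    final String label;          // rule name ("r", "root") or token type ("START", "ID", ...)
    final String text;           // matched text for tokens; null for rule nodes
    final List<PNode> children;  // empty for tokens

    private PNode(String label, String text, List<PNode> children) {
        this.label = label;
        this.text = text;
        this.children = children;
    }

    static PNode token(String type, String text) {
        return new PNode(type, text, Collections.<PNode>emptyList());
    }

    static PNode rule(String name, PNode... children) {
        return new PNode(name, null, Arrays.asList(children));
    }
}

// Hypothetical application object: all that is needed once the input is known to be valid.
final class RootItem {
    final String id;

    RootItem(String id) {
        this.id = id;
    }
}

final class NanoCondenser {
    // Finds the "root" rule node and keeps only its ID token, dropping START/END/ROOT.
    static RootItem condense(PNode tree) {
        PNode root = findRule(tree, "root");
        if (root == null) {
            throw new IllegalArgumentException("no 'root' rule in tree");
        }
        for (PNode child : root.children) {
            if (child.text != null && child.label.equals("ID")) {
                return new RootItem(child.text);
            }
        }
        throw new IllegalArgumentException("'root' rule has no ID token");
    }

    // Depth-first search for the first rule node with the given name.
    static PNode findRule(PNode node, String name) {
        if (node.text == null && node.label.equals(name)) {
            return node;
        }
        for (PNode child : node.children) {
            PNode hit = findRule(child, name);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hand-built tree mirroring what the Nano grammar yields for nano.txt.
        PNode tree = PNode.rule("r",
            PNode.rule("root",
                PNode.token("START", "StartBlock"),
                PNode.token("ROOT", "RootItem"),
                PNode.token("ID", "foo"),
                PNode.token("END", "EndBlock"),
                PNode.token("ROOT", "RootItem")));
        System.out.println(condense(tree).id); // prints "foo"
    }
}
```

Keeping the grammar faithful to the file format and doing the condensing in a later pass means the same grammar can still serve tools that need every token.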