问题描述
如果有人能让我从前瞻关系与涉及贪婪/非贪婪匹配的标记化背后的困惑中清醒过来,我会非常高兴.请注意,这是一篇略长的帖子,因为它遵循了我的思考过程.
If someone would clear my mind from the confusion behind look-ahead relation to tokenizing involving greery/non-greedy matching i'd be more than glad. Be ware this is a slightly long post because it's following my thought process behind.
我正在尝试编写允许我匹配输入的 antlr3 语法,例如:
I'm trying to write antlr3 grammar that allows me to match input such as:
标识符关键字"
我想出了一个类似 Antlr 3.4 的语法:
I came up with a grammar like so in Antlr 3.4:
KEYWORD: 'keyword' ;
IDENTIFIER
:
(options {greedy=false;}: (LOWCHAR|HIGHCHAR))+
;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
parse: IDENTIFIER KEYWORD EOF;
但是它抱怨它永远无法以这种方式匹配 IDENTIFIER,我真的不明白.(以下替代方案永远无法匹配:1)
however it complains about it can never match IDENTIFIER this way, which i don't really understand. (The following alternatives can never be matched: 1)
基本上,我试图为尝试匹配 (LOWCHAR|HIGHCHAR) 非贪婪方式的词法分析器指定,以便它在 KEYWORD 前瞻处停止.到目前为止,我所读到的关于 ANTLR 词法分析器的内容应该是词法分析器规则的某种优先级.如果我在词法分析器语法中首先指定 KEYWORD 词法分析器规则,那么后面的任何词法分析器规则都不能匹配使用的字符.
Basically I was trying to specify for the lexer that try to match (LOWCHAR|HIGHCHAR) non-greedy way so it stops at KEYWORD lookahead. What i've read so far about ANTLR lexers that there supposed to be some kind of precedence of the lexer rules. If i specify KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after shouldn't be able to match the consumed characters.
经过一番搜索,我明白这里的问题是它无法以正确的方式标记输入,因为例如对于输入:identifierkeyword",identifier"部分首先出现,因此它决定在出现时开始匹配 IDENTIFIER 规则还没有匹配的 KEYWORD 标记.
After some searching I understand that problem here is that it can't tokenize the input the right way because for example for input: "identifierkeyword" the "identifier" part comes first so it decides to start matching the IDENTIFIER rule when there is no KEYWORD tokens matched yet.
然后我尝试在 ANTLR 4 中编写相同的语法,以测试新的超前功能是否可以匹配我想要的,它看起来像这样:
Then I tried to write the same grammar in ANTLR 4, to test if the new run-ahead capabilities can match what i want, it looks like this:
KEYWORD: 'keyword' ;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
IDENTIFIER
:
(LOWCHAR|HIGHCHAR)+?
;
parse: IDENTIFIER KEYWORD EOF;
对于输入:identifierkeyword",它会产生以下错误:第 1:1 行不匹配的输入 'd' 需要 'keyword'
for the input: "identifierkeyword" it produces this error:line 1:1 mismatched input 'd' expecting 'keyword'
它匹配字符 'i'(第一个字符)作为 IDENTIFIER 标记,然后解析器需要一个他没有通过这种方式得到的 KEYWORD 标记.
it matches character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token which he doesn't get this way.
词法分析器的非贪婪匹配是否应该匹配,直到有任何其他可能性在前瞻中可用?难道它不应该预见 IDENTIFIER 可以包含 KEYWORD 并以这种方式匹配的可能性吗?
Isn't the non-greedy matching for the lexer supposed to match till any other possibility is available in the look ahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?
我对此感到非常困惑,我观看了视频,其中 Terence Parr 介绍了 ANTLR4 的新功能,在该视频中,他谈到了在实际匹配规则的同时观察所有正确"解决方案直到最后的超前线程.我认为它也适用于词法分析器规则,其中标记输入identifierkeyword"的可能正确解决方案是匹配 IDENTIFIER:identifier"和匹配 KEYWORD:keyword"
I'm really confused about this, I have watched the video where Terence Parr introduces the new capabilities of ANTLR4 where he talks about run-ahead threads that watch for all "right" solutions till the end while actually matching a rule. I thought it would work for Lexer rules too, where a possible right solution for tokenizing input "identifierkeyword" is matching IDENTIFIER: "identifier" and matching KEYWORD: "keyword"
我认为我对非贪婪/贪婪匹配有很多误解.有人可以解释一下它是如何工作的吗?
I think I have lots of wrongs in my head about non-greedy/greedy matching. Could somebody please explain me how it works?
毕竟我在这里发现了一个类似的问题:ANTLR尝试在更长的标记中匹配标记并制作与之对应的语法:
After all this I've found a similar question here: ANTLR trying to match token within longer token and made a grammar corresponding to that:
parse
:
identifier 'keyword'
;
identifier
:
(HIGHCHAR | LOWCHAR)+
;
/** lowercase letters */
LOWCHAR
: 'a'..'z';
/** uppercase letters */
HIGHCHAR
: 'A'..'Z';
这就是我现在想要的,但是我不明白为什么我不能将标识符规则更改为 Lexer 规则并将 LOWCHAR 和 HIGHCHAR 更改为片段.词法分析器不知道关键字"中的字母可以作为标识符匹配吗?或相反亦然?或者可能是规则仅被定义为在其内部具有前瞻性,而不是所有可能的匹配语法?
This does what I want now, however I can't see why I can't change the identifier rule to a Lexer rule and LOWCHAR and HIGHCHAR to fragments.A Lexer doesn't know that letters in "keyword" can be matched as an identifier? or vice versa? Or maybe it is that rules are only defined to have a lookahead inside themselves, not all possible matching syntaxes?
推荐答案
在 ANTLR 3 和 ANTLR 4 中解决此问题的最简单方法是只允许 IDENTIFIER
匹配单个输入字符,并且然后创建一个解析器规则来处理这些字符的序列.
The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to only allow IDENTIFIER
to match a single input character, and then create a parser rule to handle sequences of these characters.
identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;
这会导致词法分析器将输入 identifier
作为 10 个单独的字符跳过,然后将 keyword
作为单个 KEYWORD
标记读取.
This would cause the lexer to skip the input identifier
as 10 separate characters, and then read keyword
as a single KEYWORD
token.
您在 ANTLR 4 中使用非贪婪运算符 +?
观察到的行为与此类似.这个操作符说匹配尽可能少的 (HIGHCHAR|LOWCHAR)
块,同时仍然创建一个 IDENTIFIER
令牌".显然,创建令牌的最少数字是一个,因此这实际上是编写 IDENTIFIER
以匹配单个字符的一种非常低效的方式.parse
规则未能处理这个问题的原因是它只允许一个 IDENTIFIER
标记出现在 KEYWORD
标记之前.通过像我上面展示的那样创建解析器规则 identifier
,解析器将能够将 IDENTIFIER
标记序列(每个标记为单个字符)视为单个标识符.
The behavior you observed in ANTLR 4 using the non-greedy operator +?
is similar to this. This operator says "match as few (HIGHCHAR|LOWCHAR)
blocks as possible while still creating an IDENTIFIER
token". Clearly the fewest number to create the token is one, so this was effectively a highly inefficient way of writing IDENTIFIER
to match a single character. The reason the parse
rule failed to handle this is it only allows a single IDENTIFIER
token to appear before the KEYWORD
token. By creating a parser rule identifier
like I showed above, the parser would be able to treat sequences of IDENTIFIER
tokens (which are each a single character), as a single identifier.
您在 ANTLR 3 中收到消息以下替代方案永远无法匹配..."的原因是静态分析已确定规则 IDENTIFIER 中的正闭包
永远不会匹配多于 1 个字符,因为规则总是在 恰好 1 个字符时成功.
The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is the static analysis has determined that the positive closure in the rule IDENTIFIER
will never match more than 1 character because the rule will always be successful with exactly 1 character.
这篇关于Lexer 前瞻如何与 ANTLR3 和 ANTLR4 中的贪婪和非贪婪匹配一起使用?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!