Problem description
If I have a ONELINE_STRING fragment rule in an ANTLR4 lexer that identifies a simple quoted string on one line, how can I create a more general STRING rule in the lexer that concatenates adjacent ONELINE_STRINGs (i.e., separated only by whitespace and/or comments) as long as they each start on a different line?
That is:
"foo" "bar"
would be parsed as two STRING tokens, "foo" followed by "bar"
while:
"foo"
"bar"
would be seen as one STRING token: "foobar"
For clarification: I generally want the parser to recognize adjacent strings as separate tokens, and to ignore whitespace and comments. However, if the last non-whitespace sub-token on a line is a string, and the first non-whitespace sub-token on the next line is also a string, then the separate strings should be concatenated into one long string. This gives a way to specify potentially very long strings without having to put the whole thing on one line. This would be very straightforward if I wanted all adjacent string sub-tokens to be concatenated, as they are in C, but for my purposes I only want concatenation to occur when the string sub-tokens start on different lines. The concatenation should be invisible to any parser rule that might use a string. This is why I was thinking it might be better to put the rule in the lexer instead of the parser. I'm not wholly opposed to doing this in the parser, though, in which case every parser rule that would have referred to a STRING token would instead refer to a parser string rule.
Sample 1:
"desc" "this sample will parse as two strings.
Sample 3 (note: 'output' is a keyword in the language):
output "this is a very long line that I've explicitly made so that it does not "
"easily fit on just one line, so it gets split up into separate ones for "
"ease of reading, but the parser should see it all as one long string. "
"This example will parse as if the output command had been followed by "
"only a single string, even though it is composed of multiple string "
"fragments, all of which should be invisible to the parser.%n";
Both of these examples should be accepted as valid by the parser. The former is an example of a declaration, while the latter is an example of an imperative statement in the language.
Addendum:
I had originally been thinking that this would need to be done in the lexer. Although newlines are supposed to be ignored by the parser, like all other whitespace, a multiline string is actually sensitive to the presence of newlines, and I did not think the parser could perceive that.
However, I have been thinking that it may be possible to have ONELINE_STRING as a lexer rule and a general 'string' parser rule that detects adjacent ONELINE_STRING tokens, using a predicate between strings to check whether the next ONELINE_STRING token starts on a different line than the previous one. If so, the rule should invisibly concatenate them, so that the resulting text is indistinguishable from a string that had been specified all on one line. I am unsure of the logistics of how this would be implemented, however.
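The line-based joining described above can be sketched independently of ANTLR. This is only an illustration under stated assumptions: `Piece` and `LineJoin` are hypothetical names, each `Piece` stands in for one string token (its unquoted text plus the line it starts on), and the input list is assumed to contain only consecutive string tokens with nothing but whitespace or comments between them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a string token: unquoted text plus its start line.
final class Piece {
    final String text;
    final int line;

    Piece(String text, int line) {
        this.text = text;
        this.line = line;
    }
}

final class LineJoin {
    // Appends a piece to the previous string when it starts on a later line;
    // otherwise it begins a new, separate string.
    static List<String> join(List<Piece> pieces) {
        List<String> out = new ArrayList<>();
        int lastLine = -1;
        for (Piece p : pieces) {
            if (!out.isEmpty() && p.line > lastLine) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + p.text);
            } else {
                out.add(p.text);
            }
            lastLine = p.line;
        }
        return out;
    }
}
```

Two pieces on the same line stay separate, while pieces on successive lines collapse into one string, which is the behavior asked for in the examples at the top of the question.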
OK, I have it.
I need to have the string recognizer in the parser, as some of you have suggested. The trick is to use lexer modes in the lexer.
So in the Lexer file I have this:
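// Default mode: an opening quote switches the lexer into StringMode.
BEGIN_STRING : '"' -> pushMode(StringMode);
mode StringMode;
// A closing quote returns to the previous mode.
END_STRING : '"' -> popMode;
// Any single character other than a line break, '%' or '"'.
STRING_LITERAL_TEXT : ~[\r\n%"];
// Percent escapes; setText replaces the token text with the decoded character.
STRING_LITERAL_ESCAPE_QUOTE : '%"' { setText("\""); };
STRING_LITERAL_ESCAPE_PERCENT : '%%' { setText("%"); };
STRING_LITERAL_ESCAPE_NEWLINE : '%n' { setText("\n"); };
// Zero-width token at a line break or EOF, so the parser can see that the string was not closed.
UNTERMINATED_STRING : { _input.LA(1) == '\n' || _input.LA(1) == '\r' || _input.LA(1) == EOF }? -> popMode;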
And in the parser file I have this:
string returns [String text] locals [int line]
    : a=stringLiteral { $line = $a.line; $text = $a.text; }
      ( { _input.LT(1) != null && _input.LT(1).getLine() > $line }?
        a=stringLiteral { $line = $a.line; $text += $a.text; }
      )*
    ;
stringLiteral returns [int line, String text]
    : BEGIN_STRING { $text = ""; }
      ( a=( STRING_LITERAL_TEXT
          | STRING_LITERAL_ESCAPE_NEWLINE
          | STRING_LITERAL_ESCAPE_QUOTE
          | STRING_LITERAL_ESCAPE_PERCENT
          ) { $text += $a.text; }
      )*
      stringEnd { $line = $BEGIN_STRING.line; }
    ;
stringEnd
    : END_STRING          #string_finish
    | UNTERMINATED_STRING #string_hang
    ;
The string rule thus concatenates adjacent string literals as long as they are on different lines. The stringEnd rule needs an event handler for when a string literal is not terminated correctly so that the parser can report a syntax error, but the string is otherwise treated as if it had been closed correctly.
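For readers who want to check the escape conventions without running ANTLR, the decoding done by the three STRING_LITERAL_ESCAPE_* rules can be approximated in plain Java. This is a rough sketch, not part of the grammar: `PercentEscapes` and `decode` are hypothetical names, the input is assumed to be the raw text between the quotes, and throwing on an unknown or dangling `%` is an assumption standing in for the lexer error the grammar would produce.

```java
final class PercentEscapes {
    // Decodes %" -> ", %% -> % and %n -> newline in the raw body of a string literal.
    static String decode(String body) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c != '%') {
                sb.append(c);
                continue;
            }
            if (i + 1 >= body.length()) {
                // Assumption: a lone trailing '%' is treated as an error.
                throw new IllegalArgumentException("dangling % at end of string");
            }
            char escape = body.charAt(++i);
            switch (escape) {
                case '"': sb.append('"'); break;
                case '%': sb.append('%'); break;
                case 'n': sb.append('\n'); break;
                default:
                    // Assumption: escapes the grammar does not define are rejected.
                    throw new IllegalArgumentException("unknown escape %" + escape);
            }
        }
        return sb.toString();
    }
}
```

For example, the body `parser.%n` from Sample 3 would decode to `parser.` followed by a newline character.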
Recommended answer
As already mentioned, the (IMO) better way would be to handle this inside the parser. But here's a way to handle it in the lexer:
STRING
: SINGLE_STRING ( LINE_CONTINUATION SINGLE_STRING )*
;
HIDDEN
: ( SPACE | LINE_BREAK | COMMENT ) -> channel(HIDDEN)
;
fragment SINGLE_STRING
: '"' ~'"'* '"'
;
fragment LINE_CONTINUATION
: ( SPACE | COMMENT )* LINE_BREAK ( SPACE | COMMENT )*
;
fragment SPACE
: [ \t]
;
fragment LINE_BREAK
: [\r\n]
| '\r\n'
;
fragment COMMENT
: '//' ~[\r\n]*
;
Tokenizing the input
"a" "b"

"c"
"d"

"e"

"f"
will create the following 5 tokens:
"a"
"b"
"c"\n"d"
"e"
"f"
However, if the token would include a comment:
"c" // comment
"d"
then you'd need to strip this "// comment" from the token yourself at a later stage. The lexer will not be able to put this substring on a different channel, or skip it.
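One way that later stripping stage might look is sketched below in Java. It is an assumption-laden illustration, not part of the answer's grammar: `MergedString` and `contents` are hypothetical names, the input is the full text of one merged STRING token, and quoted pieces are assumed to follow the SINGLE_STRING shape (no escaped quotes inside).

```java
final class MergedString {
    // Joins the contents of every quoted piece in a merged token,
    // dropping the whitespace, line breaks and // comments between them.
    static String contents(String tokenText) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        int n = tokenText.length();
        while (i < n) {
            char c = tokenText.charAt(i);
            if (c == '"') {
                // Copy everything up to the matching closing quote.
                int close = tokenText.indexOf('"', i + 1);
                if (close < 0) {
                    throw new IllegalArgumentException("unterminated piece");
                }
                sb.append(tokenText, i + 1, close);
                i = close + 1;
            } else if (tokenText.startsWith("//", i)) {
                // Skip the comment up to, but not including, the line break.
                while (i < n && tokenText.charAt(i) != '\r' && tokenText.charAt(i) != '\n') {
                    i++;
                }
            } else {
                i++; // whitespace or line break between pieces
            }
        }
        return sb.toString();
    }
}
```

Scanning character by character, rather than searching for quoted substrings with a regular expression, keeps a quote character inside a comment from being mistaken for the start of a string piece.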