Problem description
I have an ANTLR4 lexer grammar. It has many rules for words, but I also want it to create an Unknown token for any word that none of the other rules can match. I have something like this:
    Whitespace : [ \t\n\r]+ -> skip;
    Punctuation : [.,:;?!];
    // Other rules here
    Unknown : .+? ;
Now the generated matcher catches '~' as Unknown, but it creates three '~' Unknown tokens for the input '~~~' instead of a single '~~~' token. How can I tell the lexer to generate one word token for consecutive unknown characters? I also tried "Unknown : . ;" and "Unknown : .+ ;", with no success.
EDIT: In current ANTLR versions, .+? now catches the remaining words, so this problem seems to be resolved.
A .+? at the end of a lexer rule will always match a single character. But .+ will consume as much as possible, which was illegal at the end of a rule in ANTLR v3 (v4 probably as well).
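The difference between the reluctant and the greedy quantifier can be illustrated outside ANTLR. The sketch below uses java.util.regex, not the ANTLR lexer, so it is only an analogy: the class name GreedyDemo and the helper matches are made up for this illustration, and ANTLR's own matching rules (longest match across all lexer rules) are not modeled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Collects every successive match of `regex` in `input`.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Reluctant: nothing follows the quantifier, so one character suffices.
        System.out.println(matches(".+?", "~~~")); // [~, ~, ~]
        // Greedy: consumes as many characters as it can.
        System.out.println(matches(".+", "~~~"));  // [~~~]
    }
}
```

With nothing after it in the pattern, the reluctant form is satisfied by the shortest possible match, which is the same effect as the three separate '~' tokens described in the question.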
What you can do is just match a single char, and "glue" these together in the parser:
    unknowns : Unknown+ ;
    ...
    Unknown : . ;
EDIT

"...but I only have a lexer, no parser..."

Ah, I see. Then you could override the nextToken() method:
    lexer grammar Lex;

    @members {
      public static void main(String[] args) {
        Lex lex = new Lex(new ANTLRInputStream("foo, bar...\n"));
        for(Token t : lex.getAllTokens()) {
          System.out.printf("%-15s '%s'\n", tokenNames[t.getType()], t.getText());
        }
      }

      private java.util.Queue<Token> queue = new java.util.LinkedList<Token>();

      @Override
      public Token nextToken() {

        if(!queue.isEmpty()) {
          return queue.poll();
        }

        Token next = super.nextToken();

        if(next.getType() != Unknown) {
          return next;
        }

        StringBuilder builder = new StringBuilder();

        while(next.getType() == Unknown) {
          builder.append(next.getText());
          next = super.nextToken();
        }

        // The `next` will _not_ be an Unknown-token, store it in
        // the queue to return the next time!
        queue.offer(next);

        return new CommonToken(Unknown, builder.toString());
      }
    }

    Whitespace : [ \t\n\r]+ -> skip ;
    Punctuation : [.,:;?!] ;
    Unknown : . ;
Running it:

    java -cp antlr-4.0-complete.jar org.antlr.v4.Tool Lex.g4
    javac -cp antlr-4.0-complete.jar *.java
    java -cp .:antlr-4.0-complete.jar Lex

will print:

    Unknown         'foo'
    Punctuation     ','
    Unknown         'bar'
    Punctuation     '.'
    Punctuation     '.'
    Punctuation     '.'
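One caveat with the nextToken() override: because Whitespace is skipped inside the lexer, two unknown words separated only by whitespace (say "ab c") would probably be glued into a single Unknown token. A possible refinement is to glue only tokens whose character offsets are contiguous. The sketch below shows that idea on a hypothetical stand-in class Tok rather than on real ANTLR tokens. In an actual lexer, the offsets would come from Token.getStartIndex() and Token.getStopIndex(). The names GlueDemo, Tok and glue are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GlueDemo {
    // Hypothetical stand-in for an ANTLR token: a type name, the matched
    // text, and inclusive start/stop character offsets in the input.
    static final class Tok {
        final String type;
        final String text;
        final int start;
        final int stop;

        Tok(String type, String text, int start, int stop) {
            this.type = type;
            this.text = text;
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return type + " '" + text + "'";
        }
    }

    // Merges runs of Unknown tokens, but only when each token starts
    // right after the previous one ends (no skipped characters between).
    static List<Tok> glue(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            Tok last = out.isEmpty() ? null : out.get(out.size() - 1);
            boolean adjacent = last != null
                    && last.type.equals("Unknown")
                    && t.type.equals("Unknown")
                    && t.start == last.stop + 1;
            if (adjacent) {
                out.set(out.size() - 1,
                        new Tok("Unknown", last.text + t.text, last.start, t.stop));
            } else {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Single-character tokens for the input "ab c"; the space at
        // offset 2 was skipped, so 'c' is not contiguous with 'b'.
        List<Tok> in = Arrays.asList(
                new Tok("Unknown", "a", 0, 0),
                new Tok("Unknown", "b", 1, 1),
                new Tok("Unknown", "c", 3, 3));
        System.out.println(glue(in)); // [Unknown 'ab', Unknown 'c']
    }
}
```

The same contiguity check could be added to the while loop of the override. The token that ends a run would then need to be re-examined on the next call, since it may itself be the start of a new Unknown run.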