This post covers what to do when ANTLR 4 fails to recognize Unicode characters correctly.

Problem description

I have a very simple grammar that tries to match 'é' to the token E_CODE. I tested it with the TestRig tool (using the -tokens option), but the parser can't match it correctly. My input file is encoded in UTF-8 without a BOM, and I am using ANTLR 4.4. Could somebody else check this as well? I got this output on my console:
line 1:0 token recognition error at: 'Ă'

grammar Unicode;

stat:EOF;  
E_CODE: '\u00E9' | 'é';

Recommended answer

I tested the grammar:

grammar Unicode;

stat: E_CODE* EOF;

E_CODE: '\u00E9' | 'é';

as follows:

UnicodeLexer lexer = new UnicodeLexer(new ANTLRInputStream("\u00E9é"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());

and the following got printed to my console:

éé<EOF>

Tested with 4.2 and 4.3 (4.4 isn't in Maven Central yet).
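
Note that this test hands the lexer a Java string literal, so no file decoding is involved. When the input comes from a file, its bytes have to be decoded with the right charset before the lexer sees them. Below is a minimal sketch, using only the JDK, of reading a file explicitly as UTF-8 instead of relying on the platform default. The class name `Utf8Input` and the method `readUtf8` are made up for illustration; the returned string could then be passed to `new ANTLRInputStream(...)` in the same way as the literal in the test.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Input {
    // Hypothetical helper: decode the whole file as UTF-8,
    // independent of the platform default charset.
    public static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write the two UTF-8 bytes of U+00E9 (0xC3 0xA9) to a temporary file.
        Path tmp = Files.createTempFile("unicode", ".txt");
        Files.write(tmp, new byte[] {(byte) 0xC3, (byte) 0xA9});

        String text = readUtf8(tmp);
        System.out.println(text.length());                      // 1
        System.out.println(Integer.toHexString(text.charAt(0))); // e9

        Files.delete(tmp);
    }
}
```

The two bytes come back as a single character, U+00E9, which is what the E_CODE rule expects.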

Looking at the source, I see that TestRig takes an optional -encoding parameter. Have you tried setting it?
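
The 'Ă' in the error message is consistent with the UTF-8 bytes of 'é' (0xC3 0xA9) being decoded with a single-byte charset such as windows-1250, in which 0xC3 maps to 'Ă'. That windows-1250 was the charset actually in use is an assumption; the question does not state the platform default. A minimal sketch of the effect, with made-up class and method names:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode `text` as UTF-8, then decode those bytes with `charset`.
    public static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, charset);
    }

    public static void main(String[] args) {
        // 'é' written as an escape so the encoding of this source file does not matter.
        String e = "\u00E9";

        // Assumption: the JVM provides windows-1250 (it is not one of the
        // charsets guaranteed by StandardCharsets).
        Charset cp1250 = Charset.forName("windows-1250");

        String wrong = decodeUtf8BytesAs(e, cp1250);
        String right = decodeUtf8BytesAs(e, StandardCharsets.UTF_8);

        System.out.println(wrong.length());                       // 2
        System.out.println(Integer.toHexString(wrong.charAt(0))); // 102 (U+0102, 'Ă')
        System.out.println(right.length());                       // 1
    }
}
```

With the wrong charset the lexer receives two characters, 'Ă' and '©', neither of which matches E_CODE. If this is the cause, passing the option explicitly should help, for example `java org.antlr.v4.runtime.misc.TestRig Unicode stat -tokens -encoding UTF-8 input.txt`, where `input.txt` is a placeholder file name and the TestRig package is the one I believe ANTLR 4.4 uses.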

10-10 20:31