Problem Description
I currently have a working, simple language implemented in Java using ANTLR. What I want to do is embed it in plain text, in a similar fashion to PHP.
For example:
Lorem ipsum dolor sit amet
<% print('consectetur adipiscing elit'); %>
Phasellus volutpat dignissim sapien.
I anticipate that the resulting token stream would look something like:
CDATA OPEN PRINT OPAREN APOS STRING APOS CPAREN SEMI CLOSE CDATA
How can I achieve this, or is there a better way?
There is no restriction on what might be outside the <% block. I assumed something like <% print('%>'); %>, as per Michael Mrozek's answer, would be possible, but outside of a situation like that, <% would always indicate the start of a code block.
Sample Implementation
I developed a solution based on ideas given in Michael Mrozek's answer, simulating Flex's start conditions using ANTLR's gated semantic predicates:
lexer grammar Lexer;

@members {
    boolean codeMode = false;
}

OPEN    : {!codeMode}?=> '<%' { codeMode = true; } ;
CLOSE   : {codeMode}?=> '%>' { codeMode = false; } ;
LPAREN  : {codeMode}?=> '(' ;
//etc.

CHAR    : {!codeMode}?=> ~('<%') ;

parser grammar Parser;

options {
    tokenVocab = Lexer;
    output = AST;
}

tokens {
    VERBATIM;
}

program :
    (code | verbatim)+
    ;

code :
    OPEN statement+ CLOSE -> statement+
    ;

verbatim :
    CHAR -> ^(VERBATIM CHAR)
    ;
The actual concept looks fine, although it's unlikely you'd have a PRINT token; the lexer would probably emit something like IDENTIFIER, and the parser would be responsible for figuring out that it's a function call (e.g. by looking for IDENTIFIER OPAREN ... CPAREN) and doing the appropriate thing.
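To make the "parser figures out it's a call" part concrete, here is a minimal Java sketch. It is not ANTLR code: the class `CallRecognizer` and the idea of representing tokens as plain strings of their kind names are assumptions made only for illustration. It checks whether a token sequence starts with IDENTIFIER OPAREN and has a matching CPAREN.

```java
import java.util.List;

/** Hypothetical helper, not part of ANTLR; token kinds are plain strings here. */
final class CallRecognizer {
    /** True if the tokens starting at pos look like IDENTIFIER OPAREN ... CPAREN. */
    static boolean isCall(List<String> kinds, int pos) {
        if (pos + 1 >= kinds.size()) return false;
        if (!kinds.get(pos).equals("IDENTIFIER")) return false;
        if (!kinds.get(pos + 1).equals("OPAREN")) return false;
        int depth = 0;
        // Walk forward until the parenthesis opened at pos + 1 is closed again.
        for (int i = pos + 1; i < kinds.size(); i++) {
            String kind = kinds.get(i);
            if (kind.equals("OPAREN")) {
                depth++;
            } else if (kind.equals("CPAREN") && --depth == 0) {
                return true;
            }
        }
        return false; // ran out of tokens before the call was closed
    }

    public static void main(String[] args) {
        List<String> kinds = List.of(
                "IDENTIFIER", "OPAREN", "APOS", "STRING", "APOS", "CPAREN", "SEMI");
        System.out.println(isCall(kinds, 0));
    }
}
```

In a real grammar this decision would normally live in a parser rule rather than a hand-written loop; the sketch only shows that the lexer does not need a dedicated PRINT token for it.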
As for how to do it, I don't know anything about ANTLR, but it probably has something like flex's start conditions. If so, you can have the INITIAL start condition do nothing but look for <%, which would switch to the CODE state where all the actual tokens are defined; then '%>' would switch back. In flex it would be:
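%s CODE
%%
<INITIAL>{
    "<%"    {BEGIN(CODE);}
    .       {}
}
  /* These rules carry no start condition. Because CODE was declared with %s
     (inclusive), they are active in CODE, but also in INITIAL; to restrict
     them to code blocks, declare CODE with %x and wrap them in <CODE>{}.
  */
"%>"    {BEGIN(INITIAL);}
"("     {return OPAREN;}
"'"     {return APOS;}
...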
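Since the language in the question is implemented in Java, the same two-state idea can be sketched as a small hand-written scanner. This is only an illustration under stated assumptions: the class `ModeLexer`, the `Mode` enum and the string token names are made up for the example, string contents are not lexed as a STRING token, and a `%>` inside quotes would still close the block.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-state scanner mirroring the INITIAL/CODE start conditions. */
final class ModeLexer {
    enum Mode { INITIAL, CODE }

    /** Returns token kind names such as CDATA, OPEN, OPAREN, CLOSE. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.INITIAL;
        int i = 0;
        while (i < input.length()) {
            if (mode == Mode.INITIAL) {
                // Outside code: everything up to the next "<%" is one CDATA token.
                int open = input.indexOf("<%", i);
                int end = (open < 0) ? input.length() : open;
                if (end > i) tokens.add("CDATA");
                if (open >= 0) {
                    tokens.add("OPEN");
                    mode = Mode.CODE;          // plays the role of BEGIN(CODE)
                    i = open + 2;
                } else {
                    i = end;
                }
            } else if (input.startsWith("%>", i)) {
                tokens.add("CLOSE");
                mode = Mode.INITIAL;           // plays the role of BEGIN(INITIAL)
                i += 2;
            } else {
                char c = input.charAt(i);
                if (c == '(') { tokens.add("OPAREN"); i++; }
                else if (c == ')') { tokens.add("CPAREN"); i++; }
                else if (c == ';') { tokens.add("SEMI"); i++; }
                else if (c == '\'') { tokens.add("APOS"); i++; }
                else if (Character.isLetter(c)) {
                    while (i < input.length() && Character.isLetter(input.charAt(i))) i++;
                    tokens.add("IDENTIFIER");
                } else {
                    i++; // whitespace and anything else is skipped in this sketch
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Lorem ipsum <% print(); %> dolor"));
    }
}
```

The only state carried between tokens is `mode`, which is the same role the `codeMode` flag plays in the ANTLR grammar from the question.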
You need to be careful about things like matching %> in a context where it's not a closing marker, like within a string; it's up to you if you want to allow <% print('%>'); %>, but most likely you do.
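One way to handle the string case is to track whether the scanner is currently inside a quoted literal and only treat %> as a closing marker when it is not. The Java sketch below assumes single-quoted strings with no escape sequences; `CloseFinder` and `findClose` are hypothetical names, not part of ANTLR or flex.

```java
/** Hypothetical helper: finds the "%>" that really closes a code block. */
final class CloseFinder {
    /**
     * Returns the index of the first "%>" at or after from that is not inside
     * a single-quoted string, or -1 if there is none.
     * Assumption: strings are delimited by ' and contain no escape sequences.
     */
    static int findClose(String input, int from) {
        boolean inString = false;
        for (int i = from; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\'') {
                inString = !inString;      // entering or leaving a string literal
            } else if (!inString && input.startsWith("%>", i)) {
                return i;                  // a real closing marker
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String src = "<% print('%>'); %>";
        // Start searching just after the opening "<%".
        System.out.println(findClose(src, 2));
    }
}
```

In a generated lexer the same effect usually comes from matching the whole string literal as a single token, so the characters inside it are never tested against the closing-marker rule.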