How do I lex this input?

Question

I currently have a working, simple language implemented in Java using ANTLR. What I want to do is embed it in plain text, in a similar fashion to PHP.

For example:

Lorem ipsum dolor sit amet
<% print('consectetur adipiscing elit'); %>
Phasellus volutpat dignissim sapien.

I anticipate that the resulting token stream would look something like:

CDATA OPEN PRINT OPAREN APOS STRING APOS CPAREN SEMI CLOSE CDATA

How can I achieve this, or is there a better way?

There is no restriction on what might be outside the <% block. I assumed something like <% print('%>'); %>, as per Michael Mrozek's answer, would be possible, but outside of a situation like that, <% would always indicate the start of a code block.


Sample Implementation

I developed a solution based on ideas given in Michael Mrozek's answer, simulating Flex's start conditions using ANTLR's gated semantic predicates:

lexer grammar Lexer;

@members {
    boolean codeMode = false;
}

OPEN    : {!codeMode}?=> '<%' { codeMode = true; } ;
CLOSE   : {codeMode}?=> '%>' { codeMode = false;} ;
LPAREN  : {codeMode}?=> '(';
//etc.

CHAR    : {!codeMode}?=> ~('<%');


parser grammar Parser;

options {
    tokenVocab = Lexer;
    output = AST;
}

tokens {
    VERBATIM;
}

program :
    (code | verbatim)+
    ;

code :
    OPEN statement+ CLOSE -> statement+
    ;

verbatim :
    CHAR -> ^(VERBATIM CHAR)
    ;
Solution

The actual concept looks fine, although it's unlikely you'd have a PRINT token; the lexer would probably emit something like IDENTIFIER, and the parser would be responsible for figuring out that it's a function call (e.g. by looking for IDENTIFIER OPAREN ... CPAREN) and doing the appropriate thing.
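To illustrate the point about leaving function calls to the parser, here is a minimal Java sketch that recognizes the IDENTIFIER OPAREN ... CPAREN shape in a list of token names. The class `CallRecognizer` and the method `matchCall` are hypothetical names invented for this example, and representing tokens as plain strings is a simplifying assumption; a real ANTLR parser would express this as a grammar rule instead.

```java
import java.util.List;

/** Hypothetical helper: spots a call shape in a stream of token names. */
final class CallRecognizer {

    /**
     * Returns the index just past the call that starts at 'start', or -1 if
     * the tokens there do not form IDENTIFIER OPAREN ... CPAREN.
     */
    static int matchCall(List<String> tokens, int start) {
        if (start + 1 >= tokens.size()) return -1;
        if (!tokens.get(start).equals("IDENTIFIER")) return -1;
        if (!tokens.get(start + 1).equals("OPAREN")) return -1;

        int depth = 1; // one OPAREN is already open
        for (int i = start + 2; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("OPAREN")) {
                depth++;
            } else if (t.equals("CPAREN") && --depth == 0) {
                return i + 1; // position after the matching CPAREN
            }
        }
        return -1; // parentheses never balanced
    }
}
```

The lexer only has to emit IDENTIFIER for `print`; deciding that it names a function is done here, after tokenization.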

As for how to do it, I don't know anything about ANTLR, but it probably has something like flex's start conditions. If so, you can have the INITIAL start condition do nothing but look for <%, which would switch to the CODE state where all the actual tokens are defined; then '%>' would switch back. In flex it would be:

%s CODE

%%

<INITIAL>{
    "<%"      {BEGIN(CODE);}
    .         {}
}

 /* All these are implicitly in CODE because it was declared %s,
    but you could wrap it in <CODE>{} too
  */
"%>"          {BEGIN(INITIAL);}
"("           {return OPAREN;}
"'"           {return APOS;}
...

You need to be careful about matching %> in a context where it is not a closing marker, such as inside a string. Whether to allow <% print('%>'); %> is up to you, but most likely you do want to allow it.
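The two-state idea, including the string caveat, can be sketched as a small hand-written lexer in Java. This is an illustration under stated assumptions and not the ANTLR or flex implementation: the class `EmbeddedLexer` and the method `tokenize` are hypothetical, tokens are reduced to their names, string literals are assumed to be single-quoted with no escape sequences, and only the handful of code tokens named in the question are handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-mode lexer; token names follow the question's stream. */
final class EmbeddedLexer {

    /** Returns token names only, e.g. "CDATA" or "OPEN". */
    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        boolean codeMode = false; // false plays the role of INITIAL, true of CODE
        int i = 0;
        while (i < src.length()) {
            if (!codeMode) {
                // INITIAL: everything up to the next "<%" becomes one CDATA token.
                int open = src.indexOf("<%", i);
                int end = (open < 0) ? src.length() : open;
                if (end > i) out.add("CDATA");
                if (open >= 0) { out.add("OPEN"); codeMode = true; end += 2; }
                i = end;
                continue;
            }
            char c = src.charAt(i);
            if (src.startsWith("%>", i)) {
                out.add("CLOSE"); codeMode = false; i += 2;
            } else if (c == '\'') {
                // Consume the whole literal so a "%>" inside it is not seen as CLOSE.
                // Assumption: no escaped quotes inside strings.
                int close = src.indexOf('\'', i + 1);
                if (close < 0) throw new IllegalArgumentException("unterminated string at " + i);
                out.add("APOS");
                if (close > i + 1) out.add("STRING");
                out.add("APOS");
                i = close + 1;
            } else if (c == '(') { out.add("OPAREN"); i++; }
            else if (c == ')') { out.add("CPAREN"); i++; }
            else if (c == ';') { out.add("SEMI"); i++; }
            else if (Character.isWhitespace(c)) { i++; }
            else if (Character.isLetter(c)) {
                int j = i;
                while (j < src.length() && Character.isLetterOrDigit(src.charAt(j))) j++;
                out.add("IDENTIFIER");
                i = j;
            } else {
                throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
            }
        }
        return out;
    }
}
```

For the input `Lorem <% print('%>'); %> end`, this sketch yields CDATA OPEN IDENTIFIER OPAREN APOS STRING APOS CPAREN SEMI CLOSE CDATA, because the `%>` inside the quotes is swallowed by the string branch before the CLOSE check can see it.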

08-12 01:20