Question
I want to parse Scala grammar with flex and bison. But I don't know how to parse the newline token in Scala grammar.
If I parse the newline as a token T_NL, here's the Toy.l, for example:
...
[a-zA-Z_][a-zA-Z0-9_]* { yylval->literal = strdup(yytext); return T_ID; }
\n { yylval->token = T_NL; return T_NL; }
[ \t\v\f\r] { /* skip whitespaces */ }
...
And here's the Toy.y, for example:
function_def: 'def' T_ID '(' argument_list ')' return_expression '=' expression T_NL
;
argument_list: argument
| argument ',' argument_list
;
expression: ...
;
return_expression: ...
;
You can see that I have to skip T_NL in all the other statements and definitions in Toy.y, which is really tedious. Please educate me with a source code example!
Answer
This is a clear case where bison push-parsers are useful. The basic idea is that the decision to send an NL token (or tokens) can only be made when the following token has been identified (and, in one corner case, the second following token).
The advantage of push parsers is that they let us implement strategies like this, where there is not necessarily a one-to-one relationship between input lexemes and tokens sent to the parser. I'm not going to deal with all the particularities of setting up a push parser (though it's not difficult); you should refer to the bison manual for details. [Note 1]
First, it's important to read the Scala language description with care. Newline processing is described in section 2.13:
- the token immediately preceding the newline can terminate a statement,
- the token immediately following the newline can begin a statement, and
- the token appears in a region where newlines are enabled.
Rules 1 and 2 are simple lookup tables, which are precisely defined in the following two paragraphs. Rule 2 has just one minor exception, described below:
One hackish possibility to deal with that exception would be to add case[[:space:]]+class and case[[:space:]]+object as lexemes, on the assumption that no-one will put a comment between case and class. (Or you could use a more sophisticated pattern, which allows comments as well as whitespace.) If one of these lexemes is recognised, it could either be sent to the parser as a single (fused) token, or it could be sent as two tokens using two invocations of SEND in the lexer action. (Personally, I'd go with the fused token, since once the pair of tokens has been recognised, there is no advantage to splitting them up; afaik, there's no valid program in which case class can be parsed as anything other than case class. But I could be wrong.)
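As a sketch, the fused-token variant of those Toy.l rules might look like this (T_CASE_CLASS, T_CASE_OBJECT and T_CASE are assumed token names, not from the original grammar):

```lex
case[[:space:]]+class    { SEND(T_CASE_CLASS); }   /* fused "case class" token */
case[[:space:]]+object   { SEND(T_CASE_OBJECT); }  /* fused "case object" token */
"case"                   { SEND(T_CASE); }         /* any other use of case */
```

Because flex prefers the longest match, the fused patterns win whenever class or object follows; a bare case falls through to the last rule.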
To apply rules one and two, then, we need two lookup tables indexed by token number: token_can_end_stmt
and token_cannot_start_stmt
. (The second one has its meaning reversed because most tokens can start statements; doing it this way simplifies initialisation.)
/* YYNTOKENS is defined by bison if you specify %token-table */
static bool token_can_end_stmt[YYNTOKENS] = {
[T_THIS] = true, [T_NULL] = true, /* ... */
};
static bool token_cannot_start_stmt[YYNTOKENS] = {
[T_CATCH] = true, [T_ELSE] = true, /* ... */
};
We're going to need a little bit of persistent state, but fortunately when we're using a push parser, the scanner does not need to return to its caller every time it recognises a token, so we can keep the persistent state as local variables in the scan loop. (That's another advantage of the push-parser architecture.)
From the above description, we can see that what we're going to need to maintain in the scanner state are:
- some indication that a newline has been encountered. This needs to be a count, not a boolean, because we might need to send two newlines;
- the previous token (for rule 1). It's only necessary to record the token number, which is a simple small integer;
- some way of telling whether we're in a "region where newlines are enabled" (for rule 3). I'm pretty sure that this will require assistance from the parser, so I've written it that way here.
By centralising the decision about sending a newline into a single function, we can avoid a lot of code duplication. My typical push-parser architecture uses a SEND
macro anyway, to deal with the boilerplate of saving the semantic value, calling the parser, and checking for errors; it's easy to add the newline logic there:
// yylloc handling mostly omitted for simplicity
#define SEND_VALUE(token, tag, value) do { \
    yylval.tag = value;                    \
    SEND(token);                           \
  } while (0)

#define SEND(token) do {                                               \
    int status = YYPUSH_MORE;                                          \
    if (yylineno != prev_lineno                                        \
        && token_can_end_stmt[prev_token]                              \
        && !token_cannot_start_stmt[token]                             \
        && nl_enabled) {                                               \
      status = yypush_parse(ps, T_NL, NULL, &yylloc, &nl_enabled);     \
      if (status == YYPUSH_MORE                                        \
          && yylineno - prev_lineno > 1)                               \
        status = yypush_parse(ps, T_NL, NULL, &yylloc, &nl_enabled);   \
    }                                                                  \
    nl_encountered = 0;                                                \
    if (status == YYPUSH_MORE)                                         \
      status = yypush_parse(ps, token, &yylval, &yylloc, &nl_enabled); \
    if (status != YYPUSH_MORE) return status;                          \
    prev_token = token;                                                \
    prev_lineno = yylineno;                                            \
  } while (0)
Specifying the local state in the scanner is extremely simple; just place the declarations and initialisations at the top of your scanner rules, indented. Any indented code prior to the first rule is inserted directly into yylex
, almost at the top of the function (so it is executed once per call, not once per lexeme):
%%
int nl_encountered = 0;
int prev_token = 0;
int prev_lineno = 1;
bool nl_enabled = true;
YYSTYPE yylval;
YYLTYPE yylloc = {0};
Now, the individual rules are pretty simple (except for case
). For example, we might have rules like:
"while" { SEND(T_WHILE); }
[[:lower:]][[:alnum:]_]* { SEND_VALUE(T_VARID, str, strdup(yytext)); }
That still leaves the question of how to determine if we are in a region where newlines are enabled.
Most of the rules could be handled in the lexer by just keeping a stack of the different kinds of open parentheses and checking the top of the stack: if the parenthesis on the top of the stack is a {, then newlines are enabled; otherwise, they are not. So we could use rules like:
[{([] { paren_stack_push(yytext[0]); SEND(yytext[0]); }
[])}] { paren_stack_pop(); SEND(yytext[0]); }
However, that doesn't deal with the requirement that newlines be disabled between a case and its corresponding =>. I don't think it's possible to handle that as another type of parenthesis, because there are lots of uses of => which do not correspond with a case, and I believe some of them can come between a case and its corresponding =>.
So a better approach would be to put this logic into the parser, using lexical feedback to communicate the state of the newline-region stack, which is what is assumed in the calls to yypush_parse above. Specifically, the scanner and the parser share one boolean variable (by passing a pointer to the parser). [Note 3] The parser then maintains the value of this variable using mid-rule actions (MRAs) in each rule which matches a region where newlines might be enabled differently, using the parse stack itself as a stack. Here's a small excerpt of a (theoretical) parser:
%define api.pure full
%define api.push-pull push
%locations
%parse-param { bool* nl_enabled }
/* More prologue */
%%
// ...
/* Some example productions which modify nl_enabled: */
/* The actions always need to be before the token, because they need to take
* effect before the next lookahead token is requested.
* Note how the first MRA's semantic value is used to keep the old value
* of the variable, so that it can be restored in the second MRA.
*/
TypeArgs : <boolean>{ $$ = *nl_enabled; *nl_enabled = false; }
           '[' Types
           { *nl_enabled = $1; } ']'
         ;

CaseClause : <boolean>{ $$ = *nl_enabled; *nl_enabled = false; }
             "case" Pattern opt_Guard
             { *nl_enabled = $1; } "=>" Block
           ;

FunDef : FunSig opt_nl
         <boolean>{ $$ = *nl_enabled; *nl_enabled = true; }
         '{' Block
         { *nl_enabled = $3; } '}'
       ;
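One detail the excerpt leaves open is the opt_nl nonterminal used in FunDef; a plausible definition (my assumption, not part of the original excerpt) is simply:

```yacc
opt_nl : %empty
       | T_NL
       ;
```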
Notes:

1. Push parsers have many other advantages; IMHO they are the solution of choice. In particular, using push parsers avoids the circular header dependency which plagues attempts to build pure parser/scanner combinations.
2. There is still the question of multiline comments with preceding and trailing text:
return /* Is this comment
a newline? */ 42
I'm not going to try to answer that question.
3. It would be possible to keep this flag in the YYLTYPE structure, since only one instance of yylloc is ever used in this example. That might be a reasonable optimisation, since it cuts down on the number of arguments sent to yypush_parse. But it seemed a bit fragile, so I opted for a more general solution here.