Problem description
I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens, that is, tokens not separated by whitespace. E.g., the string 39if is supposed to be recognized as the number 39 and the keyword if. At the same time, the lexer must also exit(1) when it encounters invalid input.
A simplified version of the code I have:
%{
#include <stdio.h>
#include <stdlib.h>  /* for exit() */
%}
%option main warn debug
%%
if |
then |
else printf("keyword: %s\n", yytext);
[[:digit:]]+ printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]* printf("identifier: %s\n", yytext);
[[:space:]]+ // skip whitespace
[[:^space:]]+ { printf("ERROR: %s\n", yytext); exit(1); }
%%
When I run this (or my complete version) and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:
number: 39
keyword: if
(I.e., the same as if I had entered 39 if as the input.)
Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, and flex will prefer it. That said, I have no idea how to resolve this situation. It seems unfeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.
UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".
You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the parsing is at is to use the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways (look in the manual for the ways to specify your own i/o functions, for example), but the all-around simplest IMHO is to use a start condition:
%option yylineno
%x LEXING_ERROR
%%
  /* all your rules; the following *must* be at the end */
.                 { BEGIN(LEXING_ERROR); yyless(0); }
<LEXING_ERROR>.+  { fprintf(stderr,
                            "Invalid character '%c' found at line %d,"
                            " just before '%s'\n",
                            *yytext, yylineno, yytext+1);
                    exit(1);
                  }
Note: Make sure that you've ignored whitespace in your rules, and note that yylineno is only kept up to date when %option yylineno is enabled. The pattern .+ matches one or more non-newline characters, in other words everything up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(n) returns all but the first n characters of the current match to the input, so yyless(0) in the . rule pushes the offending character back; the scanner then rescans it in the LEXING_ERROR start condition, producing (hopefully) a semi-reasonable error message. (It won't really be reasonable if your input is multibyte or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex that gets more context, and maybe even limits the number of forward characters read. Lots of options here.)
Look up start conditions in the flex manual for more info about %x and BEGIN.
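Putting the question's rules together with the fallback described above gives roughly the following. This is only a sketch under a few assumptions: the file name scanner.l is made up for illustration, the debug option from the question is dropped to keep the output quiet, and the fallback uses yyless(0) so that the offending character is rescanned as the first character of the error match.

```lex
%{
#include <stdio.h>
#include <stdlib.h>   /* exit() */
%}

%option main warn yylineno
%x LEXING_ERROR

%%

if    |
then  |
else                      { printf("keyword: %s\n", yytext); }
[[:digit:]]+              { printf("number: %s\n", yytext); }
[[:alpha:]][[:alnum:]]*   { printf("identifier: %s\n", yytext); }
[[:space:]]+              { /* skip whitespace, including newlines */ }

.                         { /* Fallback; keep it after all other rules.
                               It matches a single character, so it can
                               never win on length against a real token. */
                            BEGIN(LEXING_ERROR);
                            yyless(0);  /* push the character back */
                          }

<LEXING_ERROR>.+          { /* yytext[0] is the offending character,
                               yytext + 1 is the rest of the line. */
                            fprintf(stderr,
                                    "Invalid character '%c' found at line %d,"
                                    " just before '%s'\n",
                                    *yytext, yylineno, yytext + 1);
                            exit(1);
                          }
%%
```

Built with something like flex scanner.l followed by cc lex.yy.c -o scanner, the input 39if should now be split into number: 39 and keyword: if, because no rule other than the number rule can match more than one character starting at the 3. The keyword rules are listed before the identifier rule on purpose: when two patterns match the same length, flex picks the one that appears first, so if stays a keyword while a longer name such as ifx becomes an identifier.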