When parsing a large 3 gigabyte file with a DCG, efficiency is of great importance.
The current version of my lexer mostly uses the or predicate ;/2, but I have read that indexing can help. The SWI-Prolog documentation says:

Indexing is a technique used to quickly select candidate clauses of a predicate for a specific goal. In most Prolog systems, indexing is done (only) on the first argument of the head. If this argument is instantiated to an atom, integer, float or compound term with functor, hashing is used to quickly select all clauses where the first argument may unify with the first argument of the goal. SWI-Prolog supports just-in-time and multi-argument indexing. See section 2.18.
Can someone give an example of using indexing for lexing and possibly explain how it improves efficiency?
Details
Note: I changed some of the names before copying the source code into this question. If you find a mistake, feel free to edit it here or leave me a comment and I will gladly fix it.
Currently my lexer/tokenizer (based on mzapotoczny/prolog-interpreter parser.pl) is this
% N.B.
% Since the lexer uses "" for values, the double_quotes flag has to be set to `chars`.
% If the double_quotes flag is set to `codes`, the values with "" will not be matched.
:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
:- set_prolog_flag(double_quotes,chars).
lexer(Tokens) -->
white_space,
(
( ":", !, { Token = tokColon }
; "(", !, { Token = tokLParen }
; ")", !, { Token = tokRParen }
; "{", !, { Token = tokLMusta}
; "}", !, { Token = tokRMusta}
; "\\", !, { Token = tokSlash}
; "->", !, { Token = tokImpl}
; "+", !, { Token = tokPlus }
; "-", !, { Token = tokMinus }
; "*", !, { Token = tokTimes }
; "=", !, { Token = tokEqual }
; "<", !, { Token = tokLt }
; ">", !, { Token = tokGt }
; "_", !, { Token = tokUnderscore }
; ".", !, { Token = tokPeriod }
; "/", !, { Token = tokForwardSlash }
; ",", !, { Token = tokComma }
; ";", !, { Token = tokSemicolon }
; digit(D), !,
number(D, N),
{ Token = tokNumber(N) }
; letter(L), !, identifier(L, Id),
{ member((Id, Token), [ (div, tokDiv),
(mod, tokMod),
(where, tokWhere)]),
!
; Token = tokVar(Id)
}
; [_],
{ Token = tokUnknown }
),
!,
{ Tokens = [Token | TokList] },
lexer(TokList)
; [],
{ Tokens = [] }
).
white_space -->
[Char], { code_type(Char, space) }, !, white_space.
white_space -->
"--", whole_line, !, white_space.
white_space -->
[].
whole_line --> "\n", !.
whole_line --> [_], whole_line.
digit(D) -->
[D],
{ code_type(D, digit) }.
digits([D|T]) -->
digit(D),
!,
digits(T).
digits([]) -->
[].
number(D, N) -->
digits(Ds),
{ number_chars(N, [D|Ds]) }.
letter(L) -->
[L], { code_type(L, alpha) }.
alphanum([A|T]) -->
[A], { code_type(A, alnum) }, !, alphanum(T).
alphanum([]) -->
[].
% Plain (non-DCG) alphanum/1 clauses; these are separate from alphanum//1 above.
alphanum([]).
alphanum([H|T]) :- code_type(H, alpha), alphanum(T).
identifier(L, Id) -->
alphanum(As),
{ atom_codes(Id, [L|As]) }.
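For a quick sanity check at the top level, something like this should work (a sketch: the sample line and the expected token list are my assumptions, derived from the grammar above):

?- string_chars("x = 1 + 2", Chars),
   phrase(lexer(Tokens), Chars).
Tokens = [tokVar(x), tokEqual, tokNumber(1), tokPlus, tokNumber(2)].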
Here are some helper predicates used for development and testing.
read_file_for_lexing_and_user_review(Path) :-
open(Path,read,Input),
read_input_for_user_review(Input), !,
close(Input).
read_file_for_lexing_and_performance(Path,Limit) :-
open(Path,read,Input),
read_input_for_performance(Input,0,Limit), !,
close(Input).
read_input(Input) :-
at_end_of_stream(Input).
read_input(Input) :-
\+ at_end_of_stream(Input),
read_string(Input, "\n", "\r\t ", _, Line),
lex_line(Line),
read_input(Input).
read_input_for_user_review(Input) :-
at_end_of_stream(Input).
read_input_for_user_review(Input) :-
\+ at_end_of_stream(Input),
read_string(Input, "\n", "\r\t ", _, Line),
lex_line_for_user_review(Line),
nl,
print('Press spacebar to continue or any other key to exit: '),
get_single_char(Key),
process_user_continue_or_exit_key(Key,Input).
read_input_for_performance(Input,Count,Limit) :-
Count >= Limit.
read_input_for_performance(Input,_,_) :-
at_end_of_stream(Input).
read_input_for_performance(Input,Count0,Limit) :-
% print(Count0),
\+ at_end_of_stream(Input),
read_string(Input, "\n", "\r\t ", _, Line),
lex_line(Line),
Count is Count0 + 1,
read_input_for_performance(Input,Count,Limit).
process_user_continue_or_exit_key(32,Input) :- % space bar
nl, nl,
read_input_for_user_review(Input).
process_user_continue_or_exit_key(Key,_Input) :- % any key other than space bar exits
Key \= 32.
lex_line_for_user_review(Line) :-
lex_line(Line,TokList),
print(Line),
nl,
print(TokList),
nl.
lex_line(Line,TokList) :-
string_chars(Line,Code_line),
phrase(lexer(TokList),Code_line).
lex_line(Line) :-
string_chars(Line,Code_line),
phrase(lexer(TokList),Code_line).
read_user_input_for_lexing_and_user_review :-
print('Enter a line to parse or just Enter to exit: '),
nl,
read_string(user, "\n", "\r", _, String),
nl,
lex_line_for_user_review(String),
nl,
continue_user_input_for_lexing_and_user_review(String).
continue_user_input_for_lexing_and_user_review(String) :-
string_length(String,N),
N > 0,
read_user_input_for_lexing_and_user_review.
continue_user_input_for_lexing_and_user_review(String) :-
string_length(String,0).
read_user_input_for_lexing_and_user_review/0
Allows a user to enter a string at the terminal for lexing and reviewing the tokens.
read_file_for_lexing_and_user_review/1
Reads a file for lexing and reviews the tokens for each line, one line at a time.
read_file_for_lexing_and_performance/2
Reads a file for lexing with a limit on the number of lines to lex. This is used to gather basic performance statistics to measure efficiency. Meant to be used with time/1.
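For example, a timed run over the first 10000 lines of a file might look like this (the path is a placeholder, not from the original post):

?- time(read_file_for_lexing_and_performance('sample_source.txt', 10000)).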
Solution:
You should replace the following:
lexer(Tokens) -->
white_space,
(
( ":", !, { Token = tokColon }
; "(", !, { Token = tokLParen }
; ")", !, { Token = tokRParen }
; "{", !, { Token = tokLMusta}
; "}", !, { Token = tokRMusta}
; "\\", !, { Token = tokSlash}
; "->", !, { Token = tokImpl}
; "+", !, { Token = tokPlus }
; "-", !, { Token = tokMinus }
; "*", !, { Token = tokTimes }
; "=", !, { Token = tokEqual }
; "<", !, { Token = tokLt }
; ">", !, { Token = tokGt }
; "_", !, { Token = tokUnderscore }
; ".", !, { Token = tokPeriod }
; "/", !, { Token = tokForwardSlash }
; ",", !, { Token = tokComma }
; ";", !, { Token = tokSemicolon }
; digit(D), !,
number(D, N),
{ Token = tokNumber(N) }
; letter(L), !, identifier(L, Id),
{ member((Id, Token), [ (div, tokDiv),
(mod, tokMod),
(where, tokWhere)]),
!
; Token = tokVar(Id)
}
; [_],
{ Token = tokUnknown }
),
!,
{ Tokens = [Token | TokList] },
lexer(TokList)
; [],
{ Tokens = [] }
).
with
lexer(Tokens) -->
white_space,
(
(
op_token(Token), ! % replace the long, blindly searched ;/2 chain with a call to the new predicate op_token//1, whose clauses get indexed access on the first argument in the standard Prolog way
;
digit(D), !, number(D, N),
{ Token = tokNumber(N) }
; letter(L), !, identifier(L, Id),
{ member((Id, Token), [ (div, tokDiv),
(mod, tokMod),
(where, tokWhere)]),
!
; Token = tokVar(Id)
}
; [_],
{ Token = tokUnknown }
),
!,
{ Tokens = [Token | TokList] },
lexer(TokList)
;
[],
{ Tokens = [] }
).
%%%
op_token(tokColon) --> ":".
op_token(tokLParen) --> "(".
op_token(tokRParen) --> ")".
op_token(tokLMusta) --> "{".
op_token(tokRMusta) --> "}".
op_token(tokBackSlash) --> "\\".
op_token(tokImpl) --> "->".
op_token(tokPlus) --> "+".
op_token(tokMinus) --> "-".
op_token(tokTimes) --> "*".
op_token(tokEqual) --> "=".
op_token(tokLt) --> "<".
op_token(tokGt) --> ">".
op_token(tokUnderscore) --> "_".
op_token(tokPeriod) --> ".".
op_token(tokSlash) --> "/".
op_token(tokComma) --> ",".
op_token(tokSemicolon) --> ";".
Edit by Guy Coder
I ran a test using the example data posted in the question, read into a list where each item was one line of the data converted to character codes. Then, with time/1, lexer was called on each item in the list, and the test over the whole list was repeated 10000 times. The data was loaded into a list and converted to character codes before time/1 so that those processes did not skew the results. Each of these runs was repeated 5 times to check the consistency of the data.
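A minimal sketch of such a harness (the predicate names benchmark/2 and run_repeats/2 are my assumptions, not the actual test code; only documented SWI-Prolog built-ins are used):

benchmark(Path, Repeats) :-
    read_file_to_string(Path, String, []),       % library(readutil)
    split_string(String, "\n", "\r\t ", Lines),  % one string per line
    maplist(string_chars, Lines, CharLines),     % convert once, before timing
    time(run_repeats(Repeats, CharLines)).

run_repeats(0, _) :- !.
run_repeats(N, CharLines) :-
    forall(member(Cs, CharLines), phrase(lexer(_Tokens), Cs)),
    N1 is N - 1,
    run_repeats(N1, CharLines).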
In the runs below, for all of the different versions, the lexer was extended to cover all of the 7-bit ASCII characters, which significantly increased the number of cases for special characters.
The version of Prolog used for the following was SWI-Prolog 8.0.
For the version in the question.
Version: 1
:- set_prolog_flag(double_quotes,chars).
% 694,080,002 inferences, 151.141 CPU in 151.394 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 150.813 CPU in 151.059 seconds (100% CPU, 4602271 Lips)
% 694,080,001 inferences, 152.063 CPU in 152.326 seconds (100% CPU, 4564439 Lips)
% 694,080,001 inferences, 151.141 CPU in 151.334 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 151.875 CPU in 152.139 seconds (100% CPU, 4570074 Lips)
For the version as posted above in this answer
Version: 2
:- set_prolog_flag(double_quotes,chars).
% 773,260,002 inferences, 77.469 CPU in 77.543 seconds (100% CPU, 9981573 Lips)
% 773,260,001 inferences, 77.344 CPU in 77.560 seconds (100% CPU, 9997705 Lips)
% 773,260,001 inferences, 77.406 CPU in 77.629 seconds (100% CPU, 9989633 Lips)
% 773,260,001 inferences, 77.891 CPU in 77.967 seconds (100% CPU, 9927511 Lips)
% 773,260,001 inferences, 78.422 CPU in 78.644 seconds (100% CPU, 9860259 Lips)
Version 2 gives a dramatic improvement over Version 1 by using indexing.
In doing further research on the code, looking at op_token//1, which is a DCG rule and so has two hidden arguments for implicitly passing around the state representation, using listing/1 showed:
op_token(tokUnderscore,['_'|A], A).
Notice that the first argument is not the character being searched for, while in this answer the indexing code is written as
c_digit(0'0,0).
where the first parameter is the character being searched and the second parameter is the result.
So change this
op_token(Token), !
to this
[S], { special_character_indexed(S,Token) }
with indexed clauses such as
special_character_indexed( ';' ,tokSemicolon).
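A few more clauses in that style, as a sketch (only the tokSemicolon clause appeared in the original; the other pairings follow the op_token//1 clauses above):

special_character_indexed(':', tokColon).
special_character_indexed('(', tokLParen).
special_character_indexed(')', tokRParen).
special_character_indexed('{', tokLMusta).
special_character_indexed('}', tokRMusta).
special_character_indexed('+', tokPlus).
special_character_indexed(';', tokSemicolon).

With the character in the first argument, clause selection hashes on the character just read from the input instead of scanning the clauses linearly. Note that multi-character tokens such as "->" still need separate handling, since [S] consumes a single character.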
Version: 3
:- set_prolog_flag(double_quotes,chars).
% 765,800,002 inferences, 74.125 CPU in 74.348 seconds (100% CPU, 10331197 Lips)
% 765,800,001 inferences, 74.766 CPU in 74.958 seconds (100% CPU, 10242675 Lips)
% 765,800,001 inferences, 74.734 CPU in 74.943 seconds (100% CPU, 10246958 Lips)
% 765,800,001 inferences, 74.828 CPU in 75.036 seconds (100% CPU, 10234120 Lips)
% 765,800,001 inferences, 74.547 CPU in 74.625 seconds (100% CPU, 10272731 Lips)
Version 3 gives a slightly, but consistently, better result than Version 2.
Lastly, just changing the double_quotes flag to atom, as noted in a comment by AntonDanilov:
Version: 4
:- set_prolog_flag(double_quotes,atom).
% 765,800,003 inferences, 84.234 CPU in 84.539 seconds (100% CPU, 9091300 Lips)
% 765,800,001 inferences, 74.797 CPU in 74.930 seconds (100% CPU, 10238396 Lips)
% 765,800,001 inferences, 75.125 CPU in 75.303 seconds (100% CPU, 10193677 Lips)
% 765,800,001 inferences, 75.078 CPU in 75.218 seconds (100% CPU, 10200042 Lips)
% 765,800,001 inferences, 75.031 CPU in 75.281 seconds (100% CPU, 10206414 Lips)
Version 4 is almost the same as Version 3.
Just looking at the CPU numbers, using indexing is roughly twice as fast, e.g. (Version 1) 151.875 vs (Version 3) 74.547.