在使用DCG解析3 GB的大文件时,效率至关重要。
我的词法分析器的当前版本主要使用or或谓词;/2,但我读到索引可以提供帮助。
有人可以举一个使用索引进行词法化的例子,并可能解释一下它如何提高效率吗?
细节
注意:在将源代码处理到此问题之前,我更改了一些名称。如果您发现错误,请随时在此处进行编辑或给我留言,我们将很乐意解决。
目前,我的词法分析器/标记器(基于mzapotoczny/prolog-interpreter parser.pl)是这个
% N.B.
% Since the lexer uses "" for values, the double_quotes flag has to be set to `chars`.
% If double_quotes flag is set to `code`, the the values with "" will not be matched.
:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
:- set_prolog_flag(double_quotes,chars).
lexer(Tokens) -->
white_space,
(
( ":", !, { Token = tokColon }
; "(", !, { Token = tokLParen }
; ")", !, { Token = tokRParen }
; "{", !, { Token = tokLMusta}
; "}", !, { Token = tokRMusta}
; "\\", !, { Token = tokSlash}
; "->", !, { Token = tokImpl}
; "+", !, { Token = tokPlus }
; "-", !, { Token = tokMinus }
; "*", !, { Token = tokTimes }
; "=", !, { Token = tokEqual }
; "<", !, { Token = tokLt }
; ">", !, { Token = tokGt }
; "_", !, { Token = tokUnderscore }
; ".", !, { Token = tokPeriod }
; "/", !, { Token = tokForwardSlash }
; ",", !, { Token = tokComma }
; ";", !, { Token = tokSemicolon }
; digit(D), !,
number(D, N),
{ Token = tokNumber(N) }
; letter(L), !, identifier(L, Id),
{ member((Id, Token), [ (div, tokDiv),
(mod, tokMod),
(where, tokWhere)]),
!
; Token = tokVar(Id)
}
; [_],
{ Token = tokUnknown }
),
!,
{ Tokens = [Token | TokList] },
lexer(TokList)
; [],
{ Tokens = [] }
).
white_space -->
[Char], { code_type(Char, space) }, !, white_space.
white_space -->
"--", whole_line, !, white_space.
white_space -->
[].
whole_line --> "\n", !.
whole_line --> [_], whole_line.
digit(D) -->
[D],
{ code_type(D, digit) }.
digits([D|T]) -->
digit(D),
!,
digits(T).
digits([]) -->
[].
number(D, N) -->
digits(Ds),
{ number_chars(N, [D|Ds]) }.
letter(L) -->
[L], { code_type(L, alpha) }.
alphanum([A|T]) -->
[A], { code_type(A, alnum) }, !, alphanum(T).
alphanum([]) -->
[].
alphanum([]).
alphanum([H|T]) :- code_type(H, alpha), alphanum(T).
identifier(L, Id) -->
alphanum(As),
{ atom_codes(Id, [L|As]) }.
这是用于开发和测试的一些辅助谓词。
read_file_for_lexing_and_user_review(Path) :-
open(Path,read,Input),
read_input_for_user_review(Input), !,
close(Input).
read_file_for_lexing_and_performance(Path,Limit) :-
open(Path,read,Input),
read_input_for_performance(Input,0,Limit), !,
close(Input).
read_input(Input) :-
at_end_of_stream(Input).
read_input(Input) :-
\+ at_end_of_stream(Input),
read_string(Input, "\n", "\r\t ", _, Line),
lex_line(Line),
read_input(Input).
read_input_for_user_review(Input) :-
at_end_of_stream(Input).
read_input_for_user_review(Input) :-
\+ at_end_of_stream(Input),
read_string(Input, "\n", "\r\t ", _, Line),
lex_line_for_user_review(Line),
nl,
print('Press spacebar to continue or any other key to exit: '),
get_single_char(Key),
process_user_continue_or_exit_key(Key,Input).
read_input_for_performance(Input,Count,Limit) :-
Count >= Limit.
read_input_for_performance(Input,_,_) :-
at_end_of_stream(Input).
read_input_for_performance(Input,Count0,Limit) :-
% print(Count0),
\+ at_end_of_stream(Input),
read_string(Input, "\n", "\r\t ", _, Line),
lex_line(Line),
Count is Count0 + 1,
read_input_for_performance(Input,Count,Limit).
process_user_continue_or_exit_key(32,Input) :- % space bar
nl, nl,
read_input_for_user_review(Input).
process_user_continue_or_exit_key(Key) :-
Key \= 32.
lex_line_for_user_review(Line) :-
lex_line(Line,TokList),
print(Line),
nl,
print(TokList),
nl.
lex_line(Line,TokList) :-
string_chars(Line,Code_line),
phrase(lexer(TokList),Code_line).
lex_line(Line) :-
string_chars(Line,Code_line),
phrase(lexer(TokList),Code_line).
read_user_input_for_lexing_and_user_review :-
print('Enter a line to parse or just Enter to exit: '),
nl,
read_string(user, "\n", "\r", _, String),
nl,
lex_line_for_user_review(String),
nl,
continue_user_input_for_lexing_and_user_review(String).
continue_user_input_for_lexing_and_user_review(String) :-
string_length(String,N),
N > 0,
read_user_input_for_lexing_and_user_review.
continue_user_input_for_lexing_and_user_review(String) :-
string_length(String,0).
read_user_input_for_lexing_and_user_review/0
允许用户在终端上输入字符串以进行词法化并查看 token 。read_file_for_lexing_and_user_review/1
读取文件以进行词法分析,并一次查看每一行的标记。read_file_for_lexing_and_performance/2
读取要进行词法化的文件,但要限制词法的行数。这用于收集基本性能统计信息以衡量效率。打算与time/1一起使用。 最佳答案
解决方案:
您应该替换以下内容:
lexer(Tokens) -->
white_space,
(
( ":", !, { Token = tokColon }
; "(", !, { Token = tokLParen }
; ")", !, { Token = tokRParen }
; "{", !, { Token = tokLMusta}
; "}", !, { Token = tokRMusta}
; "\\", !, { Token = tokSlash}
; "->", !, { Token = tokImpl}
; "+", !, { Token = tokPlus }
; "-", !, { Token = tokMinus }
; "*", !, { Token = tokTimes }
; "=", !, { Token = tokEqual }
; "<", !, { Token = tokLt }
; ">", !, { Token = tokGt }
; "_", !, { Token = tokUnderscore }
; ".", !, { Token = tokPeriod }
; "/", !, { Token = tokForwardSlash }
; ",", !, { Token = tokComma }
; ";", !, { Token = tokSemicolon }
; digit(D), !,
number(D, N),
{ Token = tokNumber(N) }
; letter(L), !, identifier(L, Id),
{ member((Id, Token), [ (div, tokDiv),
(mod, tokMod),
(where, tokWhere)]),
!
; Token = tokVar(Id)
}
; [_],
{ Token = tokUnknown }
),
!,
{ Tokens = [Token | TokList] },
lexer(TokList)
; [],
{ Tokens = [] }
).
和
lexer(Tokens) -->
white_space,
(
(
op_token(Token), ! % replace ;/2 long chain searched blindly with call to new predicate op_token//1 which clauses have indexed access by first arg in Prolog standard way
;
digit(D), !, number(D, N),
{ Token = tokNumber(N) }
; letter(L), !, identifier(L, Id),
{ member((Id, Token), [ (div, tokDiv),
(mod, tokMod),
(where, tokWhere)]),
!
; Token = tokVar(Id)
}
; [_],
{ Token = tokUnknown }
),
!,
{ Tokens = [Token | TokList] },
lexer(TokList)
;
[],
{ Tokens = [] }
).
%%%
op_token(tokColon) --> ";".
op_token(tokLParen) --> "(".
op_token(tokRParen) --> ")".
op_token(tokLMusta) --> "{".
op_token(tokRMusta) --> "}".
op_token(tokBackSlash) --> "\\".
op_token(tokImpl) --> "->".
op_token(tokPlus) --> "+".
op_token(tokMinus) --> "-".
op_token(tokTimes) --> "*".
op_token(tokEqual) --> "=".
op_token(tokLt) --> "<".
op_token(tokGt) --> ">".
op_token(tokUnderscore) --> "_".
op_token(tokPeriod) --> ".".
op_token(tokSlash) --> "/".
op_token(tokComma) --> ",".
op_token(tokSemicolon) --> ";".
由Guy Coder编辑
我使用问题中发布的示例数据将其运行到一个列表中,其中列表中的每个项目都是数据中转换为字符代码的一行。然后在列表中的每个项目上使用time/1调用lexer,并对该列表重复测试10000次。将数据加载到列表中并在time/1之前转换为字符代码的原因是为了使这些进程不会使结果倾斜。将这些运行中的每一次重复5次以获取数据的一致性。
在下面的以下运行中,针对所有不同版本的词法分析器均进行了扩展,以涵盖所有7位ASCII字符,这大大增加了特殊字符的字符数。
用于以下目的的Prolog版本是SWI-Prolog 8.0。
对于问题中的版本。
Version: 1
:- set_prolog_flag(double_quotes,chars).
% 694,080,002 inferences, 151.141 CPU in 151.394 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 150.813 CPU in 151.059 seconds (100% CPU, 4602271 Lips)
% 694,080,001 inferences, 152.063 CPU in 152.326 seconds (100% CPU, 4564439 Lips)
% 694,080,001 inferences, 151.141 CPU in 151.334 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 151.875 CPU in 152.139 seconds (100% CPU, 4570074 Lips)
对于以上在此答案中发布的版本
Version: 2
:- set_prolog_flag(double_quotes,chars).
% 773,260,002 inferences, 77.469 CPU in 77.543 seconds (100% CPU, 9981573 Lips)
% 773,260,001 inferences, 77.344 CPU in 77.560 seconds (100% CPU, 9997705 Lips)
% 773,260,001 inferences, 77.406 CPU in 77.629 seconds (100% CPU, 9989633 Lips)
% 773,260,001 inferences, 77.891 CPU in 77.967 seconds (100% CPU, 9927511 Lips)
% 773,260,001 inferences, 78.422 CPU in 78.644 seconds (100% CPU, 9860259 Lips)
通过使用版本1中的索引,版本2进行了重大改进。
在对代码进行进一步研究时,使用DCG来查看
op_token
,它是listing/1并具有两个用于隐式传递状态表示的隐藏变量,结果显示:op_token(tokUnderscore,['_'|A], A).
注意,第一个参数不是要搜索的字符,在此answer中,索引代码写为
c_digit(0'0,0).
其中第一个参数是要搜索的字符,第二个参数是结果。
所以改变这个
op_token(Token), !
对此
[S], { special_character_indexed(S,Token) }
带有索引子句为
special_character_indexed( ';' ,tokSemicolon).
版本:3
:- set_prolog_flag(double_quotes,chars).
% 765,800,002 inferences, 74.125 CPU in 74.348 seconds (100% CPU, 10331197 Lips)
% 765,800,001 inferences, 74.766 CPU in 74.958 seconds (100% CPU, 10242675 Lips)
% 765,800,001 inferences, 74.734 CPU in 74.943 seconds (100% CPU, 10246958 Lips)
% 765,800,001 inferences, 74.828 CPU in 75.036 seconds (100% CPU, 10234120 Lips)
% 765,800,001 inferences, 74.547 CPU in 74.625 seconds (100% CPU, 10272731 Lips)
第3版比第2版的效果略好一些,但始终如一。
最后,正如AntonDanilov的评论中所述,将double_quotes标志更改为
atom
Version: 4
:- set_prolog_flag(double_quotes,atom).
% 765,800,003 inferences, 84.234 CPU in 84.539 seconds (100% CPU, 9091300 Lips)
% 765,800,001 inferences, 74.797 CPU in 74.930 seconds (100% CPU, 10238396 Lips)
% 765,800,001 inferences, 75.125 CPU in 75.303 seconds (100% CPU, 10193677 Lips)
% 765,800,001 inferences, 75.078 CPU in 75.218 seconds (100% CPU, 10200042 Lips)
% 765,800,001 inferences, 75.031 CPU in 75.281 seconds (100% CPU, 10206414 Lips)
版本4与版本3几乎相同。
仅查看CPU编号,使用索引的速度就更快,例如(版本:1)
151.875
与(版本:3)74.547
关于performance - 如何提高词汇处理效率?,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/54259696/