问题描述
我正在解析由一系列行组成的相当简单的文件格式,每行都有一些空格分隔的字段,如下所示:
l 0x9823 1s 0x1111 30x1111 12⋮
我正在使用 SWI-Prolog.这是我目前拥有的 DCG:
:- 咨询(library(pure_input)).load_trace(文件名,跟踪):-短语来自文件(trace_file_phrase(跟踪),文件名).trace_file_phrase([]) -->[].trace_file_phrase([T|Ts]) -->trace_phrase(T)、trace_file_phrase(Ts).trace_phrase(access(Type, Address,SinceLast)) -->访问类型(类型),空间,地址(地址),空间,nat(SinceLast), 换行符.access_type(load) -->我".access_type(store) -->s".地址(号码)-->0x",十六进制(数字).hexdigit(N) -->位数(N).hexdigit(10) -->一种".hexdigit(11) -->乙".hexdigit(12) -->C".hexdigit(13) -->d".hexdigit(14) -->电子".hexdigit(15) -->F".十六进制(N) -->十六进制数字(D),十六进制(D,N).hexnum(N, N) -->[].hexnum(A, N) -->hexdigit(D), { A1 是 A*16 + D }, hexnum(A1, N).换行符 -->\n".空间 -->"".%% 以下两部作品由 Lars Mans 提供%% https://stackoverflow.com/questions/3279822/parsing-numbers-with-multiple-digits-in-prolog数字(0) -->0".数字(1) -->1".数字(2) -->2".数字(3) -->3".数字(4) -->4".数字(5) -->5".数字(6) -->6".数字(7) -->7".数字(8) -->8".数字(9) -->9".nat(N) -->数字(D),nat(D,N).nat(N,N) -->[].nat(A,N) -->digit(D), { A1 是 A*10 + D }, nat(A1, N).
正如评论中提到的,我从 解析数字中抄袭了数字处理Prolog 中的多个数字.
我遇到的问题是其中一些文件很大,例如大约 5-10 MB.SWI-Prolog 中的默认堆栈对此是不够的,解析这些文件需要花费大量时间,大约为 5-15 秒.我有几个关于这种情况的问题:
- 这段代码的效率问题在哪里?我认为是在
trace_file_phrase//1
或nat//1
中,但这些只是直觉. - 如果问题是列表,有没有比这更好的方法来处理带有 DCG 的列表?
- 一般来说,如何诊断和处理此类 DCG 的性能问题?
作为一般性评论,您可以在 library(pio)
.干净地使用它的方法也是:
:- use_module(library(pio)).
你的例子太复杂了,所以我只会考虑一个稍微简单的例子,一个换行符分隔的数字列表:
nats([]) --> [].nats([N|Ns]) --> nat(N),换行符,nats(Ns).那么,您如何有效地进行测试?(这是您的问题 3)library(pio)
的基本点是您可以使用常规 DCG 进行文件处理.但是对于小型测试,您仍然可以使用简单的 phrase/2
.所以我这样做:
你看到 ;
提示了吗?这意味着:Prolog 无法决定是否可以计算进一步的答案 - 因此它留下了一个或多个选择点.而且只有一位数你可以想象事情会如何堆积.
让我们深入挖掘:
?- 短语(数字(D),1").D = 1 ;错误的.又是邪恶的;
!为了使这项工作发挥作用,一切都必须是确定的.有以下三种方法:
使用削减(并失去你的灵魂)
祝你好运 - 最好的似乎是在重复元素之后:
trace_file_phrase([]) --> [].trace_file_phrase([T|Ts]) -->trace_phrase(T),!,%丑陋,但是......trace_file_phrase(Ts).(这应该回答问题 1)
但是,请稍等!这个 !
有什么不好?只要对trace_phrase//1
有完全一个答案,事情就是完美的.只是,如果有更多的答案(或实际的解决方案),切割可能会删除宝贵的答案.你怎么知道,如果有更多的解决方案?好吧,你没有.你不会看到它们,因为它们已经被切掉了.
call_semidet/1
这是一种确保不会发生这种情况的方法.这仅适用于可以调用两次而没有任何效果的无副作用目标:
call_semidet(目标):-( call_nth(Goal, 2)-> 抛出(错误(模式错误(半场,目标),_));一次(目标)).这使用 call_nth/2
,在另一篇文章中定义.(作为一种优化,当没有选择点打开时,实现可以避免调用 Goal
两次......)只是为了说明它是如何工作的:
所以它可以有效地确定你的小语法!因此无需重新制定任何内容!
现在缺少的是将其整合到语法中.您可以非常低级地执行此操作,或者使用 库(lambda).
短语半程(NT)-->呼叫(S0^S^call_semidet(短语(NT,S0,S))).请注意,在这种非常特殊的情况下,我们不使用 \
进行重命名.
利用索引
最后,一种非常费力但干净的方法是重写所有内容以更好地从索引中获利(并且可能有助于改善一般的索引......)但这是一条漫长的道路.刚开始:
数字(D) --> [C],{c_digit(C,D)}.c_digit(0'0,0).c_digit(0'1,1).c_digit(0'2,2).c_digit(0'3,3).c_digit(0'4,4).c_digit(0'5,5).c_digit(0'6,6).c_digit(0'7,7).c_digit(0'8,8).c_digit(0'9,9).现在给你:
?- 短语(数字(D),1").D = 1.但是您还有另一个不确定性来源,这与您定义语法的方式有关.在 nat//2
中你可以看到:
第一条规则始终适用,即 "1234\n"
将被解析为 "1"
"12"
>"123"
"1234"
只有下面的 newline//0
意识到它就足够了最后一个 —然后坚持下去.
你可以为此重写一些东西,但是代码不再是你喜欢的那种纯粹的小规范,不是吗?好吧,也许未来情况会有所改善.
例如SWI 中的索引比以前好得多,也许这里的情况也在发展......
library(pio)
的目的是让这个过程开始.将此与 Haskell 进行比较我们离 interact
效率还很远!但是没有固有成本:
和 grep 一样高效 —空间方面.也就是说,它在恒定空间中运行.所以希望未来更多的代码能够运行得更好.
顺便说一句,library(pio)
确实已经对效率产生了影响:GC 阶段得到了显着改进,与 Wadler 修复一些空间泄漏的方式非常相似.四分之一世纪前的纸.事情发展...
I'm parsing a fairly simple file format consisting of a series of lines, each line having some space separated fields, that looks like this:
l 0x9823 1
s 0x1111 3
l 0x1111 12
⋮
I'm using SWI-Prolog. This is the DCG I have so far:
:- consult(library(pure_input)).
load_trace(Filename, Traces) :-
phrase_from_file(trace_file_phrase(Traces), Filename).
trace_file_phrase([]) --> [].
trace_file_phrase([T|Ts]) --> trace_phrase(T), trace_file_phrase(Ts).
trace_phrase(access(Type, Address, SinceLast)) -->
access_type(Type), space,
address(Address), space,
nat(SinceLast), newline.
access_type(load) --> "l".
access_type(store) --> "s".
address(Number) --> "0x", hexnum(Number).
hexdigit(N) --> digit(N).
hexdigit(10) --> "a". hexdigit(11) --> "b". hexdigit(12) --> "c".
hexdigit(13) --> "d". hexdigit(14) --> "e". hexdigit(15) --> "f".
hexnum(N) --> hexdigit(D), hexnum(D, N).
hexnum(N, N) --> [].
hexnum(A, N) --> hexdigit(D), { A1 is A*16 + D }, hexnum(A1, N).
newline --> "\n".
space --> " ".
%% the following two productions are courtesy of Lars Mans at
%% https://stackoverflow.com/questions/3279822/parsing-numbers-with-multiple-digits-in-prolog
digit(0) --> "0". digit(1) --> "1". digit(2) --> "2".
digit(3) --> "3". digit(4) --> "4". digit(5) --> "5".
digit(6) --> "6". digit(7) --> "7". digit(8) --> "8".
digit(9) --> "9".
nat(N) --> digit(D), nat(D,N).
nat(N,N) --> [].
nat(A,N) --> digit(D), { A1 is A*10 + D }, nat(A1, N).
As mentioned in the comment, I cribbed the number handling from Parsing numbers with multiple digits in Prolog.
The problem I'm running into is some of these files are large, like, on the order of 5-10 MB. The default stack in SWI-Prolog is insufficient for this, and parsing these files is taking substantial time, on the order of 5-15 seconds. I have several questions about this situation:
- Where is the efficiency problem in this code? I think it's either in
trace_file_phrase//1
ornat//1
but these are just hunches. - If the problem is lists, is there a better way to handle lists with DCGs than this?
- How does one, in general, diagnose and treat performance problems with DCGs such as this?
As a general remark you will find more on SO about it under the name library(pio)
. Also the way to use it cleanly is rather:
:- use_module(library(pio)).
Your example is way too complex, so I will only consider a slightly simpler case, a newline separated list of numbers:
nats([]) --> []. nats([N|Ns]) --> nat(N), newline, nats(Ns).
So, how can you test this effectively? (That's your Question 3) The basic point of library(pio)
is that you can use regular DCGs for file processing. But for testing in the small you can still use the simple phrase/2
. So I do:
?- phrase(nats(Ns),"1\n"). Ns = [1] ; false.
Did you see the ;
prompt? That means: Prolog was not able to decide whether or not further answers might be computed - so it leaves one or more choice-points open. And that only for a single digit You can imagine how things will pile up.
Let's dig deeper:
?- phrase(digit(D),"1"). D = 1 ; false.
Again the evil ;
! In order to make this work, everything would have to be determinate. There are three ways to this:
Use cuts (and lose your soul)
I wish you luck - the best seems to be just after the repeating element:
trace_file_phrase([]) --> []. trace_file_phrase([T|Ts]) --> trace_phrase(T), !, % ugly, but... trace_file_phrase(Ts).
(This should answer Question 1)
But, hold a minute! What is so bad about this !
? As long, as there is exactly one answer to trace_phrase//1
things are perfect. It is only, if there are more answers (or actually solutions), that the cut might remove precious answers. How do you know, if there are more solutions? Well, you don't. And you won't see them as they have been cut away already.
call_semidet/1
Here is a way to ensure that this does not happen. This works only for side-effect free goals that can be called twice without any effect:
call_semidet(Goal) :- ( call_nth(Goal, 2) -> throw(error(mode_error(semidet,Goal),_)) ; once(Goal) ).
This uses call_nth/2
, as defined in another post. (As an optimization, the implementation could avoid calling Goal
twice when there is no choice-point open...) Just to make clear, how it works:
?- phrase(nat(N),"1234"). N = 1234 ; false. ?- call_semidet(phrase(nat(N),"1234")). N = 1234. ?- call_semidet((X=1;X=2)). ERROR: Unknown error term: mode_error(semidet, (2=1;2=2))
So it makes your little grammar effectively determinate! There is thus no need to reformulate anything!
What is lacking now is some integration of this into the grammar. You can do this very low-level, or rather cleanly using library(lambda)
.
phrase_semidet(NT) --> call(S0^S^call_semidet(phrase(NT,S0,S))).
Note that in this very particular case we do not use the \
for renaming.
trace_file_phrase([]) --> []. trace_file_phrase([T|Ts]) --> phrase_semidet(trace_phrase(T)), trace_file_phrase(Ts).
Exploit indexing
Finally, a very laborious but clean way would be to rewrite everything to profit better from indexing (and maybe help to improve indexing in general...) But this is a long road. Just to start with:
digit(D) --> [C], {c_digit(C,D)}. c_digit(0'0,0). c_digit(0'1,1). c_digit(0'2,2). c_digit(0'3,3). c_digit(0'4,4). c_digit(0'5,5). c_digit(0'6,6). c_digit(0'7,7). c_digit(0'8,8). c_digit(0'9,9).
This gives you now:
?- phrase(digit(D),"1"). D = 1.
But you have another source of nondeterminism which is rather due to the way you define the grammar. In nat//2
you see it:
nat(N,N) --> []. nat(A,N) --> digit(D), ... .
The first rule always applies, that is, "1234\n"
will be parsed as "1"
"12"
"123"
"1234"
only the following newline//0
realizes that it would suffice go for the last — and then stick to it.
You can rewrite things for that, but then, the code is no longer the pure little spec you liked, isn't it? Well, maybe things might improve in the future.
E.g. indexing is much better in SWI than it used to be, maybe also things evolve here too....
The intention of library(pio)
was to get this process started. Compare this to Haskell — we are far away from interact
efficiency-wise! But there is no inherent cost:
... --> [] | [_], ... . ?- phrase_from_file((...,"searchstring",...),fichier).
is just as efficient as grep — space-wise. That is, it runs in constant space. So hopefully more code will run better in the future.
Edit: BTW, library(pio)
did already have an impact efficiency-wise: GC phases were improved significantly, very much in the same manner as Wadler's Fixing some space leak – paper a quarter century ago. Things evolve ...
这篇关于Prolog DCG 语法规则中的堆栈溢出:如何有效或懒惰地处理大列表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!