Problem description
I'm trying to get the mean length of FASTA sequences using Erlang. A FASTA file looks like this:
>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCTCGTACGC
>title2
ATCGATCGCATCGATGCTACGATCTCGTACGC
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
>title3
ATCGATCGCATCGAT(...)
I tried to answer this question using the following Erlang code:
-module(golf).
-export([test/0]).

line([], {Sequences, Total}) -> {Sequences, Total};
line(">" ++ Rest, {Sequences, Total}) -> {Sequences + 1, Total};
line(L, {Sequences, Total}) -> {Sequences, Total + string:len(string:strip(L))}.

scanLines(S, Sequences, Total) ->
    case io:get_line(S, '') of
        eof -> {Sequences, Total};
        {error, _} -> {Sequences, Total};
        Line ->
            {S2, T2} = line(Line, {Sequences, Total}),
            scanLines(S, S2, T2)
    end.

test() ->
    {Sequences, Total} = scanLines(standard_io, 0, 0),
    io:format("~p\n", [Total / (1.0 * Sequences)]),
    halt().
Compilation/Execution:
erlc golf.erl
erl -noshell -s golf test < sequence.fasta
563.16
This code seems to work fine for a small FASTA file, but it takes hours to parse a larger one (>100 MB). Why? I'm an Erlang newbie; can you please improve this code?
Solution
If you need really fast IO, you have to do a little more trickery than usual.
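-module(g).
-export([s/0]).

s() ->
    %% Open stdin (fd 0) as a port that delivers binary lines of up to 256 bytes.
    P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
    r(P, 0, 0),
    halt().

%% C counts header lines, L accumulates sequence bytes.
r(P, C, L) ->
    receive
        %% A line starting with '>' is a header: one more sequence.
        {P, {data, {eol, <<$>:8, _/binary>>}}} ->
            r(P, C + 1, L);
        %% Any other complete line is sequence data.
        {P, {data, {eol, Line}}} ->
            r(P, C, L + size(Line));
        %% The port has closed (end of input): print the mean.
        {'EXIT', P, normal} ->
            io:format("~p~n", [L / C])
    end.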
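One detail of the {line, 256} option is worth spelling out: a line longer than 256 bytes is delivered in several messages, tagged noeol for every fragment except the last, and the receive loop above only matches eol. The following is a minimal sketch of how the counting could be made robust to such fragments. The module name fasta_fold, the functions step/2 and mean/1, and the start/header/seq modes are all hypothetical names introduced here for illustration; they are not part of the original answer.

```erlang
-module(fasta_fold).
-export([step/2, mean/1]).

%% State is {Count, Total, Mode}. Mode is 'start' at the beginning of a line,
%% 'header' inside a long header line, 'seq' inside a long sequence line.
%% (All names here are illustrative assumptions.)

%% A header line, either complete or only its first fragment.
step({eol,   <<$>, _/binary>>}, {C, L, start}) -> {C + 1, L, start};
step({noeol, <<$>, _/binary>>}, {C, L, start}) -> {C + 1, L, header};
%% Remaining fragments of a long header are skipped.
step({eol,   _}, {C, L, header}) -> {C, L, start};
step({noeol, _}, {C, L, header}) -> {C, L, header};
%% Everything else is sequence data, whole or fragmented.
step({eol,   Line}, {C, L, _}) -> {C, L + byte_size(Line), start};
step({noeol, Line}, {C, L, _}) -> {C, L + byte_size(Line), seq}.

%% Mean sequence length, or 'undefined' when no header was seen.
mean({0, _, _}) -> undefined;
mean({C, L, _}) -> L / C.
```

In the receive loop, each {P, {data, Data}} message would be passed as step(Data, State), starting from {0, 0, start}. Keeping the counting logic in a pure function also makes it easy to check without a port.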
This port-based approach is the fastest IO I know of, but note the -noshell -noinput flags. Compile just like erlc +native +"{hipe, [o3]}" g.erl but with -smp disable:
erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl
and run:
time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s < uniprot_sprot.fasta
352.6697028442464
real 0m3.241s
user 0m3.060s
sys 0m0.124s
With -smp enabled, but still native, it takes:
$ erlc +native +"{hipe, [o3]}" g.erl
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464
real 0m5.103s
user 0m4.944s
sys 0m0.112s
Byte code, but with -smp disable (almost on par with native because most of the work is done in the port!):
$ erlc g.erl
$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464
real 0m3.565s
user 0m3.436s
sys 0m0.104s
Just for completeness, byte code with SMP:
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464
real 0m5.433s
user 0m5.236s
sys 0m0.128s
For comparison, sarnold's version gives me the wrong answer and takes longer on the same hardware:
$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files golf.erl
./golf.erl:5: Warning: variable 'Rest' is unused
$ time erl -smp disable -noshell -mode minimal -s golf test
359.04679841439776
real 0m17.569s
user 0m16.749s
sys 0m0.664s
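A plausible reason for the differing answer, assuming golf.erl still uses the string:strip(L) call from the question: string:strip/1 removes only spaces, while io:get_line/2 returns each line with its trailing newline, so every sequence line is counted one character too long. The sketch below illustrates the difference; the module name strip_check and the function lengths/1 are hypothetical.

```erlang
-module(strip_check).
-export([lengths/1]).

%% Returns {LengthWithDefaultStrip, LengthWithNewlineStripped} for one line.
%% string:strip/1 strips spaces only; string:strip/3 strips the given character.
%% Both belong to the older string API and may trigger deprecation warnings.
lengths(Line) ->
    {string:len(string:strip(Line)),
     string:len(string:strip(Line, right, $\n))}.
```

For example, strip_check:lengths("ACGT\n") returns {5, 4}. The two reported means differ by about 6.4, which would fit one extra character per sequence line if entries average a little over six lines each, but that is an inference rather than something measured here.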
EDIT: I have looked at the characteristics of uniprot_sprot.fasta and I'm a little bit surprised. It has 3824397 rows and is 232MB. That means the -smp disable version can handle 1.18 million text lines per second (71MB/s in line-oriented IO).
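The throughput figures follow directly from the file size and the 3.241 s wall-clock time of the native -smp disable run. A small sketch of the arithmetic, with a hypothetical module throughput and function rates/3:

```erlang
-module(throughput).
-export([rates/3]).

%% Returns {LinesPerSecond, MegabytesPerSecond} for a run that processed
%% Lines lines and MegaBytes megabytes in Seconds of wall-clock time.
rates(Lines, MegaBytes, Seconds) ->
    {Lines / Seconds, MegaBytes / Seconds}.
```

Calling throughput:rates(3824397, 232, 3.241) gives roughly 1.18 million lines per second and a little over 71 MB/s.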