Peeking into a Popen pipe stream in Python

Question

Background:
Python 2.6.6 on Linux. This is the first part of a DNA sequence analysis pipeline.
I want to read a possibly gzipped file from mounted remote storage (LAN). If it is gzipped, I want to gunzip it to a stream (i.e. using gunzip FILENAME -c). If the first character of the stream is "@", the entire stream should be routed into a filtering program that takes its input on standard input; otherwise it should be piped directly to a file on local disk. I'd like to minimize the number of file reads/seeks from remote storage (surely a single pass through the file should be possible?).

Contents of an example input file; the first four lines correspond to one record in FASTQ format:

@I328_1_FC30MD2AAXX:8:1:1719:1113/1
GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG
+I328_1_FC30MD2AAXX:8:1:1719:1113/1
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhahhhhhhfShhhYhhQhh]hhhhffhU\UhYWc

Files that should not be piped into the filtering program contain records that look like this (the first two lines correspond to one record in FASTA format):

>I328_1_FC30MD2AAXX:8:1:1719:1113/1
GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG

Here is some made-up semi-pseudocode to visualize what I want to do (I know this isn't possible the way I've written it). I hope it makes some sense:

if gzipped:
    gunzip = Popen(["gunzip", "-c", "remotestorage/file.gz"], stdout=PIPE)
    if gunzip.stdout.peek(1) == "@": # This isn't possible
        fastq = True
    else:
        fastq = False
if fastq:
    filter = Popen(["filter", "localstorage/outputfile.fastq"], stdin=gunzip.stdout).communicate()
else:
    # Send the gunzipped stream to another file

Disregard the fact that the code won't run as I've written it here and that it has no error handling etc.; all of that is already in my other code. I just want help with peeking into the stream or finding a way around that. It would be great if you could call gunzip.stdout.peek(1), but I realize that's not possible.

What I've tried so far:
I figured subprocess.Popen might help me achieve this, and I've tried a lot of different ideas, among them using some kind of io.BufferedRandom() object to write the stream to, but I can't figure out how that would work. I know streams are non-seekable, but a workaround might be to read the first character of the gunzip stream and then create a new stream into which you first write a "@" or ">", depending on the file contents, and then stuff the rest of the gunzip.stdout stream. This new stream would then be fed into the stdin of the filter's Popen.

Note that the files might be several times larger than available memory. I do not want to read the source file from remote storage more than once, and I want to avoid any unnecessary file access.

Any ideas are welcome! Please ask me questions so I can clarify anything I didn't make clear enough.

Recommended answer

Here is an implementation of your proposal to "first input a "@" or ">" depending on file contents and then stuff the rest of the gunzip.stdout-stream into the new stream". I only tested the local-file branch, but it should be enough to demonstrate the concept.

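from subprocess import Popen, PIPE

# `gzipped` is assumed to have been determined earlier by the surrounding code.
# bufsize=0 keeps the pipe unbuffered so that read(1) consumes exactly one
# byte; this is the default on Python 2 but not on Python 3.
if gzipped:
    source = Popen(["gunzip", "-c", "remotestorage/file.gz"],
                   stdout=PIPE, bufsize=0)
else:
    source = Popen(["cat", "remotestorage/file"], stdout=PIPE, bufsize=0)
firstchar = source.stdout.read(1)
# "unread" the char we've just read: printf re-emits it, cat forwards the rest
# (the \x escape relies on the shell's printf supporting hex escapes)
source = Popen([r"(printf '\x%02x' && cat)" % ord(firstchar)],
               shell=True, stdin=source.stdout, stdout=PIPE)

# Now feed the output to a filter or to a local file.
flocal = None
try:
    if firstchar == b"@":
        proc = Popen(["filter", "localstorage/outputfile.fastq"],
                     stdin=source.stdout)
    else:
        flocal = open('localstorage/outputfile.stream', 'wb')
        proc = Popen(["cat"], stdin=source.stdout, stdout=flocal)
    proc.communicate()
finally:
    if flocal is not None:
        flocal.close()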

The idea is to read a single character from the source command's output and then recreate the original output using (printf '\xhh' && cat), which effectively implements the peek. The replacement stream passes shell=True to Popen, leaving the heavy lifting to the shell and cat. The data stays in the pipeline at all times and is never read into memory in its entirety. Note that the shell is only used for the single Popen call that implements unreading the peeked byte, not for the calls that involve user-supplied file names. Even there, the byte is escaped to hex to make sure that the shell does not mangle it when invoking printf.
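For comparison, here is a minimal sketch of a shell-free variation on the same idea. It is not the approach used above: Python writes the peeked byte back itself and then copies the rest of the stream in chunks, so memory use stays bounded but the data does pass through the Python process. The names route_stream, fastq_cmd and fallback_path are hypothetical, and the sketch assumes Python 3.

```python
import shutil
import subprocess
import sys


def route_stream(stream, fastq_cmd, fallback_path):
    """Peek at the first byte of a binary stream and route the whole stream.

    If the first byte is "@", the stream is fed to the stdin of fastq_cmd
    (an argument list for Popen); otherwise it is written to fallback_path.
    Returns the peeked byte (b"" for an empty stream).
    """
    first = stream.read(1)
    if first == b"@":
        proc = subprocess.Popen(fastq_cmd, stdin=subprocess.PIPE)
        try:
            proc.stdin.write(first)                  # put the peeked byte back
            shutil.copyfileobj(stream, proc.stdin)   # chunked copy of the rest
        finally:
            proc.stdin.close()                       # signal EOF to the child
        proc.wait()
    else:
        with open(fallback_path, "wb") as out:
            out.write(first)
            shutil.copyfileobj(stream, out)
    return first
```

With the commands from the answer, a call might look like route_stream(source.stdout, ["filter", "localstorage/outputfile.fastq"], "localstorage/outputfile.stream"). Because all reads go through the same Python file object, buffering on source.stdout does not lose data in this variant.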

The code could be cleaned up further by implementing an actual function named peek that returns the peeked contents and a replacement new_source.
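One way such a peek function might look is sketched below. It assumes a POSIX shell with printf and cat available. Two details differ from the snippet in the answer and are my own choices: the byte is read with os.read so that no extra data ends up in a Python-side buffer, and it is re-emitted as an octal escape, which is the escape form POSIX specifies for printf.

```python
import os
from subprocess import PIPE, Popen


def peek(source):
    """Return (first_byte, new_source) for a Popen started with stdout=PIPE.

    new_source is a replacement process whose stdout yields the complete
    original output, including the byte that was peeked.
    """
    # Read straight from the file descriptor so exactly one byte is consumed.
    first = os.read(source.stdout.fileno(), 1)
    if not first:
        return first, source  # empty stream: nothing to put back
    # printf re-emits the peeked byte, then cat forwards the rest of the pipe.
    cmd = "printf '\\%03o' && cat" % ord(first)
    new_source = Popen(cmd, shell=True, stdin=source.stdout, stdout=PIPE)
    source.stdout.close()  # the child now holds the read end of the pipe
    return first, new_source
```

It would be used as firstchar, source = peek(source), after which source.stdout can be handed to the filter or to cat exactly as in the code above.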
