I am trying to edit a file whose format looks like this:


  
    >Cluster 0
    L07510
    >Cluster 1
    AF480591
    AY457083
    >Cluster 2
    M88154
    >Cluster 3
    CP000924
    L09161
    >Cluster 4
    AY742307
    >Cluster 5
    L09163
    L09162
    >Cluster 6
    AF321086
    >Cluster 7
    DQ666175
    >Cluster 8
    DQ288691
  


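I want to write something in Python that goes through each line, stops at the lines that say ">Cluster x" (where x is a number), and then adds that number to every line that follows it. When it reaches a new ">Cluster x" line, it should start again with the new value of x.

So the result would look like this: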


  
    >Cluster 0
    0 L07510
    >Cluster 1
    1 AF480591
    1 AY457083
    >Cluster 2
    2 M88154
    >Cluster 3
    3 CP000924
    3 L09161
    >Cluster 4
    4 AY742307
    >Cluster 5
    5 L09163
    5 L09162
    >Cluster 6
    6 AF321086
    >Cluster 7
    7 DQ666175
    >Cluster 8
    8 DQ288691
  


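I was thinking I could use a regex to search for ">Cluster x" (would the regular expression look something like that?). I'm just not sure how to actually write this. Any help would be much appreciated!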

Best answer

Tested:

#!/usr/bin/env python
# On a POSIX-compliant system, if this script is marked as executable, the
# shebang line above (which must be the first line of the file) makes the
# Python interpreter run it rather than the shell.

# We need the sys module to read arguments from the terminal
import sys

# Prepare a variable for the cluster number to be used within the loop
cluster = ''

# Open the input file; the default mode is 'r' (read-only), which is a safe
# default. The with statement closes the file when the block ends.
with open(sys.argv[1]) as infile:
    # Loop through all lines in the file, using a generator expression that
    # strips surrounding whitespace (including the newline) from each line
    for line in (line.strip() for line in infile):
        if line.startswith('>'):
            # str.split() splits on whitespace by default;
            # we want the cluster number at index 1
            cluster = line.split()[1]

            # output the header line as is
            print(line)

        else:
            # output any other line prefixed with the cluster number
            print(cluster + ' ' + line)


Usage

$ python cluster_format.py input.txt > output.txt
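
Since the question asked specifically about a regex, here is a minimal sketch of the same idea using the re module. It assumes header lines look exactly like ">Cluster N", with whitespace between the word and the number; the names CLUSTER_HEADER and prefix_with_cluster are made up for this example and do not appear in the script above. Unlike that script, it skips blank lines and leaves any line that appears before the first header unchanged.

```python
import re

# Matches a header line such as ">Cluster 12" and captures the number.
# Assumption: headers have exactly this shape, with whitespace before the number.
CLUSTER_HEADER = re.compile(r"^>Cluster\s+(\d+)\s*$")


def prefix_with_cluster(lines):
    """Yield each line, prefixing non-header lines with the current cluster number."""
    cluster = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = CLUSTER_HEADER.match(line)
        if match:
            # Remember the captured number and pass the header through
            cluster = match.group(1)
            yield line
        elif cluster is None:
            # No header seen yet, so there is nothing to prefix with
            yield line
        else:
            yield cluster + " " + line


if __name__ == "__main__":
    sample = [">Cluster 0", "L07510", ">Cluster 1", "AF480591", "AY457083"]
    for out in prefix_with_cluster(sample):
        print(out)
```

The function accepts any iterable of strings, so an open file object can be passed in directly instead of the sample list. For input this regular, the startswith/split approach is just as reliable; the regex mainly helps if lines starting with '>' that are not cluster headers need to be told apart.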
