I have a large file that I want to reformat. Sample input:

DVL1    03220   NP_004412.2 VANGL2  02758   Q9ULK5  in vitro    12490194
PAX3    09421   NP_852124.1 MEOX2   02760   NP_005915.2 in vitro;yeast 2-hybrid 11423130
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254.1  in vitro;in vivo    15195140


This is what I want it to look like:

DVL1    03220   NP_004412   VANGL2  02758   Q9ULK5
PAX3    09421   NP_852124   MEOX2   02760   NP_005915
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254


To summarize:


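If the line has 1 dot, remove the dot and the number after it and add a \t, so the output line has only 6 tab-separated values
If the line has 2 dots, remove both dots and the numbers after them and add a \t, so the output line has only 6 tab-separated values
If the line has no dots, keep the first 6 tab-separated values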


This is what I have so far:

for line in infile:
    if "." in line: # thought about this and a line.count('.') might be better, just wasn't capable to make it work
        transformed_line = line.replace('.', '\t', 2) # only replaces the dot; want to replace dot plus next first character
        columns = transformed_line.split('\t')
        outfile.write('\t'.join(columns[:8]) + '\n') # if i had a way to know the position of the dot(s), i could join only the desired columns
    else:
        columns = line.split('\t')
        outfile.write('\t'.join(columns[:5]) + '\n') # this is fine


I hope I have explained myself clearly.
Thanks for your help.

Best answer

You could try something like this:

    with open('data1.txt') as f:
        for line in f:
            line=line.split()[:6]  # keep the first six whitespace-separated fields
            # if a field contains '.', drop the dot and everything after it;
            # otherwise keep the field as it is
            line=map(lambda x:x[:x.index('.')] if '.' in x else x,line)
            print('\t'.join(line))


Output:

DVL1    03220   NP_004412   VANGL2  02758   Q9ULK5
PAX3    09421   NP_852124   MEOX2   02760   NP_005915
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254


Edit:

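As @mgilson suggested, the line line=map(lambda x:x[:x.index('.')] if '.' in x else x,line) can be replaced with the simpler line=map(lambda x:x.split('.')[0],line), since split('.')[0] returns the whole string when it contains no dot.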
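Putting the pieces together, here is a minimal sketch of the split('.')[0] approach that writes to an output file instead of printing. The helper names strip_versions and convert are made up for illustration, and the in-memory io.StringIO objects stand in for the real input and output files, which the question does not name.

```python
import io

def strip_versions(line):
    """Keep the first six whitespace-separated fields, dropping any '.N' suffix."""
    fields = line.split()[:6]
    # split('.')[0] is the text before the first dot,
    # or the whole field when it contains no dot
    return '\t'.join(field.split('.')[0] for field in fields)

def convert(infile, outfile):
    """Write the cleaned version of every non-blank line of infile to outfile."""
    for line in infile:
        if line.strip():  # skip blank lines
            outfile.write(strip_versions(line) + '\n')

# Demo using in-memory files (assumption: the real files are opened with open())
sample = (
    "DVL1\t03220\tNP_004412.2\tVANGL2\t02758\tQ9ULK5\tin vitro\t12490194\n"
    "PAX3\t09421\tNP_852124.1\tMEOX2\t02760\tNP_005915.2\tin vitro;yeast 2-hybrid\t11423130\n"
)
out = io.StringIO()
convert(io.StringIO(sample), out)
print(out.getvalue(), end='')
```

Because only the first six fields are kept, it does not matter that split() also breaks up later columns such as "in vitro" on their internal spaces. If the first six columns could themselves contain spaces, line.split('\t') would probably be the safer choice.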

This is based on a similar question on Stack Overflow, "python - Distinguishing lines with one dot from lines with two dots using Python": https://stackoverflow.com/questions/11474528/
