python - Extracting columns from multiple text files with Python

I have a folder containing 5 text files that pertain to various sites.

The file names are formatted like this:

Rockspring_18_SW.417712.WRFc36.ET.2000-2050.txt

Rockspring_18_SW.417712.WRFc36.RAIN.2000-2050.txt

WICA.399347.WRFc36.ET.2000-2050.txt

WICA.399347.WRFc36.RAIN.2000-2050.txt

So the file names basically follow this format:
(site name).(site number).(WRFc36).(some variable).(2000-2050).txt
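The naming scheme above can be taken apart with a plain `split(".")`. The following is a minimal Python 3 sketch; the `SiteFile` tuple and the `parse_site_filename` helper are illustrative names, not part of the original post, and the check assumes every valid name has exactly six dot-separated parts ending in `txt`.

```python
from typing import NamedTuple, Optional


class SiteFile(NamedTuple):
    """Hypothetical container for the five meaningful fields of a file name."""
    site_name: str
    site_number: str
    model: str
    variable: str
    years: str


def parse_site_filename(fname: str) -> Optional[SiteFile]:
    """Split a name such as 'WICA.399347.WRFc36.ET.2000-2050.txt' into fields.

    Returns None when the name does not have exactly six dot-separated
    parts or does not end in 'txt'.
    """
    parts = fname.split(".")
    if len(parts) != 6 or parts[5] != "txt":
        return None
    return SiteFile(*parts[:5])


info = parse_site_filename("Rockspring_18_SW.417712.WRFc36.ET.2000-2050.txt")
print(info)
```

The site name and site number together form the key that identifies one site, and the `variable` field tells the files for that site apart.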

Each of these text files has the same layout, with no header row: year, month, day, value (about 18,500 rows per file).

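I would like Python to search for matching file names (where the site name and site number are the same), pick out columns 1 through 3 from one of those files, and write them to a new txt file. I would also like to copy the 4th column from each variable file for that site (rain, et, etc.) and write those columns to the new file in a specific order.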

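I know how to use the csv module (defining a new dialect for the space delimiter) to grab data from all the files and print it to a new text file, but I'm not sure how to automatically create a new file for each site name/number, or how to make sure my variables are written in the correct order.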
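For reference, a space-delimited dialect of the kind mentioned above might look like the following Python 3 sketch. The dialect name `space_separated` and the sample rows are made up for illustration, since the post does not show real data. If the columns are padded with runs of spaces, `str.split()` with no argument is a simpler alternative, because it collapses any amount of whitespace.

```python
import csv
import io

# Hypothetical dialect name; the post does not say what the asker called theirs.
csv.register_dialect("space_separated", delimiter=" ")

# Made-up sample rows in the year / month / day / value layout.
sample = io.StringIO("2000 1 1 0.25\n2000 1 2 1.50\n")
rows = list(csv.reader(sample, dialect="space_separated"))
print(rows)

# Alternative for padded columns: split() with no argument treats any run
# of whitespace as a single separator.
padded_line = "2000  1  1   0.25"
fields = padded_line.split()
print(fields)
```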

The output I'd like is one text file per site (not 5), formatted as (year, month, day, variable 1, variable 2, variable 3, variable 4, variable 5), with about 18,500 rows.

I'm sure I'm overlooking something very simple here, since this seems pretty basic, but any help would be greatly appreciated!

UPDATE

========

I have updated the code to reflect the comments below.
http://codepad.org/3mQEM75e

from collections import defaultdict
import glob
import csv

#Create dictionary of lists--   [A] = [Afilename1, Afilename2, Afilename3...]
#                               [B] = [Bfilename1, Bfilename2, Bfilename3...]
def get_site_files():
    sites = defaultdict(list)
    #to start, I have a bunch of files in this format ---
    #"site name(unique)"."site num(unique)"."WRFc36"."Variable(5 for each site name)"."2000-2050"
    for fname in glob.glob("*.txt"):
        #split name at every instance of "."
        parts = fname.split(".")
        #check to make sure i only use the proper files-- having 6 parts to name and having WRFc36 as 3rd part
        if len(parts)==6 and parts[2]=='WRFc36':
            #Make sure site name is the full unique identifier, the first and second "parts"
            sites[parts[0]+"."+parts[1]].append(fname)
    return sites

#hardcode the variables for method 2, below
Var=["TAVE","RAIN","SMOIS_INST","ET","SFROFF"]

def main():
    for site_name, files in get_site_files().iteritems():
        print "Working on *****"+site_name+"*****"
####Method 1- I'd like to not hardcode in my variables (as in method 2), so I can use this script in other applications.
        for filename in files:
            reader = csv.reader(open(filename, "rb"))
            WriteFile = csv.writer(open("XX_"+site_name+"_combined.txt","wb"))
            for row in reader:
                row = reader.next()
####Method 2 works (mostly), but skips a LOT of random lines of first file, and doesn't utilize the functionality built into my dictionary of lists...
##        reader0 = csv.reader(open(site_name+".WRFc36."+Var[0]+".2000-2050.txt", "rb"))    #I'd like to copy ALL columns from the first file
##        reader1 = csv.reader(open(site_name+".WRFc36."+Var[1]+".2000-2050.txt", "rb"))    #    and just the fourth column from all the rest of the files
##        reader2 = csv.reader(open(site_name+".WRFc36."+Var[2]+".2000-2050.txt", "rb"))    #    (the columns 1-3 are the same for all files)
##        reader3 = csv.reader(open(site_name+".WRFc36."+Var[3]+".2000-2050.txt", "rb"))
##        reader4 = csv.reader(open(site_name+".WRFc36."+Var[4]+".2000-2050.txt", "rb"))
##        WriteFile = csv.writer(open("XX_"+site_name+"_COMBINED.txt", "wb"))               #creates new command to write a text file
##
##        for row in reader0:
##            row  = reader0.next()
##            row1 = reader1.next()
##            row2 = reader2.next()
##            row3 = reader3.next()
##            row4 = reader4.next()
##            WriteFile.writerow(row + row1 + row2 + row3 + row4)
##        print "***finished with site***"

if __name__=="__main__":
    main()
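A likely cause of the skipped lines noted in Method 2: the `for row in reader` loop already advances the reader by one row on each pass, and calling `next()` on that same reader inside the loop advances it again, so every other row of the first file is dropped. The Python 3 sketch below reproduces this with made-up in-memory data (the `make_reader` helper and the sample values are illustrative only) and shows `zip` as one way to step several readers in lockstep.

```python
import csv
import io


def make_reader(text):
    """Hypothetical helper: wrap a string in a space-delimited csv reader."""
    return csv.reader(io.StringIO(text), delimiter=" ")


# Made-up data standing in for two variable files of the same site.
first = "2000 1 1 10\n2000 1 2 11\n2000 1 3 12\n2000 1 4 13\n"
second = "2000 1 1 0.1\n2000 1 2 0.2\n2000 1 3 0.3\n2000 1 4 0.4\n"

# Pattern from the question: the loop fetches a row, then next() fetches
# another one and overwrites it, so only rows 2 and 4 survive.
skipping = []
reader = make_reader(first)
for row in reader:
    row = next(reader)
    skipping.append(row)

# zip() pulls exactly one row from each reader per pass, keeping them aligned.
combined = []
for row_a, row_b in zip(make_reader(first), make_reader(second)):
    combined.append(row_a + row_b[3:])

print(skipping)
print(combined)
```

Note that `zip` stops at the shortest input, so this assumes every file for a site has the same number of rows.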

Best answer

Here is a simpler way to iterate over the files, grouped by site.

from collections import defaultdict
import glob

def get_site_files():
    sites = defaultdict(list)
    for fname in glob.glob('*.txt'):
        parts = fname.split('.')
        if len(parts)==6 and parts[2]=='WRFc36':
            sites[parts[0]].append(fname)
    return sites

def main():
    for site,files in get_site_files().iteritems():
        # you need to better explain what you are trying to do here!
        print site, files

if __name__=="__main__":
    main()

I still don't understand the cutting and pasting of columns; you need to explain more clearly what you are trying to accomplish.
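If the goal is what the question describes (one output file per site, with year, month, and day followed by the fourth column of each variable file), one possible reading can be sketched in Python 3 as follows. This is not part of the original answer. The function names, the `XX_..._combined.txt` output name borrowed from the question's code, and the optional `order` parameter are assumptions, and the sketch assumes every file for a site has the same dates in the same row order.

```python
import glob
import os
from collections import defaultdict


def group_files_by_site(folder):
    """Map 'name.number' -> {variable: path} for files matching the scheme."""
    sites = defaultdict(dict)
    for path in glob.glob(os.path.join(folder, "*.txt")):
        parts = os.path.basename(path).split(".")
        if len(parts) == 6 and parts[2] == "WRFc36":
            sites[parts[0] + "." + parts[1]][parts[3]] = path
    return sites


def read_rows(path):
    """Read a whitespace-separated file into a list of string fields per row."""
    with open(path) as handle:
        return [line.split() for line in handle if line.strip()]


def combine_site(variable_paths, out_path, order=None):
    """Write year, month, day plus one value column per variable.

    `order` is an assumed parameter: a list of variable names giving the
    column order. Without it, variables are written in alphabetical order.
    """
    variables = order if order is not None else sorted(variable_paths)
    tables = [read_rows(variable_paths[name]) for name in variables]
    with open(out_path, "w") as out:
        # zip(*tables) yields the n-th row of every variable file together.
        for rows in zip(*tables):
            dates = rows[0][:3]
            values = [row[3] for row in rows]
            out.write(" ".join(dates + values) + "\n")


def combine_all(folder, order=None):
    """Combine every site found in `folder`; return {site: output path}."""
    written = {}
    for site, variable_paths in group_files_by_site(folder).items():
        out_path = os.path.join(folder, "XX_" + site + "_combined.txt")
        combine_site(variable_paths, out_path, order)
        written[site] = out_path
    return written
```

Calling `combine_all(".", order=["TAVE", "RAIN", "SMOIS_INST", "ET", "SFROFF"])` would use the variable list hardcoded in the question; a variable named in `order` but missing for a site raises a `KeyError` in this sketch. The output names do not have six dot-separated parts, so a second run does not pick them up as input.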

The original question, "Extracting columns from multiple text files with Python", is on Stack Overflow: https://stackoverflow.com/questions/11942683/