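I have a very large csv.gzip file with 59 million rows. I want to filter certain rows from it based on specific criteria and put all of them into a new master CSV file. So far, I have split the gzip file into 118 smaller CSV files and saved them on my computer. I did that with the following code: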

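import pandas as pd

num = 0
# Read the gzipped CSV lazily, 500,000 rows at a time.
# Note: error_bad_lines / warn_bad_lines were removed in pandas 2.0;
# newer versions use on_bad_lines='skip' instead.
df = pd.read_csv('google-us-data.csv.gz', header=None,
                 compression='gzip', chunksize=500000,
                 names=['a','b','c','d','e','f','g','h','i','j','k','l','m'],
                 error_bad_lines=False, warn_bad_lines=False)

for chunk in df:
    num = num + 1
    # Each chunk is written tab-separated, including the header row and the index column.
    chunk.to_csv('%ggoogle us' % num, sep='\t', encoding='utf-8')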

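The code above worked fine, and I now have a folder with 118 small files. I then wrote code to go through the 118 files one by one, extract the rows that meet specific conditions, and append them all to a new CSV file that I created and named "google final us". Here is the code: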
import pandas as pd
import numpy
for i in range(1, 118):
    file = open('google final us.csv', 'a')
    df = pd.read_csv('%ggoogle us' % i, error_bad_lines=False,
                     warn_bad_lines=False)
    df_f = df.loc[(df['a'] == 7) & (df['b'] == 2016) & (df['c'] == 'D') &
                  (df['d'] == 'US')]
    file.write(df_f)

Unfortunately, the code above gives the following error:
KeyError                                  Traceback (most recent call last)
C:\Users\...\Anaconda3\lib\site-packages\pandas\indexes\base.py in
get_loc(self, key, method, tolerance)
   1875             try:
-> 1876                 return self._engine.get_loc(key)
   1877             except KeyError:
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4027)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3891)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item
(pandas\hashtable.c:12408)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item
(pandas\hashtable.c:12359)()
KeyError: 'a'
During handling of the above exception, another exception occurred:
KeyError                                  Traceback (most recent call last)
<ipython-input-9-0ace0da2fbc7> in <module>()
      3 file = open('google final us.csv','a')
      4 df = pd.read_csv('1google us')
----> 5 df_f = df.loc[(df['a']==7) & (df['b'] == 2016) &
      (df['c'] =='D') & (df['d'] =='US')]
      6 file.write(df_f)
C:\Users\...\Anaconda3\lib\site-packages\pandas\core\frame.py in
__getitem__(self, key)
   1990             return self._getitem_multilevel(key)
   1991         else:
-> 1992             return self._getitem_column(key)
   1993
   1994     def _getitem_column(self, key):
C:\Users\...\Anaconda3\lib\site-packages\pandas\core\frame.py in
_getitem_column(self, key)
   1997         # get column
   1998         if self.columns.is_unique:
-> 1999             return self._get_item_cache(key)
   2000
   2001         # duplicate columns & possible reduce dimensionality
C:\Users\...\Anaconda3\lib\site-packages\pandas\core\generic.py in
_get_item_cache(self, item)
  1343         res = cache.get(item)
  1344         if res is None:
-> 1345             values = self._data.get(item)
  1346             res = self._box_item_values(item, values)
  1347             cache[item] = res
C:\Users\...\Anaconda3\lib\site-packages\pandas\core\internals.py in
get(self, item, fastpath)
   3223
   3224             if not isnull(item):
-> 3225                 loc = self.items.get_loc(item)
   3226             else:
   3227                 indexer = np.arange(len(self.items))
 [isnull(self.items)]
C:\Users\...\Anaconda3\lib\site-packages\pandas\indexes\base.py in
get_loc(self, key, method, tolerance)
   1876                 return self._engine.get_loc(key)
   1877             except KeyError:
-> 1878                 return
   self._engine.get_loc(self._maybe_cast_indexer(key))
   1879
   1880         indexer = self.get_indexer([key], method=method,
   tolerance=tolerance)
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4027)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3891)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item
(pandas\hashtable.c:12408)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item
(pandas\hashtable.c:12359)()
KeyError: 'a'

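What is going wrong? I have read many other Stack Overflow posts (such as "Create dataframes from unique value pairs by filtering across multiple columns" and "How can I break down a large csv file into small files based on common records by python"), but I still cannot figure out how to do this. Also, if you know a better way to extract the data than this one, please let me know!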

Best answer

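import os
import glob
import pandas

# Assumes the split files sit in "<working directory>/split files" and end in .csv.
path = os.getcwd()
csvFiles = glob.glob(os.path.join(path, "split files", "*.csv"))
list_ = []
for file in csvFiles:
    # The split files were written with sep='\t' (plus header and index),
    # so they have to be read back the same way.
    df = pandas.read_csv(file, sep='\t', index_col=0)
    df_f = df[(df['a'] == 7) & (df['b'] == 2016) & (df['c'] == 'D') & (df['d'] == 'US')]
    list_.append(df_f)
# Combine the filtered pieces and write a single output file.
frame = pandas.concat(list_, ignore_index=True)
frame.to_csv("Filtered Appended File")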

Save all the files in a "split files" folder inside the working directory.
This should work, since it reads every required file in that directory.
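A likely cause of the KeyError: 'a', judging only from the code in the question, is that the split files were written with sep='\t' but read back with the default comma separator, so pandas sees one fused column instead of 'a', 'b', 'c', 'd'. Here is a minimal sketch of that effect on an in-memory stand-in (the tiny DataFrame and the buffer are made up for illustration, not your real data):

```python
import io
import pandas as pd

# Hypothetical stand-in for one chunk; only the column names come from the question.
chunk = pd.DataFrame({'a': [7, 3], 'b': [2016, 2015],
                      'c': ['D', 'E'], 'd': ['US', 'CA']})

# Write it the way the splitting code does: tab-separated, with header and index.
buffer = io.StringIO()
chunk.to_csv(buffer, sep='\t')
text = buffer.getvalue()

# Reading with the default comma separator leaves a single fused column,
# so df['a'] would raise KeyError.
wrong = pd.read_csv(io.StringIO(text))
print(list(wrong.columns))

# Reading with sep='\t' restores the columns; index_col=0 consumes the saved index.
right = pd.read_csv(io.StringIO(text), sep='\t', index_col=0)
print(list(right.columns))
```

If your files behave like this stand-in, passing sep='\t' to read_csv should be enough to get the column lookups working.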
Reading a CSV takes a lot of memory, so splitting the file and processing the pieces is one possible solution. It looks like you are on the right track.
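As a possible alternative to writing 118 intermediate files, the filter could be applied to each chunk as it is read from the gzip file, appending only the matching rows to the output. The sketch below is one way to do that, not the only one. The function name filter_gzip_csv and the small sample file it builds are hypothetical, and bad-line handling is left out because the parameter name depends on the pandas version.

```python
import gzip
import os
import tempfile
import pandas as pd

COLUMNS = list('abcdefghijklm')  # the same 13 column names used in the question

def filter_gzip_csv(src, dst, chunksize=500000):
    """Stream src in chunks, write matching rows to dst, return how many were kept."""
    kept = 0
    reader = pd.read_csv(src, header=None, names=COLUMNS,
                         compression='gzip', chunksize=chunksize)
    for i, chunk in enumerate(reader):
        mask = ((chunk['a'] == 7) & (chunk['b'] == 2016) &
                (chunk['c'] == 'D') & (chunk['d'] == 'US'))
        matches = chunk[mask]
        # First chunk creates the file with a header; later chunks append without one.
        matches.to_csv(dst, mode='w' if i == 0 else 'a',
                       header=(i == 0), index=False)
        kept += len(matches)
    return kept

# Demo on a made-up four-row file (two rows satisfy the filter).
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, 'sample.csv.gz')
dst = os.path.join(tmp, 'filtered.csv')
rows = ['7,2016,D,US' + ',x' * 9,
        '7,2015,D,US' + ',x' * 9,
        '3,2016,D,US' + ',x' * 9,
        '7,2016,D,US' + ',x' * 9]
with gzip.open(src, 'wt', encoding='utf-8') as fh:
    fh.write('\n'.join(rows) + '\n')

kept = filter_gzip_csv(src, dst, chunksize=2)
result = pd.read_csv(dst)
print(kept, len(result))
```

Only one chunk is held in memory at a time, so the memory cost is governed by chunksize rather than by the 59 million rows.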
