Question

I am working with a large dataset in CSV format. I am trying to process the data column-by-column, then append the data to a frame in an HDF file. All of this is done using Pandas. My motivation is that, while the entire dataset is much bigger than my physical memory, the column size is manageable. At a later stage I will be performing feature-wise logistic regression by loading the columns back into memory one by one and operating on them.
I am able to make a new HDF file and create a new frame with the first column:
import pandas

hdf_file = pandas.HDFStore('train_data.hdf')
feature_column = pandas.read_csv('data.csv', usecols=[0])
hdf_file.append('features', feature_column)
But after that, I get a ValueError when trying to append a new column to the frame:
feature_column = pandas.read_csv('data.csv', usecols=[1])
hdf_file.append('features', feature_column)
Stack trace and error message:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 923, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2985, in write
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2675, in create_axes
    raise ValueError("cannot match existing table structure for [%s] on appending data" % items)
ValueError: cannot match existing table structure for [srch_id] on appending data
I am new to working with large datasets and limited memory, so I am open to suggestions for alternate ways to work with this data.
Answer
The complete docs are here, and some cookbook strategies are here.
PyTables is row-oriented, so you can only append rows. Read the CSV chunk-by-chunk, then append each whole chunk as you go, something like this:
import pandas as pd

store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000):
    store.append('df', chunk)   # append whole rows, one chunk at a time
store.close()
You must be a tad careful because it is possible for the dtypes of the resulting frame to differ when reading chunk-by-chunk, e.g. you have an integer-like column that has no missing values until, say, the 2nd chunk. The first chunk would read that column as int64, while the second reads it as float64. You may need to force dtypes with the dtype keyword to read_csv; see here.
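For instance, here is a minimal sketch of pinning a dtype up front so every chunk agrees; the file name, key, and the column name some_int_like_column are only illustrative, not from the original data:

import pandas as pd

# Force a consistent dtype for every chunk. If the column can contain missing
# values in later chunks, reading it as float64 everywhere keeps the table
# schema identical across appends.
dtypes = {'some_int_like_column': 'float64'}

store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000, dtype=dtypes):
    store.append('df', chunk)
store.close()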
Here is a similar question as well.
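As a follow-up for the feature-wise regression step, here is a sketch of loading a single column back into memory from the table written above; the key 'df' matches the earlier example, and 'srch_id' is just the column name taken from the error message:

import pandas as pd

# Open the store read-only and select just one column of the table.
store = pd.HDFStore('file.h5', mode='r')
feature = store.select('df', columns=['srch_id'])   # returns only this column
store.close()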