Problem Description
I'm importing large amounts of HTTP logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load it. My tactic so far has been to read the parsed lines into a DataFrame and then store the DataFrame in the HDFStore. My goal is to have a unique index for a single key in the HDFStore, but each DataFrame restarts its own index values. I was anticipating that HDFStore.append() would have some mechanism to tell it to ignore the DataFrame index values and just keep adding to the existing index values of my HDFStore key, but I cannot seem to find one. How do I import DataFrames while ignoring the index values they contain, and have the HDFStore increment its existing index values? The sample code below batches every 10 lines; naturally the real thing would be larger.
if hd_file_name:
    """
    HDF5 output file specified.
    """
    hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
    print(hdf_output)

    columns = ['source', 'ip', 'unknown', 'user', 'timestamp', 'http_verb',
               'path', 'protocol', 'http_result', 'response_size', 'referrer',
               'user_agent', 'response_time']

    source_name = str(log_file.name.rsplit('/')[-1])  # HDF5 Tables don't play nice with unicode so explicit str(). :(

    batch = []
    for count, line in enumerate(log_file, 1):
        data = parse_line(line, rejected_output=reject_output)
        # Add our source file name to the beginning.
        data.insert(0, source_name)
        batch.append(data)
        if not (count % 10):
            df = pd.DataFrame(batch, columns=columns)
            hdf_output.append(KEY_NAME, df)
            batch = []

    # Flush any remaining partial batch.
    if (count % 10):
        df = pd.DataFrame(batch, columns=columns)
        hdf_output.append(KEY_NAME, df)
Recommended Answer
You can do it like this. The only trick is that the first time through, the table doesn't exist in the store yet, so get_storer will raise.
import pandas as pd
import numpy as np
import os

files = ['test1.csv', 'test2.csv']
for f in files:
    pd.DataFrame(np.random.randn(10, 2), columns=list('AB')).to_csv(f)

path = 'test.h5'
if os.path.exists(path):
    os.remove(path)

with pd.get_store(path) as store:
    for f in files:
        df = pd.read_csv(f, index_col=0)

        # Offset the incoming index by the number of rows already stored;
        # get_storer raises on the first pass because 'foo' doesn't exist yet.
        try:
            nrows = store.get_storer('foo').nrows
        except:
            nrows = 0

        df.index = pd.Series(df.index) + nrows
        store.append('foo', df)
In [10]: pd.read_hdf('test.h5','foo')
Out[10]: 
           A         B
0   0.772017  0.153381
1   0.304131  0.368573
2   0.995465  0.799655
3  -0.326959  0.923280
4  -0.808376  0.449645
5  -1.336166  0.236968
6  -0.593523 -0.359080
7  -0.098482  0.037183
8   0.315627 -1.027162
9  -1.084545 -1.922288
10  0.412407 -0.270916
11  1.835381 -0.737411
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
15  1.181344  0.354411
16  0.501892 -0.358361
17  0.633256  0.419397
18  0.932354 -0.603932
19 -0.341135  2.453220
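Note that pd.get_store has since been deprecated and removed from pandas, so on a recent version the same pattern needs pd.HDFStore directly (it is itself a context manager). A minimal sketch of the equivalent loop, hedging the lookup because older pandas returned None from get_storer for a missing key while newer versions raise KeyError:

import os
import pandas as pd
import numpy as np

files = ['test1.csv', 'test2.csv']
for f in files:
    pd.DataFrame(np.random.randn(10, 2), columns=list('AB')).to_csv(f)

path = 'test.h5'
if os.path.exists(path):
    os.remove(path)

with pd.HDFStore(path) as store:      # replaces the removed pd.get_store
    for f in files:
        df = pd.read_csv(f, index_col=0)
        try:
            nrows = store.get_storer('foo').nrows
        except (KeyError, AttributeError):
            # 'foo' doesn't exist yet: newer pandas raises KeyError,
            # older pandas returned None (hence AttributeError on .nrows)
            nrows = 0
        df.index = df.index + nrows   # continue numbering where the store left off
        store.append('foo', df)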
You actually don't need a globally unique index (unless you want one), as HDFStore (through PyTables) provides one by uniquely numbering rows. You can always add these selection parameters:
In [11]: pd.read_hdf('test.h5','foo',start=12,stop=15)
Out[11]: 
           A         B
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
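If you do keep the continuous index, you can also select by index value rather than by row position, since the stored index is queryable in table-format data by default. A small sketch, assuming the same 'foo' table as above:

# Selects the same three rows as Out[11], but by index value
# via a PyTables query instead of positional start/stop.
pd.read_hdf('test.h5', 'foo', where='index >= 12 & index < 15')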