Problem description
I am interested in streaming a custom object into a pandas dataframe. According to the documentation, any object with a read() method can be used. However, even after implementing this function I am still getting this error:
Here is a simple version of the object, and how I am calling it:
class DataFile(object):
    def __init__(self, files):
        self.files = files

    def read(self):
        for file_name in self.files:
            with open(file_name, 'r') as file:
                for line in file:
                    yield line

import pandas as pd

hours = ['file1.csv', 'file2.csv', 'file3.csv']

data = DataFile(hours)
df = pd.read_csv(data)
Am I missing something, or is it just not possible to use a custom generator in Pandas? When I call the read() method it works just fine.
The reason I want to use a custom object rather than concatenating the dataframes together is to see if it is possible to reduce memory usage. I have used the gensim library in the past, and it makes it really easy to use custom data objects, so I was hoping to find some similar approach.
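One thing worth checking in the object above: because the body of read() contains yield, calling it returns a generator object, whereas read() on an ordinary file returns the data itself as str or bytes. The sketch below only illustrates that difference; the class name LineSource and the in-memory lines are hypothetical stand-ins for DataFile and its CSV files, and it makes no claim about what pandas does internally with the result.

```python
import io


class LineSource:
    """Hypothetical stand-in for DataFile, fed from memory instead of disk."""

    def __init__(self, lines):
        self.lines = lines

    def read(self):
        # The `yield` below makes read() a generator function.
        for line in self.lines:
            yield line


source = LineSource(["1,2,3\n", "4,5,6\n"])

# Calling read() does not return text; it returns a generator object.
result = source.read()

# A real file-like object returns the data itself from read().
real_file_result = io.StringIO("1,2,3\n4,5,6\n").read()

print(type(result).__name__)            # generator
print(type(real_file_result).__name__)  # str
```

Iterating over the generator still works, which is consistent with read() appearing to work fine when called by hand.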
Recommended answer
One way to make a file-like object in Python 3 is by subclassing io.RawIOBase. Using Mechanical snail's iterstream, you can convert any iterable of bytes into a file-like object:
import tempfile
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None

        def readable(self):
            return True

        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)


class DataFile(object):
    def __init__(self, files):
        self.files = files

    def read(self):
        for file_name in self.files:
            with open(file_name, 'rb') as f:
                for line in f:
                    yield line

def make_files(num):
    filenames = []
    for i in range(num):
        with tempfile.NamedTemporaryFile(mode='wb', delete=False) as f:
            f.write(b'''1,2,3\n4,5,6\n''')
            filenames.append(f.name)
    return filenames

# hours = ['file1.csv', 'file2.csv', 'file3.csv']
hours = make_files(3)
print(hours)
data = DataFile(hours)
df = pd.read_csv(iterstream(data.read()), header=None)

print(df)
prints
   0  1  2
0  1  2  3
1  4  5  6
2  1  2  3
3  4  5  6
4  1  2  3
5  4  5  6
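To see how the wrapper behaves without touching disk or pandas, here is a minimal standard-library sketch of the same technique, fed from in-memory chunks. The names iter_to_stream and chunks are hypothetical. It makes two deliberate changes relative to the iterstream shown above, and neither is part of the original answer: it calls iter() on its argument so plain lists are accepted, and it skips empty chunks so that one is not mistaken for end of file.

```python
import io


def iter_to_stream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """Wrap an iterable of bytes objects in a buffered, read-only stream."""
    iterator = iter(iterable)  # accept lists as well as generators

    class _IterStream(io.RawIOBase):
        def __init__(self):
            super().__init__()
            self.leftover = b""

        def readable(self):
            return True

        def readinto(self, b):
            # Take pending bytes first; otherwise pull the next non-empty chunk.
            chunk = self.leftover
            while not chunk:
                try:
                    chunk = next(iterator)
                except StopIteration:
                    return 0  # signal EOF
            # Copy at most len(b) bytes and keep the remainder for later.
            n = min(len(b), len(chunk))
            b[:n] = chunk[:n]
            self.leftover = chunk[n:]
            return n

    return io.BufferedReader(_IterStream(), buffer_size=buffer_size)


# Chunks need not align with line boundaries; a tiny buffer forces splitting.
chunks = [b"1,2,3\n4,", b"", b"5,6\n"]
stream = iter_to_stream(chunks, buffer_size=4)

first_line = stream.readline()
rest = stream.read()
print(first_line, rest)
```

A stream built this way could then be passed to pd.read_csv(stream, header=None) in the same way as iterstream(data.read()) in the answer's code.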