Problem description
As I was experimenting with pandas, I noticed some odd behavior of pandas.read_csv and was wondering if someone with more experience could explain what might be causing it.
To start, here is my basic class definition for creating a new pandas.DataFrame from a .csv file:
    import pandas as pd

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath  # File path to the target .csv file.
            self.csvfile = open(filepath)  # Open file.
            self.csvdataframe = pd.read_csv(self.csvfile)
Now, this works pretty well and calling the class in my __main__.py successfully creates a pandas dataframe:
    from dataMatrix import dataMatrix

    testObject = dataMatrix('/path/to/csv/file')
But I noticed that this process automatically used the first row of the .csv as the pandas.DataFrame.columns index. Instead, I decided to number the columns. Since I didn't want to assume I knew the number of columns beforehand, I took the approach of opening the file, loading it into a dataframe, counting the columns, and then reloading the dataframe with the proper number of columns using range().
    import pandas as pd

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath
            self.csvfile = open(filepath)

            # Load the .csv file to count the columns.
            self.csvdataframe = pd.read_csv(self.csvfile)
            # Count the columns.
            self.numcolumns = len(self.csvdataframe.columns)
            # Re-load the .csv file, manually setting the column names to their
            # number.
            self.csvdataframe = pd.read_csv(self.csvfile,
                                            names=range(self.numcolumns))
Keeping my processing in __main__.py the same, I got back a dataframe with the correct number of columns (500 in this case) with proper names (0...499), but it was otherwise empty (no row data).
Scratching my head, I decided to close self.csvfile and reload it like so:
    import pandas as pd

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath
            self.csvfile = open(filepath)

            # Load the .csv file to count the columns.
            self.csvdataframe = pd.read_csv(self.csvfile)
            # Count the columns.
            self.numcolumns = len(self.csvdataframe.columns)

            # Close the .csv file.                  #<---- +++++++
            self.csvfile.close()                    #<---- Added
            # Re-open file.                         #<---- Block
            self.csvfile = open(filepath)           #<---- +++++++

            # Re-load the .csv file, manually setting the column names to their
            # number.
            self.csvdataframe = pd.read_csv(self.csvfile,
                                            names=range(self.numcolumns))
Closing the file and re-opening it worked: I got back a pandas.DataFrame with columns numbered 0...499 and all 255 subsequent rows of data.
My question is why does closing the file and re-opening it make a difference?
Solution

When you open a file with open(filepath), a file handle iterator is returned. An iterator is good for one pass through its contents. So

    self.csvdataframe = pd.read_csv(self.csvfile)

reads the contents and exhausts the iterator, leaving the handle's read position at the end of the file. Subsequent calls to pd.read_csv on the same handle see an iterator that is already exhausted, so they find no rows to read.

Note that you could avoid this problem by just passing the file path to pd.read_csv:

    import pandas as pd

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath

            # Load the .csv file to count the columns.
            self.csvdataframe = pd.read_csv(filepath)
            # Count the columns.
            self.numcolumns = len(self.csvdataframe.columns)
            # Re-load the .csv file, manually setting the column names to their
            # number.
            self.csvdataframe = pd.read_csv(filepath,
                                            names=range(self.numcolumns))
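The one-pass behaviour described above can be reproduced without pandas at all. Here is a minimal sketch, assuming a small throwaway CSV written to a temporary file (the sample data and the variable names are made up for illustration). Reading the same handle twice yields nothing the second time, and seek(0) rewinds it:

```python
import os
import tempfile

# Write a tiny CSV to a temporary file (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n")
    path = tmp.name

handle = open(path)
first_pass = handle.read()    # consumes everything; position is now at end of file
second_pass = handle.read()   # nothing left to read on the same handle
handle.seek(0)                # rewind to the start of the file
third_pass = handle.read()    # the full contents are available again
handle.close()
os.remove(path)

print(repr(second_pass))          # ''
print(third_pass == first_pass)   # True
```

Closing and re-opening the file, as in the question, has the same effect as the rewind: the new handle starts at position 0.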
When given a path, pd.read_csv opens (and closes) the file for you.

PS. Another option is to reset the file handle to the beginning of the file by calling self.csvfile.seek(0) between the two reads, but using pd.read_csv(filepath, ...) is still easier.

Even better, instead of calling pd.read_csv twice (which is inefficient), you could rename the columns like this:

    import pandas as pd

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath

            # Load the .csv file to count the columns.
            self.csvdataframe = pd.read_csv(filepath)
            self.numcolumns = len(self.csvdataframe.columns)
            self.csvdataframe.columns = range(self.numcolumns)
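One thing worth checking with the rename approach: a default pd.read_csv call uses the first line of the file as the column labels, so renaming the columns afterwards leaves that line out of the data. Passing names=range(n) treats the first line as data instead. If the first line of your file really is data, header=None numbers the columns in a single read. A small sketch, assuming an in-memory sample in place of the real file (the data below is hypothetical):

```python
import io

import pandas as pd

csv_text = "10,20,30\n40,50,60\n"  # hypothetical file with no header row

# Default read: the first line becomes the column labels, then gets renamed away.
renamed = pd.read_csv(io.StringIO(csv_text))
renamed.columns = range(len(renamed.columns))

# header=None: every line is data and the columns are numbered 0..n-1.
numbered = pd.read_csv(io.StringIO(csv_text), header=None)

print(renamed.shape)           # (1, 3)
print(numbered.shape)          # (2, 3)
print(list(numbered.columns))  # [0, 1, 2]
```

Which variant is right depends on whether the first line of the .csv is a header or a data row.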