This article looks at what happens when pandas read_csv() is called twice on an already-open file and how to handle it. Hopefully it serves as a useful reference for anyone running into the same problem.

Problem description


As I was experimenting with pandas, I noticed some odd behavior of pandas.read_csv and was wondering if someone with more experience could explain what might be causing it.

To start, here is my basic class definition for creating a new pandas.dataframe from a .csv file:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath  # File path to the target .csv file.
        self.csvfile = open(filepath)  # Open file.
        self.csvdataframe = pd.read_csv(self.csvfile)

Now, this works pretty well, and calling the class in my __main__.py successfully creates a pandas dataframe:

from dataMatrix import dataMatrix

testObject = dataMatrix('/path/to/csv/file')

But I noticed that this process was automatically setting the first row of the .csv as the pandas.dataframe.columns index. Instead, I decided to number the columns. Since I didn't want to assume I knew the number of columns beforehand, I took the approach of opening the file, loading it into a dataframe, counting the columns, and then reloading the dataframe with the proper number of columns using range().

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile,
                                        names=range(self.numcolumns))

Keeping my processing in __main__.py the same, I got back a dataframe with the correct number of columns (500 in this case) with proper names (0...499), but it was otherwise empty (no row data).

Scratching my head, I decided to close self.csvfile and reload it like so:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)

        # Close the .csv file.         #<---- +++++++
        self.csvfile.close()           #<----  Added
        # Re-open file.                #<----  Block
        self.csvfile = open(filepath)  #<---- +++++++

        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile,
                                        names=range(self.numcolumns))

Closing the file and re-opening it returned correctly with a pandas.dataframe with columns numbered 0...499 and all 255 subsequent rows of data.

My question is why does closing the file and re-opening it make a difference?

Solution

When you open a file with

open(filepath)

a file handle iterator is returned. An iterator is good for one pass through its contents. So

self.csvdataframe = pd.read_csv(self.csvfile)

reads the contents and exhausts the iterator. Subsequent calls to pd.read_csv think the iterator is empty.
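The one-pass behaviour is easy to see without pandas at all. Here is a minimal sketch, using io.StringIO as a stand-in for the opened .csv file:

```python
import io

# A file handle is a one-pass iterator: after the first full read, the
# position sits at end-of-file, so a second read returns nothing.
f = io.StringIO("a,b\n1,2\n3,4\n")  # stands in for open(filepath)

first = f.read()
second = f.read()  # the handle is already exhausted

print(len(first))    # 12
print(repr(second))  # ''
```

This empty stream is exactly what the second pd.read_csv call saw.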

Note that you could avoid this problem by just passing the file path to pd.read_csv:

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)


        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(filepath,
                                        names=range(self.numcolumns))

pd.read_csv will then open (and close) the file for you.

PS. Another option is to reset the file handle to the beginning of the file by calling self.csvfile.seek(0), but using pd.read_csv(filepath, ...) is still easier.
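Applied to the original class, the seek(0) variant might look like the following sketch (the small driver at the bottom, using a throwaway temp file, is only for illustration):

```python
import os
import tempfile
import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # First pass: count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        self.numcolumns = len(self.csvdataframe.columns)

        # Rewind the handle instead of closing and re-opening it.
        self.csvfile.seek(0)

        # Second pass: number the columns 0..numcolumns-1.  Note that the
        # former header row now comes back as an ordinary data row.
        self.csvdataframe = pd.read_csv(self.csvfile,
                                        names=range(self.numcolumns))
        self.csvfile.close()

# Quick check against a throwaway two-column file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n")

m = dataMatrix(tmp.name)
os.unlink(tmp.name)

print(m.numcolumns)                  # 2
print(list(m.csvdataframe.columns))  # [0, 1]
```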


Even better, instead of calling pd.read_csv twice (which is inefficient), you could rename the columns like this:

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        self.numcolumns = len(self.csvdataframe.columns)
        self.csvdataframe.columns = range(self.numcolumns)
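As a quick illustration of this single-read version (io.StringIO stands in for a real .csv file on disk):

```python
import io
import pandas as pd

csv_text = "name,score,flag\nalice,10,x\nbob,7,y\n"
df = pd.read_csv(io.StringIO(csv_text))

# Rename the columns to their position, as the class above does.
df.columns = range(len(df.columns))

print(list(df.columns))  # [0, 1, 2]
print(len(df))           # 2
```

Note one difference from the two-read version: here the original header row is consumed as a header rather than re-read as a data row.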

That concludes this article on calling Pandas read_csv() twice on an open file. Hopefully the answer above is helpful; thanks for reading and for your continued support!
