This article walks through how to handle the Pandas CParserError: Error tokenizing data problem; it should be a useful reference for anyone hitting the same issue.

Problem Description

I have a large csv file with 25 columns that I want to read as a pandas dataframe. I am using pandas.read_csv(). The problem is that some rows have extra columns, something like this:

        col1   col2   stringColumn   ...   col25
1        12      1       str1                 3
...
33657    2       3       str4                 6       4    3 #<- that line has a problem
33658    1      32       blbla                 #<-some columns have missing data too

When I try to read it, I get the error:

CParserError: Error tokenizing data. C error: Expected 25 fields in line 33657, saw 28

The problem does not happen if the extra values appear in the first rows. For example, if I add values to the third row of the same file, it works fine:

#that example works:
           col1   col2   stringColumn   ...   col25
    1        12      1       str1                 3
    2        12      1       str1                 3
    3        12      1       str1                 3       f    4
    ...
    33657    2       3       str4                 6       4    3 #<- that line has a problem
    33658    1      32       blbla                 #<-some columns have missing data too

My guess is that pandas checks the first (n) rows to determine the number of columns, and if extra columns only show up after that, it has a problem parsing them.
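
A quick way to see this behaviour (a minimal sketch, not from the original question, assuming a pandas recent enough to expose the error as pandas.errors.ParserError, the modern name of CParserError): extra fields in the very first row parse fine, extra fields further down raise the error, and passing an explicit names list wider than any row sidesteps the column-count inference altogether.

import pandas as pd
from io import StringIO

ok_csv  = "a,b,c,d,e\n1,2,3\n4,5,6\n"   # widest row comes first
bad_csv = "1,2,3\n4,5,6\na,b,c,d,e\n"   # widest row comes later

# Works: the 5-field first row sets the width, shorter rows get NaN padding
pd.read_csv(StringIO(ok_csv), header=None)

# Fails: the parser expected 3 fields and hits 5 in line 3
try:
    pd.read_csv(StringIO(bad_csv), header=None)
except pd.errors.ParserError as err:
    print(err)

# Workaround: name more columns than any row can have, so the extra
# values land in their own columns instead of raising an error
df = pd.read_csv(StringIO(bad_csv), header=None, names=list(range(5)))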

Skipping the offending lines, as suggested here, is not an option: those lines contain valuable information.

Does anyone know a way around this?

Recommended Answer

In my initial post I mentioned not using error_bad_lines=False in pandas.read_csv. I decided that actually doing so is the more proper and elegant solution. I found this post quite useful:

Can I redirect the stdout in python into some sort of string buffer?

I added a little bit of code to it:

import sys
import re
from io import StringIO   # cStringIO is Python 2 only; io.StringIO works on Python 3

import pandas as pd

# Deliberately malformed CSV: two rows have two extra fields each
fake_csv = '''1,2,3\na,b,c\na,b,c\na,b,c,d,e\na,b,c\na,b,c,d,e\na,b,c\n''' # bad data
fname = "fake.csv"   # label attached to each log entry

# Capture the "Skipping line N: expected X fields, saw Y" messages that the
# C parser writes to stderr when error_bad_lines=False (note: newer pandas
# deprecates error_bad_lines in favour of on_bad_lines)
old_stderr = sys.stderr
sys.stderr = mystderr = StringIO()

df1 = pd.read_csv(StringIO(fake_csv),
                  error_bad_lines=False)

sys.stderr = old_stderr
log = mystderr.getvalue()

# Pull the numbers (line number, expected field count, fields seen) out of
# each skip message and tag them with the file name
isnum = re.compile(r"\d+")

lines_skipped_log = [
    isnum.findall(i) + [fname]
    for i in log.split("\n") if isnum.search(i)
]

columns = ["line_num", "flds_expct", "num_fields", "file"]
lines_skipped_log.insert(0, columns)
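
On Python 3, the same capture can be written with contextlib.redirect_stderr. This is only a sketch of that variant, not part of the original answer, and like the code above it assumes a pandas old enough to still accept error_bad_lines and print the skip messages to stderr (the reporting channel has changed across pandas versions, so check your version's behaviour).

import contextlib
from io import StringIO
import pandas as pd

fake_csv = '''1,2,3\na,b,c\na,b,c,d,e\na,b,c\n'''

buf = StringIO()
with contextlib.redirect_stderr(buf):       # restores stderr automatically on exit
    df1 = pd.read_csv(StringIO(fake_csv), error_bad_lines=False)
log = buf.getvalue()                        # same skip messages as above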


From there you can do anything you want with lines_skipped_log, such as outputting it to csv, creating a dataframe, and so on.
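
For instance, a small sketch building on the lines_skipped_log list from the code above (the output filename here is made up):

# The first element of lines_skipped_log holds the column names inserted above
skipped_df = pd.DataFrame(lines_skipped_log[1:], columns=lines_skipped_log[0])
skipped_df["line_num"] = skipped_df["line_num"].astype(int)   # regex captures are strings
skipped_df.to_csv("skipped_lines_log.csv", index=False)       # hypothetical output file
print(skipped_df)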

Perhaps you have a directory full of files. You can create a pandas dataframe out of each log and concatenate them. From there you will have a log of which rows were skipped and in which files at your fingertips (literally!).
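
One possible shape for that, again as a sketch only (the data/*.csv glob and the parse_one helper are hypothetical, and it assumes a pandas old enough to accept error_bad_lines):

import contextlib
import glob
import os
import re
from io import StringIO

import pandas as pd

SKIP_RE = re.compile(r"\d+")

def parse_one(path):
    """Read one CSV, returning (dataframe, per-file skipped-lines dataframe)."""
    buf = StringIO()
    with contextlib.redirect_stderr(buf):
        df = pd.read_csv(path, error_bad_lines=False)
    rows = [SKIP_RE.findall(line) + [os.path.basename(path)]
            for line in buf.getvalue().split("\n") if SKIP_RE.search(line)]
    log = pd.DataFrame(rows, columns=["line_num", "flds_expct", "num_fields", "file"])
    return df, log

frames, logs = [], []
for path in glob.glob("data/*.csv"):   # hypothetical directory of csv files
    df, skipped = parse_one(path)
    frames.append(df)
    logs.append(skipped)

all_skipped = pd.concat(logs, ignore_index=True)   # one log of every skipped row, per file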

That concludes this article on Pandas CParserError: Error tokenizing data; hopefully the recommended answer above is helpful.