When calling
df = pd.read_csv('somefile.csv')
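I get a DtypeWarning saying that some columns have mixed types and suggesting that I specify the dtype option on import or set low_memory=False.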
Why is the dtype option related to low_memory, and why would making it False help with this problem?
The deprecated low_memory option
The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently [source].
The reason you get this low_memory warning is that guessing the dtype of each column is very memory demanding. Pandas tries to determine which dtype to set by analyzing the data in each column.
Dtype Guessing (very bad)
Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.
Consider the example of a file that has a column called user_id. It contains 10 million rows where the user_id is always a number. Since pandas cannot know that the column holds only numbers, it will probably keep the values as the original strings until it has read the whole file.
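As a small illustration of the guessing described above (a sketch using a made-up in-memory CSV rather than a real 10-million-row file), the inferred dtype of user_id changes as soon as a single non-numeric value appears:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: the only difference is the last row.
clean_csv = "user_id,username\n1,Alice\n3,Bob\n"
dirty_csv = clean_csv + "foobar,Caesar\n"

# No dtype is given, so pandas has to guess from the values it reads.
clean_df = pd.read_csv(StringIO(clean_csv))
dirty_df = pd.read_csv(StringIO(dirty_csv))

print(clean_df["user_id"].dtype)  # an integer dtype
print(dirty_df["user_id"].dtype)  # a non-numeric (string/object) dtype
```

The exact name of the non-numeric dtype depends on the pandas version, which is why the comments above do not spell it out.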
Specifying dtypes (should always be done)
Adding dtype={'user_id': int} to the pd.read_csv() call tells pandas, from the moment it starts reading the file, that this column contains only integers.
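A minimal sketch of such a call, using an in-memory CSV in place of somefile.csv (the column names and values are illustrative assumptions, and "int64" is used as an explicit fixed-width stand-in for int):

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data standing in for somefile.csv.
csvdata = "user_id,username\n1,Alice\n3,Bob\n"

# Declare the dtypes up front so pandas does not have to guess them.
df = pd.read_csv(
    StringIO(csvdata),
    dtype={"user_id": "int64", "username": object},
)

print(df.dtypes)
```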
Also worth noting is that if the last line in the file had "foobar" written in the user_id column, the loading would crash if the above dtype was specified.
Example of broken data that breaks when dtypes are defined
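import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was: from StringIO import StringIO
csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
# The last row cannot be parsed as an integer, so this call raises an error:
pd.read_csv(sio, dtype={"user_id": int, "username": object})
ValueError: invalid literal for int() with base 10: 'foobar'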
dtypes are typically a numpy thing; read more about them here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
Gotchas, caveats, notes
Setting dtype=object will silence the above warning, but it will not make loading more memory efficient; if anything, it only makes it more process efficient.
Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.
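To make the dtype=object point above concrete, here is a rough sketch that compares the memory used by the same column loaded as object and as a 64-bit integer. The data is generated on the spot and is purely illustrative:

```python
from io import StringIO

import pandas as pd

# Hypothetical data: 1,000 rows of purely numeric user ids.
csvdata = "user_id\n" + "\n".join(str(i) for i in range(1000))

# dtype=object keeps every value as a Python string.
as_object = pd.read_csv(StringIO(csvdata), dtype=object)
# An explicit integer dtype stores 8 bytes per value.
as_int = pd.read_csv(StringIO(csvdata), dtype={"user_id": "int64"})

# deep=True also counts the Python string objects behind the object column.
object_bytes = as_object["user_id"].memory_usage(index=False, deep=True)
int_bytes = as_int["user_id"].memory_usage(index=False, deep=True)

print(object_bytes, int_bytes)
```

The object column should come out noticeably larger, which is why silencing the warning this way does not help with memory.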
Usage of converters
@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.
CSV files can be processed line by line, so converters could be applied in parallel far more efficiently by cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.
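Returning to converters themselves, the sketch below shows roughly what the approach looks like. The to_int_or_nan helper is a hypothetical name, not something pandas provides. It is followed by one possible vectorized alternative (reading the column as strings and coercing it with pd.to_numeric), which is my own suggestion rather than something from the original discussion:

```python
from io import StringIO

import pandas as pd

csvdata = "user_id,username\n1,Alice\n3,Bob\nfoobar,Caesar\n"


def to_int_or_nan(value):
    """Hypothetical converter: called once per cell with the raw string."""
    try:
        return int(value)
    except ValueError:
        return float("nan")


# Converter approach: one Python function call for every value in the column.
converted = pd.read_csv(StringIO(csvdata), converters={"user_id": to_int_or_nan})

# Vectorized alternative: read the column as strings, then coerce in one pass.
raw = pd.read_csv(StringIO(csvdata), dtype={"user_id": str})
coerced = pd.to_numeric(raw["user_id"], errors="coerce")

print(converted["user_id"].tolist())
print(coerced.tolist())
```

In both cases the 'foobar' row ends up as a missing value instead of raising an error. The second variant avoids the per-value Python call, so it is likely the cheaper option on large files.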