Problem description
I'm loading a CSV file (if you want the specific file, it's the training csv from http://www.kaggle.com/c/loan-default-prediction). Loading the csv in numpy takes dramatically more time than in pandas.
timeit("genfromtxt('train_v2.csv', delimiter=',')", "from numpy import genfromtxt", number=1)
102.46608114242554
timeit("pandas.io.parsers.read_csv('train_v2.csv')", "import pandas", number=1)
13.833590984344482
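
For reference, here is the benchmark above as a self-contained script (a sketch: it assumes train_v2.csv sits in the working directory, and the absolute numbers will of course vary by machine):

from timeit import timeit

# Time one full parse of the file with each library.
numpy_time = timeit("genfromtxt('train_v2.csv', delimiter=',')",
                    "from numpy import genfromtxt", number=1)
pandas_time = timeit("pandas.read_csv('train_v2.csv')",
                     "import pandas", number=1)
print("numpy.genfromtxt: %.1f s" % numpy_time)
print("pandas.read_csv:  %.1f s" % pandas_time)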
I'll also mention that the numpy memory usage fluctuates much more wildly during loading, peaks higher, and stays significantly higher once the file is loaded (2.49 GB for numpy vs ~600 MB for pandas). All datatypes in pandas are 8 bytes, so differing dtypes are not the difference. I got nowhere near maxing out my memory, so the time difference cannot be ascribed to paging.
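
If you want to reproduce the footprint comparison, one way is to size the loaded objects directly (a sketch: df.memory_usage(deep=True) requires a newer pandas than the 0.13 quoted below, and this measures only the final objects, not the peak reached while parsing, which is where genfromtxt spikes):

import numpy as np
import pandas as pd

arr = np.genfromtxt('train_v2.csv', delimiter=',')
df = pd.read_csv('train_v2.csv')

# Final in-memory sizes of the two containers.
print("numpy array: %.2f GB" % (arr.nbytes / 1e9))
print("DataFrame:   %.2f GB" % (df.memory_usage(deep=True).sum() / 1e9))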
Is there any reason for this difference? Is genfromtxt just far less efficient (and does it leak a bunch of memory)?
numpy version 1.8.0
pandas version 0.13.0-111-ge29c8e8
Recommended answer
genfromtxt from the NumPy module runs two main loops: the first reads all the lines in the file in as strings, and the second converts each string to its data type. That second pass over the data is a large part of why it is slower. In exchange, genfromtxt gives you more flexibility (e.g. in handling missing values) than commands like loadtxt and read_csv.
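
If you want a plain NumPy array but pandas-level parsing speed, a common workaround (a sketch, not part of the original answer; option 2 assumes the file parses cleanly without genfromtxt's missing-value handling):

import numpy as np
import pandas as pd

# Option 1: let pandas' fast C tokenizer do the parsing, then take
# the underlying numpy array (.values in pandas 0.13; .to_numpy() today).
arr = pd.read_csv('train_v2.csv').values

# Option 2: np.loadtxt skips genfromtxt's missing-value machinery and
# is typically faster, but every field must be present and parseable.
arr = np.loadtxt('train_v2.csv', delimiter=',', skiprows=1)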