Problem description
I've inherited a data file saved in the Stata .dta format. I can load it with scikits.statsmodels' genfromdta() function. This puts my data into a 1-dimensional NumPy array, where each entry is a row of data, stored as a 24-tuple.
In [2]: st_time = time.time(); initialload = sm.iolib.genfromdta("/home/myfile.dta"); ed_time = time.time(); print (ed_time - st_time)
666.523324013
In [3]: type(initialload)
Out[3]: numpy.ndarray
In [4]: initialload.shape
Out[4]: (4809584,)
In [5]: initialload[0]
Out[5]: (19901130.0, 289.0, 1990.0, 12.0, 19901231.0, 18.0, 40301000.0, 'GB', 18242.0, -2.368063, 1.0, 1.7783716290878204, 4379.355, 66.17669677734375, -999.0, -999.0, -0.60000002, -999.0, -999.0, -999.0, -999.0, -999.0, 0.2, 371.0)
I am curious if there's an efficient way to arrange this into a Pandas DataFrame. From what I've read, building up a DataFrame row-by-row seems quite inefficient... but what are my options?
I've written a pretty slow first-pass that just reads each tuple as a single-row DataFrame and appends it. Just wondering if anything else is known to be better.
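For context, a minimal sketch of that slow row-by-row approach (the column names here are hypothetical placeholders, and pd.concat stands in for the now-removed DataFrame.append); every iteration copies the accumulated frame, which is what makes it so slow:

import pandas as pd

# Hypothetical placeholder names; the real file has 24 named columns
column_names = ["col%d" % i for i in range(24)]

# Slow first pass: one single-row DataFrame per tuple, concatenated onto the
# accumulated result each iteration, so the total work grows quadratically.
df = pd.DataFrame(columns=column_names)
for row in initialload:
    df = pd.concat([df, pd.DataFrame([list(row)], columns=column_names)], ignore_index=True)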
Recommended answer
pandas.DataFrame(initialload, columns=list_of_column_names)
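Because genfromdta() returns a structured NumPy array, the field names are already carried in the array's dtype, so pandas can build the frame in a single call. A minimal sketch, reusing initialload from the session above (list_of_column_names is a placeholder for whatever 24 labels you want, not something defined here):

import pandas as pd

# One call: pandas reads the column names straight from the structured dtype
df = pd.DataFrame(initialload)

# If you prefer your own labels, rename after construction:
# df.columns = list_of_column_names

print(df.shape)  # expected: (4809584, 24)

This avoids any Python-level loop over the 4.8 million rows, so it is dramatically faster than appending one row at a time.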