


I need to make a strategic decision about which data structure to use as the basis for holding statistical data frames in my program.


I store hundreds of thousands of records in one big table. Each field may be of a different type, including short strings. I need to perform multiple regression analysis and other manipulations on the data, and these have to run quickly, in real time. I also need something that is relatively popular and well supported.



That is the most basic thing to do. Unfortunately it doesn't support strings. And I need to use numpy anyway for its statistical part, so this one is out of the question.

The ndarray can hold a different type in each column via a structured dtype (e.g. np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])). It seems a natural winner, but...
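
For reference, here is a minimal sketch of how the structured dtype quoted above could be used. The two sample records are made up for illustration; only the dtype itself comes from the question.

```python
import numpy as np

# Structured dtype from the question: a 16-character name and two float grades.
record_dtype = np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])

# Hypothetical sample records, invented for this sketch.
records = np.array(
    [('Alice', (8.0, 7.5)), ('Bob', (6.0, 9.0))],
    dtype=record_dtype,
)

# Fields are accessed by name; each access returns a regular ndarray.
names = records['name']                        # fixed-width strings, shape (2,)
mean_grades = records['grades'].mean(axis=1)   # per-record mean of the two grades
```

Field access by name works well for simple column reads, but operations such as grouping or joining would have to be written by hand on top of this.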


This one is built with statistical use in mind, but is it efficient enough?

I read that the pandas.DataFrame is no longer based on the numpy.ndarray (although it shares the same interface). Can anyone shed some light on this? Or maybe there is an even better data structure out there?


pandas.DataFrame is awesome, and interacts very well with much of numpy. Much of the DataFrame is written in Cython and is quite optimized. I suspect the ease of use and the richness of the pandas API will greatly outweigh any potential benefit you could obtain by rolling your own interfaces around numpy.
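
As a rough illustration of that interaction, the sketch below keeps a short-string column and numeric columns in one DataFrame, then hands the numeric part to numpy for a multiple regression. The column names and data are hypothetical, and np.linalg.lstsq is just one way to do the fit, not necessarily the one best suited to the real workload.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one short-string column plus numeric predictors.
df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'x1': [1.0, 2.0, 3.0, 4.0],
    'x2': [0.0, 1.0, 0.0, 1.0],
})
# Response built exactly as y = 1 + 2*x1 + 3*x2, so the fit is easy to check.
df['y'] = 1.0 + 2.0 * df['x1'] + 3.0 * df['x2']

# Design matrix with an intercept column, extracted as a plain NumPy array.
X = np.column_stack([np.ones(len(df)), df[['x1', 'x2']].to_numpy()])

# Ordinary least squares; coef holds [intercept, slope for x1, slope for x2].
coef, *_ = np.linalg.lstsq(X, df['y'].to_numpy(), rcond=None)
```

The string column simply stays in the DataFrame while the numeric columns are passed to numpy, which is the kind of mixed-type handling the question asks about.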


08-20 09:00