Problem description
I wonder whether there is a performance difference between plain arrays and JuliaDB or DataFrames when doing calculations on huge data sets (large, but still fitting in memory)?
I can use plain arrays and algorithms to do sorting, grouping, reducing etc. So why do I need JuliaDB or DataFrame?
I kind of understand why Python needs Pandas: it translates slow Python into fast C. But why does Julia need JuliaDB or DataFrames? Julia is already fast.
Recommended answer
This is a possibly broad topic. Let me highlight the features that are key in my opinion.
- They allow you to store columns of data with different types. You can do the same with arrays, but in general they then have to be arrays of `Any`, which are slower and use more memory than data columns with concrete types.
- You can access columns by name. However, this is a secondary feature; e.g. NamedArrays.jl provides an array-like type with named dimensions.
- An additional benefit is that there is an ecosystem built on the fact that columns have names (e.g. joining two `DataFrame`s or building a GLM model using GLM.jl).
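To make the first point concrete, here is a minimal sketch using only Base Julia. The data and the names `rows` and `cols` are made up for illustration; the named tuple of vectors is just a stand-in for a column-oriented table, not how either package stores data internally.

```julia
# Mixed-type data in one plain array: the element type has to be Any.
rows = Any[1 "a" 1.5; 2 "b" 2.5; 3 "c" 3.5]   # 3×3 Matrix{Any}

# The same data stored column-wise; every column keeps a concrete element type.
cols = (id = [1, 2, 3], name = ["a", "b", "c"], score = [1.5, 2.5, 3.5])

# Slicing a column out of the matrix still gives a Vector{Any},
# so code working on it cannot be specialized for Float64.
any_scores = rows[:, 3]

println(eltype(any_scores))   # element type of the column taken from the Any matrix
println(eltype(cols.score))   # element type of the typed column

# Both representations hold the same values; only the typed column lets the
# compiler generate specialized code for operations such as sum.
total_any = sum(any_scores)
total_typed = sum(cols.score)
```

Timing the two `sum` calls on larger data (for example with BenchmarkTools.jl) is one way to see the difference described above.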
This type of storage (heterogeneous columns with names) is how a table is represented in a relational database.
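As an illustration of the relational view, the sketch below joins two small tables on a shared key column. It assumes DataFrames.jl is installed; the tables `people` and `scores` and their contents are invented for the example.

```julia
using DataFrames

# Two small tables that share the key column :id (illustrative data).
people = DataFrame(id = [1, 2, 3], name = ["Ann", "Bob", "Cid"])
scores = DataFrame(id = [1, 3], score = [9.5, 7.0])

# Inner join on the named key column, as in a relational database:
# only ids present in both tables are kept.
joined = innerjoin(people, scores, on = :id)

println(joined)
```

Doing the same with plain arrays would mean tracking by hand which column position holds the key in each array.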
The main differences between JuliaDB.jl and DataFrames.jl are:
- JuliaDB.jl supports distributed parallelism; normal use of DataFrames.jl assumes that the data fits into memory (you can work around this using `SharedArray`, but this is not a part of the design), and if you want to parallelise computations you have to do it manually;
- JuliaDB.jl supports indexing while DataFrames.jl currently does not;
- Column types in JuliaDB.jl are stable, while in DataFrames.jl they currently are not. The consequences are:
  - when using JuliaDB.jl, each time a new type of data structure is created, all functions that are applied to this type have to be recompiled (which can be ignored for large data sets, but can have a visible performance impact when working with many heterogeneous small data sets);
  - when using DataFrames.jl, you have to use special techniques that ensure type inference to achieve high performance in some situations (most notably barrier functions).
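The barrier-function technique mentioned in the last point can be sketched as follows, assuming DataFrames.jl is installed. The function name `sum_of_squares` and the data are hypothetical, chosen only for this example.

```julia
using DataFrames

# Kernel (the "barrier"): it receives the column as an argument, so Julia
# compiles a specialized method for the column's concrete type.
function sum_of_squares(x::AbstractVector)
    s = zero(eltype(x))
    for v in x
        s += v * v
    end
    return s
end

# Outer function: the column type is not known to the compiler here, so it
# only extracts the column and makes a single dynamic call into the kernel.
sum_of_squares(df::DataFrame, col::Symbol) = sum_of_squares(df[!, col])

df = DataFrame(x = [1.0, 2.0, 3.0])
result = sum_of_squares(df, :x)
println(result)
```

The point of the split is that the hot loop lives in a function whose argument types are concrete, while the type-unstable column lookup happens only once outside it.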