Problem description
I am new to Spark and am learning about DataFrames, their operations, and the architecture. While reading about the comparison between RDD and DataFrame, I got confused about the data structure of both. Below are my observations; please help clarify or correct them if they are wrong.
1) An RDD is stored in RAM in a distributed manner (as blocks) across the nodes of a cluster, if the source data is in a cluster (e.g. HDFS).
If the data source is just a single CSV file, the data will be distributed across multiple blocks in the RAM of the machine running Spark (for example, a laptop). Am I right?
2) Is there any relationship between a block and a partition? Which one is the superset?
3) DataFrame: Is a DataFrame also stored in the same way as an RDD? Will an RDD be created in the backend if I load my source data only into a DataFrame?
Thanks in advance :)
Recommended answer
If caching or checkpointing is enabled, the data might also be stored either in memory or on disk. Also, shuffling always involves disk writes.
The CSV file will be split into multiple partitions, and each task will read only its own chunk of the data (a start-end offset range).
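The start-end offsets mentioned above can be sketched in plain Python, without Spark. This is only an illustration of the idea, not Spark's actual code: the names `compute_splits` and `read_split` and the 10-byte split size are hypothetical, and the boundary rule (a split that does not start at byte 0 skips its first line, and every split reads past its end to finish its last line) is an assumption modeled on how line-oriented Hadoop readers are usually described.

```python
import os
import tempfile


def compute_splits(file_size, split_size):
    """Return (start, end) byte ranges covering the file; end is exclusive."""
    return [(start, min(start + split_size, file_size))
            for start in range(0, file_size, split_size)]


def read_split(path, start, end):
    """Read the lines owned by one split.

    A split that does not begin at byte 0 discards its first (possibly
    partial) line, because the previous split reads past its own end
    to finish that line.
    """
    lines = []
    with open(path, "rb") as f:
        f.seek(start)
        if start != 0:
            f.readline()              # line owned by the previous split
        while f.tell() <= end:
            line = f.readline()
            if not line:              # reached end of file
                break
            lines.append(line.decode("utf-8").rstrip("\n"))
    return lines


# Hypothetical 30-byte CSV file, cut into 10-byte splits.
rows = ["id,name", "1,alice", "2,bob", "3,carol"]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    tmp.write("\n".join(rows) + "\n")
path = tmp.name

splits = compute_splits(os.path.getsize(path), split_size=10)
per_task = [read_split(path, s, e) for s, e in splits]
os.remove(path)

print(splits)    # [(0, 10), (10, 20), (20, 30)]
print(per_task)  # [['id,name', '1,alice'], ['2,bob'], ['3,carol']]
```

Each row ends up in exactly one task even though the byte ranges cut through the middle of lines, which is why a task only needs its own offsets and not the whole file.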
It is a bit confusing. Take a look at this answer, which states that a split is a logical division of the input data while a block is a physical division of the data. Spark uses its own terminology, and a partition in Spark has roughly the same meaning as a split in Hadoop.
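To make the logical/physical distinction concrete, here is a small sketch in plain Python. The file size, block size, and split sizes are hypothetical numbers chosen for illustration, and `ranges` and `blocks_for_split` are made-up helper names, not Hadoop or Spark APIs. In practice the split size often ends up close to the block size, but it does not have to match it.

```python
MB = 1024 * 1024


def ranges(total, size):
    """Cut [0, total) into consecutive (start, end) ranges of at most `size` bytes."""
    return [(s, min(s + size, total)) for s in range(0, total, size)]


def blocks_for_split(split, blocks):
    """Indices of the physical blocks that a logical split overlaps."""
    s_start, s_end = split
    return [i for i, (b_start, b_end) in enumerate(blocks)
            if b_start < s_end and s_start < b_end]


file_size = 300 * MB
blocks = ranges(file_size, 128 * MB)   # physical layout: 3 blocks
splits = ranges(file_size, 64 * MB)    # logical layout: 5 splits

# Small splits: several splits fall inside one block.
mapping = {i: blocks_for_split(sp, blocks) for i, sp in enumerate(splits)}
print(mapping)   # {0: [0], 1: [0], 2: [1], 3: [1], 4: [2]}

# Large splits: one split spans more than one block.
big_splits = ranges(file_size, 200 * MB)
print([blocks_for_split(sp, blocks) for sp in big_splits])   # [[0, 1], [1, 2]]
```

Depending on the configured sizes, a block can contain several splits or a split can span several blocks, so neither one is strictly a superset of the other.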
When a file is read from HDFS, HadoopRDD is used, and under the hood each split becomes a partition.
A DataFrame is nothing other than an RDD[InternalRow] under the hood.
Take a look at SparkPlan.