Question
In Spark, I understand how to use `wholeTextFiles` and `textFile`, but I'm not sure which to use when. Here is what I know so far:
- When dealing with files that are not split by line, one should use `wholeTextFiles`; otherwise use `textFile`.
I would think that by default, `wholeTextFiles` and `textFile` partition by file content and by lines, respectively. However, both of them let you change the `minPartitions` parameter.
So, how does changing the number of partitions affect how these are processed?
For example, say I have one very large file with 100 lines. What would be the difference between processing it with `wholeTextFiles` using 100 partitions, and processing it with `textFile` (which partitions it line by line) using the default of 100 partitions?
What is the difference between these?
Recommended answer
For reference, `wholeTextFiles` uses `WholeTextFileInputFormat`, which extends `CombineFileInputFormat` (https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html).
A couple of notes on `wholeTextFiles`:
- Each record in the RDD returned by `wholeTextFiles` has the file name and the entire contents of the file. This means that a file cannot be split (at all).
- Because it extends `CombineFileInputFormat`, it will try to combine groups of smaller files into one partition.
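To make the difference in record shape concrete, here is a minimal plain-Python model of what each call returns. It is not Spark code: `text_file_records`, `whole_text_files_records`, and `make_sample_dir` are hypothetical helpers written only for this illustration, and they ignore partitioning entirely. In real Spark the corresponding calls are `sc.textFile(path, minPartitions)` and `sc.wholeTextFiles(path, minPartitions)`.

```python
import os
import tempfile


def make_sample_dir():
    """Create a temporary directory holding two small text files (illustration only)."""
    directory = tempfile.mkdtemp()
    with open(os.path.join(directory, "a.txt"), "w", encoding="utf-8") as handle:
        handle.write("a1\na2\n")
    with open(os.path.join(directory, "b.txt"), "w", encoding="utf-8") as handle:
        handle.write("b1\n")
    return directory


def text_file_records(directory):
    """Model of textFile: one record per line, across all files."""
    records = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), encoding="utf-8") as handle:
            records.extend(handle.read().splitlines())
    return records


def whole_text_files_records(directory):
    """Model of wholeTextFiles: one (path, full contents) record per file."""
    records = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as handle:
            records.append((path, handle.read()))
    return records


if __name__ == "__main__":
    sample = make_sample_dir()
    print(text_file_records(sample))         # three line records
    print(whole_text_files_records(sample))  # two (path, contents) records
```

Because a `wholeTextFiles` record carries an entire file, there is no smaller unit that could be handed to a different partition, which is why the file cannot be split.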
If I have two small files in a directory, it is possible that both files will end up in a single partition. If I set `minPartitions=2`, then I will likely get two partitions back instead.
Now if I were to set `minPartitions=3`, I will still get back two partitions, because the contract for `wholeTextFiles` is that each record in the RDD contains an entire file.
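The behaviour described above can be sketched with a simplified model of how whole files get packed into partitions. This is an assumption-laden illustration, not Spark's or Hadoop's actual code: `pack_whole_files` is a hypothetical function, it assumes the target split size is roughly the total input size divided by `minPartitions`, and it ignores node and rack locality, which the real `CombineFileInputFormat` takes into account.

```python
import math


def pack_whole_files(file_sizes, min_partitions):
    """Greedily group unsplittable files into partitions (simplified model).

    file_sizes: sizes in bytes, one entry per file.
    Returns a list of partitions, each a list of file indices.
    """
    total = sum(file_sizes)
    # Assumed target size per partition; a file is never split to meet it.
    max_split_size = math.ceil(total / max(min_partitions, 1))

    partitions, current, current_size = [], [], 0
    for index, size in enumerate(file_sizes):
        current.append(index)
        current_size += size
        # Close the partition once it reaches the target size.
        if current_size >= max_split_size:
            partitions.append(current)
            current, current_size = [], 0
    if current:
        partitions.append(current)  # leftover files form a final partition
    return partitions


if __name__ == "__main__":
    two_small_files = [10, 10]
    for requested in (1, 2, 3):
        print(requested, pack_whole_files(two_small_files, requested))
    # One very large file, 100 partitions requested.
    print(pack_whole_files([1000], 100))
```

Under this model, two small files yield one partition when one is requested and two partitions when two or three are requested. The single large file from the question stays in one partition even with `minPartitions=100`, since it is a single record.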