Problem Description
In Spark, it is possible to set some Hadoop configuration settings, for example:
System.setProperty("spark.hadoop.dfs.replication", "1")
This works; the replication factor is set to 1. Assuming that this is the case, I thought that this pattern (prepending "spark.hadoop." to a regular Hadoop configuration property) would also work for textinputformat.record.delimiter:
System.setProperty("spark.hadoop.textinputformat.record.delimiter", "\n\n")
However, it seems that Spark just ignores this setting. Am I setting textinputformat.record.delimiter the correct way? Is there a simpler way of setting textinputformat.record.delimiter? I would like to avoid writing my own InputFormat, since I really only need to obtain records delimited by two newlines.
Recommended Answer
I got this working with plain uncompressed files using the function below.
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Read a file with a custom record delimiter by passing a Hadoop Configuration
// directly to newAPIHadoopFile (sc is the surrounding SparkContext).
def nlFile(path: String) = {
  val conf = new Configuration
  conf.set("textinputformat.record.delimiter", "\n")
  sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString) // keep the record text, drop the LongWritable byte offset
}
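For the two-newline records asked about in the question, the same approach should work with the delimiter changed. A minimal sketch under the same assumptions and imports as above (sc is an existing SparkContext; the helper name delimitedFile and the input path are illustrative, not from the answer):

def delimitedFile(path: String, delimiter: String) = {
  val conf = new Configuration
  // Records are split on the given delimiter instead of single newlines.
  conf.set("textinputformat.record.delimiter", delimiter)
  sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString)
}

// Records separated by a blank line, i.e. two newlines:
val records = delimitedFile("/path/to/input.txt", "\n\n")

This avoids writing a custom InputFormat entirely: the delimiter is just a per-read Hadoop configuration property, so the same function can serve different record formats.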