Question
I haven't been able to figure this out, but I'm trying to use a direct output committer with AWS Glue:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
Is it possible to use this configuration with AWS Glue?
Answer
Option 1:
Glue uses a Spark context, so you can set Hadoop configuration in AWS Glue as well, since internally a DynamicFrame is a kind of DataFrame.
sc._jsc.hadoopConfiguration().set("mykey","myvalue")
I think you also need to set the corresponding committer class, like this:
sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter")
Sample code snippet:
sc = SparkContext()
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version","2")
glueContext = GlueContext(sc)
spark = glueContext.spark_session
To prove that the configuration exists:
To debug in Python:
sc._conf.getAll()  # print this
To debug in Scala:
sc.getConf.getAll.foreach(println)
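As a plain-Python illustration (no Spark required), `sc._conf.getAll()` returns a list of key/value pairs, so you can scan that output for the committer setting; the sample pairs below are hypothetical:

```python
# Hypothetical sample of what sc._conf.getAll() might return in a Glue job;
# only the shape (a list of key/value pairs) matters here.
conf_pairs = [
    ("spark.app.name", "my-glue-job"),  # placeholder value
    ("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2"),
]

# Keep only the entries whose key mentions the committer algorithm version
matches = [(k, v) for k, v in conf_pairs if "fileoutputcommitter.algorithm" in k]
print(matches)
```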
Option 2:
Alternatively, you can try using Glue's job parameters:
https://docs.aws.amazon.com/glue/latest/dg/add-job.html which has key/value properties, as mentioned in the docs:
'--myKey' : 'value-for-myKey'
You can edit the job in the console and specify the parameters with --conf.
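As a sketch (plain Python, names hypothetical), the job-parameter pair you would enter uses `--conf` as the key and the Spark property assignment as the value:

```python
# Hypothetical Glue job parameters as a plain dict, mirroring the
# key/value fields in the console's "Job parameters" section.
job_parameters = {
    "--conf": "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2",
}

# The value is a single "<spark property>=<setting>" assignment
prop, _, value = job_parameters["--conf"].partition("=")
print(prop, "=", value)
```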
Option 3:
If you are using the AWS CLI, you can try the following: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
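For example, with the AWS CLI the same setting can be supplied through the job's default arguments; this sketch only builds and prints the JSON payload you would pass (it does not call AWS, and the surrounding CLI invocation is an assumption):

```python
import json

# Sketch of the --default-arguments JSON payload for a CLI call such as
# `aws glue create-job ... --default-arguments '<payload>'`.
default_arguments = {
    "--conf": "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2",
}

payload = json.dumps(default_arguments)
print(payload)
```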
The funny thing is that the docs themselves say not to set a parameter like this, yet I don't know why it is exposed.