Question
We are using CDH 5.13 with Spark 2.3.0 and S3Guard. After running the same job on EMR 5.x / 6.x with the same resources, we got a 5-20x performance degradation. According to https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-committer-reqs.html the default committer (since EMR 5.20) is not good for S3A. We tested EMR-5.15.1 and got the same results as on Hadoop.
If I try to use the Magic Committer, I get:
py4j.protocol.Py4JJavaError: An error occurred while calling o72.save.
: java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
My code is below (I also configured S3Guard via the EMR configuration):
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

sconf = SparkConf()
sconf.set("spark.hadoop.fs.s3a.committer.name", "magic")
sconf.set("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
sconf.set("spark.sql.sources.commitProtocolClass", "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol")
sconf.set("spark.sql.parquet.output.committer.class", "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter")
sconf.set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
# note: the original code misspelled the key as "commiter", so Hadoop would silently ignore it
sconf.set("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")

sc = SparkContext(appName="s3acommitter", conf=sconf)
spark = SparkSession(sc)

sourceDF = spark.range(0, 10000)
datasets = "s3a://parquet/commiter-test"
# note: the original concatenated without a "/", producing ".../commiter-testparquet"
sourceDF.write.format("parquet").save(datasets + "/parquet")
sc.stop()
At https://repo.hortonworks.com/content/repositories/releases/org/apache/spark/spark-hadoop-cloud_2.11/ I cannot find a build for Spark 2.4.4 & Hadoop 3.2.1.
How do I enable the Magic Committer on EMR?
Spark logs:
20/11/25 21:49:38 INFO ParquetFileFormat: Using user defined output committer for Parquet: com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter
20/11/25 21:49:38 WARN ParquetOutputFormat: Setting parquet.enable.summary-metadata is deprecated, please use parquet.summary.metadata.level
20/11/25 21:49:38 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
20/11/25 21:49:38 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/11/25 21:49:38 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter
20/11/25 21:49:38 INFO EmrOptimizedParquetOutputCommitter: EMR Optimized committer is not supported by this filesystem (org.apache.hadoop.fs.s3a.S3AFileSystem)
20/11/25 21:49:38 INFO EmrOptimizedParquetOutputCommitter: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
20/11/25 21:49:38 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
Answer
You are getting the class-not-found error because some of the binding classes from Spark (in the spark-hadoop-cloud module) aren't on the classpath. Even if they were, you'd be blocked at the next hurdle: those committers aren't shipped in EMR.
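One quick way to confirm this on a cluster node is to look for the spark-hadoop-cloud jar among Spark's jars; the path below is the usual EMR layout but is an assumption, so adjust it for your distribution:

```python
import glob

# Look for the spark-hadoop-cloud module among Spark's jars.
# /usr/lib/spark/jars is the typical EMR location (an assumption here --
# adjust for your distribution).
jars = glob.glob("/usr/lib/spark/jars/*hadoop-cloud*")
if jars:
    print("found:", jars)
else:
    print("spark-hadoop-cloud not bundled")
```

If nothing matches, the PathOutputCommitProtocol binding classes that the Magic Committer configuration refers to are simply not on the classpath, which matches the ClassNotFoundException above.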
Amazon EMR has its own equivalent of the S3A committers: see "Using the EMRFS S3-optimized Committer" in the EMR documentation.