I am submitting a job to YARN (Spark 2.1.1 + Kafka 0.10.2.1) which connects to a secured HBase cluster. This job performs just fine when I am running in "local" mode (spark.master=local[*]).
However, as soon as I submit the job with master as YARN (and deploy mode as client), I see the following error message -
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
I am following the Hortonworks recommendations for providing the YARN cluster with the HBase configuration, keytab, etc. I followed this KB article - https://community.hortonworks.com/content/supportkb/48988/how-to-run-spark-job-to-interact-with-secured-hbas.html
Any pointers on what could be going on?
The mechanism for logging into HBase =>
UserGroupInformation.setConfiguration(hbaseConf)
val keyTab = "keytab-location"
val principal = "kerberos-principal"
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keyTab)
UserGroupInformation.setLoginUser(ugi)
ugi.doAs(new PrivilegedExceptionAction[Void]() {
  override def run(): Void = {
    // open the HBase connection as the Kerberos-authenticated user
    hbaseCon = Some(ConnectionFactory.createConnection(hbaseConf))
    null
  }
})
Also, I tried an alternative mechanism for logging in ->
UserGroupInformation.loginUserFromKeytab(principal, keyTab)
connection = ConnectionFactory.createConnection(hbaseConf)
Please suggest.
You are not alone in the quest for Kerberos auth to HBase from Spark, cf. SPARK-12279
A little-known fact is that Spark now generates Hadoop "auth tokens" for Yarn, HDFS, Hive, and HBase on startup. These tokens are then broadcast to the executors, so that they don't have to mess again with Kerberos auth, keytabs, etc.
The first problem is that it's not explicitly documented, and in case of failure the errors are hidden by default (i.e. most people don't connect to HBase with Kerberos, so it's usually pointless to state that the HBase JARs are not in the CLASSPATH and no HBase token was created... usually.)
To log all details about these tokens, you have to set the log level for org.apache.spark.deploy.yarn.Client to DEBUG.
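For instance, a minimal log4j.properties fragment could enable that logger (a sketch only; adapt it to whichever log4j.properties your driver actually loads):

# enable DEBUG traces from the YARN client that creates/collects the delegation tokens
log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG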
The second problem is that beyond the properties, Spark supports many env variables, some documented, some not documented, and some actually deprecated.
For instance, SPARK_CLASSPATH is now deprecated, and its content is actually injected into the Spark properties spark.driver.extraClassPath / spark.executor.extraClassPath.
But SPARK_DIST_CLASSPATH is still in use, and in the Cloudera distro for example, it is used to inject the core Hadoop libs & config into the Spark "launcher" so that it can bootstrap a YARN-cluster execution, before the driver is started (i.e. before spark.driver.extraClassPath is evaluated).
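As an illustration, on a Hadoop-provided distro this is typically wired up in spark-env.sh (a sketch only; paths are illustrative and assume the hadoop CLI is on the PATH, your distro may populate the variable differently):

# spark-env.sh - put the core Hadoop client libs & config on the launcher classpath
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_DIST_CLASSPATH=$(hadoop classpath)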
Other variables of interest are
HADOOP_CONF_DIR
SPARK_CONF_DIR
SPARK_EXTRA_LIB_PATH
SPARK_SUBMIT_OPTS
SPARK_PRINT_LAUNCH_COMMAND
The third problem is that, in some specific cases (e.g. YARN-cluster mode in the Cloudera distro), the Spark property spark.yarn.tokens.hbase.enabled is silently set to false -- which makes absolutely no sense, since that default is hard-coded to true in the Spark source code...!
So you are advised to force it explicitly to true in your job config.
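For example, you can force it on the spark-submit command line (as in the full commands further below):

  --conf spark.yarn.tokens.hbase.enabled=true

or set it once for all jobs in spark-defaults.conf:

  spark.yarn.tokens.hbase.enabled  true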
The fourth problem is that, even if the HBase token has been created at startup, the executors must explicitly use it to authenticate. Fortunately, Cloudera has contributed a "Spark connector" to HBase, to take care of this kind of nasty stuff automatically. It's now part of the HBase client, by default (cf. hbase-spark*.jar).
The fifth problem is that, AFAIK, if you don't have metrics-core*.jar in the CLASSPATH then the HBase connections will fail with puzzling (and unrelated) ZooKeeper errors.
How to make that stuff work, with debug traces
# we assume that spark-env.sh and spark-defaults.conf are already Hadoop-ready,
# and also *almost* HBase-ready (as in a CDH distro);
# especially HADOOP_CONF_DIR and SPARK_DIST_CLASSPATH are expected to be set
# but spark.*.extraClassPath / .extraJavaOptions are expected to be unset
KRB_DEBUG_OPTS="-Dlog4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.spark.HBaseContext=DEBUG -Dsun.security.krb5.debug=true -Djava.security.debug=gssloginconfig,configfile,configparser,logincontext"
EXTRA_HBASE_CP=/etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/metrics-core-2.2.0.jar
export SPARK_SUBMIT_OPTS="$KRB_DEBUG_OPTS"
export HADOOP_JAAS_DEBUG=true
export SPARK_PRINT_LAUNCH_COMMAND=True
spark-submit --master yarn-client \
--files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
--principal [email protected] --keytab /a/b/XX.keytab \
--conf spark.yarn.tokens.hbase.enabled=true \
--conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
--conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
--class TestSparkHBase TestSparkHBase.jar
spark-submit --master yarn-cluster --conf spark.yarn.report.interval=4000 \
--files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
--principal [email protected] --keytab /a/b/XX.keytab \
--conf spark.yarn.tokens.hbase.enabled=true \
--conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.driver.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.driverEnv.HADOOP_JAAS_DEBUG=true \
--conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
--class TestSparkHBase TestSparkHBase.jar
PS: when using an HBaseContext you don't need /etc/hbase/conf/ in the executors' CLASSPATH; the conf is propagated automatically.
PPS: I advise you to set log4j.logger.org.apache.zookeeper.ZooKeeper=WARN in log4j.properties, because the ZooKeeper logging is verbose, useless, and even confusing (all the interesting stuff is logged at the HBase level).
PPPS: instead of that verbose SPARK_SUBMIT_OPTS var, you could also statically list the Log4J options in $SPARK_CONF_DIR/log4j.properties and the rest in $SPARK_CONF_DIR/java-opts; the same goes for the Spark properties in $SPARK_CONF_DIR/spark-defaults.conf and env variables in $SPARK_CONF_DIR/spark-env.sh
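For instance, a sketch of how the same debug settings could be split statically across those files (contents are illustrative, mirroring the KRB_DEBUG_OPTS above):

# $SPARK_CONF_DIR/log4j.properties (excerpt)
log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG
log4j.logger.org.apache.hadoop.hbase.spark.HBaseContext=DEBUG
log4j.logger.org.apache.zookeeper.ZooKeeper=WARN

# $SPARK_CONF_DIR/java-opts (applied to the driver JVM)
-Dsun.security.krb5.debug=true -Djava.security.debug=gssloginconfig,configfile,configparser,logincontext

# $SPARK_CONF_DIR/spark-defaults.conf (excerpt)
spark.yarn.tokens.hbase.enabled  true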
About the "Spark connector" to HBase
Excerpt from the official HBase documentation, chapter 83 Basic Spark
What is not mentioned in the doc is that the HBaseContext automatically uses the HBase "auth token" (when present) to authenticate the executors.
Note also that the doc has an example (in Scala then in Java) of a Spark foreachPartition operation on an RDD, using a BufferedMutator for async bulk load into HBase.
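For illustration, here is a minimal Scala sketch in the spirit of that documentation example (the table name "t1", column family "cf1" and the RDD contents are hypothetical, not taken from the original post):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, Put}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("TestSparkHBase"))
val hbaseConf = HBaseConfiguration.create()          // picks up hbase-site.xml from the CLASSPATH
val hbaseContext = new HBaseContext(sc, hbaseConf)   // the HBase "auth token" is handled here

val rdd = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))   // hypothetical data

// each partition gets a live, already-authenticated HBase Connection
hbaseContext.foreachPartition(rdd, (it: Iterator[(String, String)], conn: Connection) => {
  val mutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach { case (rowKey, value) =>
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes(value))
    mutator.mutate(put)    // buffered, asynchronous write
  }
  mutator.flush()
  mutator.close()
})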