Problem description
In my Spark application, I use the following code to retrieve data from a SQL Server database using the JDBC driver.
Dataset<Row> dfResult = sparksession.read().jdbc("jdbc:sqlserver://server\\dbname", tableName, partitionColumn, lowerBound, upperBound, numberOfPartitions, properties);
I then apply a map operation on the dfResult dataset.
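For reference, a minimal self-contained sketch of the setup described above might look like the following. This is only an illustration: the connection URL, credentials, table name, partition column, bounds and the mapped column are hypothetical placeholders, not values from the actual application.

import java.util.Properties;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcReadExample {
    public static void main(String[] args) {
        SparkSession sparksession = SparkSession.builder()
                .appName("jdbc-read-example")
                .getOrCreate();

        Properties properties = new Properties();
        properties.setProperty("user", "dbuser");          // hypothetical credentials
        properties.setProperty("password", "dbpassword");

        // Each of the 10 partitions reads its own sub-range of the partition column,
        // and each partition opens its own JDBC connection for that query.
        Dataset<Row> dfResult = sparksession.read().jdbc(
                "jdbc:sqlserver://server\\dbname",  // hypothetical server\database
                "dbo.sourceTable",                  // hypothetical table name
                "id",                               // hypothetical partition column
                1L,                                 // lowerBound
                1000000L,                           // upperBound
                10,                                 // numberOfPartitions
                properties);

        // Example map operation: turn one column of each row into a String.
        Dataset<String> mapped = dfResult.map(
                (MapFunction<Row, String>) row -> row.getAs("id").toString(),
                Encoders.STRING());

        mapped.show();
        sparksession.stop();
    }
}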
When running the application in standalone mode, I see that Spark creates a separate connection for each RDD partition. From the API description, I understand that Spark takes care of closing the connection.
Is there a way to reuse the connection instead of opening and closing a JDBC connection for each RDD partition?
Thanks
Recommended answer
Even when you're pushing data manually into a database over an API, the recommendation I often see is to create one connection per partition:
// pseudo-code: open one connection per partition, then close it when the partition is done
rdd.foreachPartition { iterator =>
  val connection = SomeAPI.connect()
  for (i <- iterator)
    connection.insert(i)
  connection.close()
}
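For example, a concrete version of the same pattern using plain JDBC from Java might look like the sketch below. The connection URL, credentials, target table and column positions are hypothetical placeholders; the point is only to show one connection being opened, used and closed per partition.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class PerPartitionWriter {
    public static void writePerPartition(Dataset<Row> dfResult) {
        dfResult.foreachPartition((ForeachPartitionFunction<Row>) iterator -> {
            // One connection per partition, created on whichever executor evaluates it.
            Connection connection = DriverManager.getConnection(
                    "jdbc:sqlserver://server\\dbname", "dbuser", "dbpassword"); // hypothetical
            try (PreparedStatement stmt = connection.prepareStatement(
                    "INSERT INTO targetTable (id, value) VALUES (?, ?)")) {     // hypothetical table
                while (iterator.hasNext()) {
                    Row row = iterator.next();
                    stmt.setLong(1, row.getLong(0));     // hypothetical column positions
                    stmt.setString(2, row.getString(1));
                    stmt.addBatch();
                }
                stmt.executeBatch();
            } finally {
                connection.close();
            }
        });
    }
}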
So if the JDBC reader is already doing this per partition, that just confirms this is the intended pattern.
Here's another example of this pattern being recommended:
I presume this is the recommended pattern because, in a multi-node cluster, you never know on which node a particular partition will be evaluated, so you want to make sure each partition has its own DB connection wherever it runs.