Problem description
In my Spark application, I use the following code to retrieve data from a SQL Server database using the JDBC driver.
Dataset<Row> dfResult = sparksession.read().jdbc("jdbc:sqlserver://server\\dbname", tableName, partitionColumn, lowerBound, upperBound, numberOfPartitions, properties);
I then apply a map operation on the dfResult dataset.
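For reference, a minimal self-contained sketch of the setup described above might look like the following. This is only an illustration: the connection URL, credentials, table name, partition column, bounds and the mapped column are hypothetical placeholders, not values from the actual application.

import java.util.Properties;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcReadExample {
    public static void main(String[] args) {
        SparkSession sparksession = SparkSession.builder()
                .appName("jdbc-read-example")
                .getOrCreate();

        Properties properties = new Properties();
        properties.setProperty("user", "dbuser");          // hypothetical credentials
        properties.setProperty("password", "dbpassword");

        // Each of the 10 partitions reads its own sub-range of the partition column,
        // and each partition opens its own JDBC connection for that query.
        Dataset<Row> dfResult = sparksession.read().jdbc(
                "jdbc:sqlserver://server\\dbname",  // hypothetical server\database
                "dbo.sourceTable",                  // hypothetical table name
                "id",                               // hypothetical partition column
                1L,                                 // lowerBound
                1000000L,                           // upperBound
                10,                                 // numberOfPartitions
                properties);

        // Example map operation: turn one column of each row into a String.
        Dataset<String> mapped = dfResult.map(
                (MapFunction<Row, String>) row -> row.getAs("id").toString(),
                Encoders.STRING());

        mapped.show();
        sparksession.stop();
    }
}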
When running the application in standalone mode, I see that Spark creates a separate connection for each RDD partition. From the API description, I understand that Spark takes care of closing the connection.
Is there a way to reuse the connection instead of opening and closing a JDBC connection for each RDD partition?
Thanks
Recommended answer
Even when you're pushing data manually into a database over an API, the recommendation I often see is to create one connection per partition:
// pseudo-code: open one connection per partition, then close it when the partition is done
rdd.foreachPartition { iterator =>
  val connection = SomeAPI.connect()
  for (i <- iterator)
    connection.insert(i)
  connection.close()
}
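For example, a concrete version of the same pattern using plain JDBC from Java might look like the sketch below. The connection URL, credentials, target table and column positions are hypothetical placeholders; the point is only to show one connection being opened, used and closed per partition.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class PerPartitionWriter {
    public static void writePerPartition(Dataset<Row> dfResult) {
        dfResult.foreachPartition((ForeachPartitionFunction<Row>) iterator -> {
            // One connection per partition, created on whichever executor evaluates it.
            Connection connection = DriverManager.getConnection(
                    "jdbc:sqlserver://server\\dbname", "dbuser", "dbpassword"); // hypothetical
            try (PreparedStatement stmt = connection.prepareStatement(
                    "INSERT INTO targetTable (id, value) VALUES (?, ?)")) {     // hypothetical table
                while (iterator.hasNext()) {
                    Row row = iterator.next();
                    stmt.setLong(1, row.getLong(0));     // hypothetical column positions
                    stmt.setString(2, row.getString(1));
                    stmt.addBatch();
                }
                stmt.executeBatch();
            } finally {
                connection.close();
            }
        });
    }
}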
So if the JDBC reader is already doing this per partition, that just confirms this is the intended pattern.
Here's another example of this pattern being recommended:
I presume this is the recommended pattern because, in a multi-node cluster, you never know on which node a particular partition will be evaluated, so you want to make sure each partition has its own DB connection wherever it runs.