Problem description
I want to connect to a Presto server over JDBC from PySpark. I followed a tutorial written in Java, and I am trying to do the same thing in my Python 3 code, but I get an error:
: java.sql.SQLException: No suitable driver
The code I tried to execute:
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:presto://my_machine_ip:8080/hive/default") \
    .option("user", "airflow") \
    .option("dbtable", "may30_1") \
    .load()
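For reference, the url option above follows the Presto JDBC URL pattern jdbc:presto://host:port/catalog/schema. A small hypothetical helper (the name presto_jdbc_url is not from the original post) makes the parts explicit:

```python
def presto_jdbc_url(host: str, port: int, catalog: str, schema: str) -> str:
    """Build a Presto JDBC URL of the form jdbc:presto://host:port/catalog/schema."""
    return f"jdbc:presto://{host}:{port}/{catalog}/{schema}"

# Matches the URL used in the snippet above.
print(presto_jdbc_url("my_machine_ip", 8080, "hive", "default"))
# → jdbc:presto://my_machine_ip:8080/hive/default
```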
It should be noted that I am using Spark on EMR, so the spark session is already provided to me.
The code from the tutorial mentioned above is:
final String JDBC_DRIVER = "com.facebook.presto.jdbc.PrestoDriver";
final String DB_URL = "jdbc:presto://localhost:9000/catalogName/schemaName";
// Database credentials
final String USER = "username";
final String PASS = "password";
Connection conn = null;
Statement stmt = null;
try {
//Register JDBC driver
Class.forName(JDBC_DRIVER);
Note the JDBC_DRIVER assignment in the code above; I have not been able to work out the corresponding setting in Python 3, i.e. in PySpark.
Nor have I added any dependency in any configuration whatsoever.
I expect to connect to my Presto successfully. Right now I am getting the following error; the complete stack trace:
Traceback (most recent call last):
File "<stdin>", line 5, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 172, in load
return self._df(self._jreader.load())
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o90.load.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:104)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Recommended answer
You need to perform the following steps to connect Presto to PySpark.
Download the Presto JDBC driver jar from the official website and put it on the Spark master node.
Start the pyspark shell with the following parameters (note that --driver-class-path takes the path to the jar, not the driver class name):
bin/pyspark --driver-class-path /path/to/presto/jdbc/driver/jar/file --jars /path/to/presto/jdbc/driver/jar/file
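The same jar wiring works for batch jobs; a sketch of the equivalent spark-submit invocation (my_job.py and the jar path are placeholders, not from the original answer):

```shell
# Pass the Presto JDBC jar both to the driver classpath (--driver-class-path)
# and to the executors (--jars); missing either can trigger "No suitable driver".
spark-submit \
  --driver-class-path /path/to/presto/jdbc/driver/jar/file \
  --jars /path/to/presto/jdbc/driver/jar/file \
  my_job.py
```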
- Try connecting to Presto using Spark:
# The "driver" option names the Presto driver class, which replaces the
# Class.forName(JDBC_DRIVER) call from the Java tutorial.
jdbcDF = spark.read \
    .format("jdbc") \
    .option("driver", "com.facebook.presto.jdbc.PrestoDriver") \
    .option("url", "jdbc:presto://<machine_ip>:8080/hive/default") \
    .option("user", "airflow") \
    .option("dbtable", "may30_1") \
    .load()
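As a readability variant (a sketch, not from the original answer), the same reader options can be collected in a dict and passed with options(**...); the "driver" key is the addition that resolves the "No suitable driver" error:

```python
# Reader options for Spark's JDBC data source; values mirror the snippet above.
presto_options = {
    "driver": "com.facebook.presto.jdbc.PrestoDriver",
    "url": "jdbc:presto://<machine_ip>:8080/hive/default",
    "user": "airflow",
    "dbtable": "may30_1",
}

# With a live SparkSession named spark, the read becomes:
# jdbcDF = spark.read.format("jdbc").options(**presto_options).load()
print(presto_options["driver"])
# → com.facebook.presto.jdbc.PrestoDriver
```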