Problem Description
By default, spark_read_jdbc() reads an entire database table into Spark. I've used the following syntax to create these connections.
library(sparklyr)
library(dplyr)

# Put the MySQL connector jar on the Spark driver class path
config <- spark_config()
config$`sparklyr.shell.driver-class-path` <- "mysql-connector-java-5.1.43/mysql-connector-java-5.1.43-bin.jar"

sc <- spark_connect(master = "local",
                    version = "1.6.0",
                    hadoop_version = "2.4",
                    config = config)
db_tbl <- sc %>%
  spark_read_jdbc(sc = .,
                  name = "table_name",
                  options = list(url = "jdbc:mysql://localhost:3306/schema_name",
                                 user = "root",
                                 password = "password",
                                 dbtable = "table_name"))
However, I've now encountered a scenario where I have a table in a MySQL database and I would prefer to read only a subset of it into Spark.
How do I get spark_read_jdbc to accept a predicate? I've tried adding the predicate to the options list without success:
db_tbl <- sc %>%
  spark_read_jdbc(sc = .,
                  name = "table_name",
                  options = list(url = "jdbc:mysql://localhost:3306/schema_name",
                                 user = "root",
                                 password = "password",
                                 dbtable = "table_name",
                                 predicates = "field > 1"))
Recommended Answer
You can replace dbtable with a query:
db_tbl <- sc %>%
  spark_read_jdbc(sc = .,
                  name = "table_name",
                  options = list(url = "jdbc:mysql://localhost:3306/schema_name",
                                 user = "root",
                                 password = "password",
                                 dbtable = "(SELECT * FROM table_name WHERE field > 1) as my_query"))
But with a simple condition like this, Spark should push it down automatically when you filter:
db_tbl %>% filter(field > 1)
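To confirm the filter is applied to the lazy table reference rather than after a full read, you can render the SQL that dplyr generates for Spark. This uses show_query(), a standard dplyr generic that works on sparklyr table references:

db_tbl %>%
  filter(field > 1) %>%
  show_query()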
Just make sure to set memory = FALSE in spark_read_jdbc; with the default memory = TRUE, the table is cached eagerly, which materializes the entire JDBC read before your filter runs.
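Putting the two suggestions together, here is a minimal sketch; the compute() step and the name "table_subset" are illustrative additions, not part of the original answer:

# Lazy JDBC read: memory = FALSE avoids eagerly caching the full table
db_tbl <- spark_read_jdbc(sc,
                          name = "table_name",
                          options = list(url = "jdbc:mysql://localhost:3306/schema_name",
                                         user = "root",
                                         password = "password",
                                         dbtable = "table_name"),
                          memory = FALSE)

# The filter can be pushed into the JDBC scan; compute() then materializes
# only the filtered rows as a Spark table named "table_subset"
subset_tbl <- db_tbl %>%
  filter(field > 1) %>%
  compute("table_subset")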