Problem Description
I am importing parquet files in Databricks using SparkR and sparklyr.
# SparkR import
data1 = SparkR::read.df("dbfs:/.../data202007*", source = "parquet", header = TRUE, inferSchema = TRUE)
# sparklyr import
data1 = sparklyr::spark_read_parquet(sc = sc, path = "dbfs:/.../data202007*")
The time difference for import is humongous: 6 seconds for SparkR vs 11 minutes for sparklyr! Is there a way to reduce the time taken in sparklyr? I am more familiar with dplyr syntax, and therefore with sparklyr as well.
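For reference, the timings above can be reproduced with base R's system.time() (this assumes both a SparkR session and a sparklyr connection sc are already set up, as they are on a Databricks cluster):

# Time the SparkR import
system.time(
  SparkR::read.df("dbfs:/.../data202007*", source = "parquet",
                  header = TRUE, inferSchema = TRUE)
)
# Time the sparklyr import
system.time(
  sparklyr::spark_read_parquet(sc = sc, path = "dbfs:/.../data202007*")
)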
Recommended Answer
By default, sparklyr::spark_read_parquet caches the results (memory = TRUE).
Compare the following for cached results:
SparkR::cache(SparkR::read.df("dbfs:/.../data202007*", source = "parquet", header = TRUE, inferSchema = TRUE))
sparklyr::spark_read_parquet(sc = sc, path = "dbfs:/.../data202007*")
And here is the uncached version:
SparkR::read.df("dbfs:/.../data202007*", source = "parquet", header = TRUE, inferSchema = TRUE)
sparklyr::spark_read_parquet(sc = sc, path = "dbfs:/.../data202007*", memory = FALSE)
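If the data should still end up cached but you want the import call itself to return quickly, one option is to read lazily with memory = FALSE and cache explicitly once the table is actually needed. A minimal sketch follows; the table name "data1" is chosen here purely for illustration:

# Read lazily: the parquet files are only registered, not materialized,
# so this call returns quickly.
data1 <- sparklyr::spark_read_parquet(sc = sc, name = "data1",
                                      path = "dbfs:/.../data202007*",
                                      memory = FALSE)
# Cache explicitly later; tbl_cache() stores the named table in memory.
sparklyr::tbl_cache(sc, "data1")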