Problem description
I have a cluster on AWS with 2 slaves and 1 master. All instances are of type m1.large. I'm running Spark version 1.4. I'm benchmarking the performance of Spark over 4m data coming from Redshift. I fired one query through the pyspark shell:
df = sqlContext.load(source="jdbc", url="connection_string", dbtable="table_name", user='user', password="pass")
df.registerTempTable('test')
d = sqlContext.sql("""
    select user_id from (
        select -- (i1)
            sum(total),
            user_id
        from
            (select -- (i2)
                avg(total) as total,
                user_id
            from
                test
            group by
                order_id,
                user_id) as a
        group by
            user_id
        having sum(total) > 0
    ) as b
""")
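To make concrete what the query computes, here is a minimal sketch that runs the same SQL against a tiny in-memory SQLite table instead of Spark. The sample rows and the column types are made up for illustration and are not from the original data; only the query text follows the question.

```python
import sqlite3

# Hypothetical sample data: (order_id, user_id, total)
rows = [
    (1, "u1", 10.0),
    (1, "u1", 20.0),   # order 1 averages to 15.0
    (2, "u1", -5.0),   # u1: 15.0 + (-5.0) = 10.0, kept
    (3, "u2", -4.0),
    (4, "u2", 1.0),    # u2: -4.0 + 1.0 = -3.0, filtered out
    (5, "u3", 0.0),    # u3: 0.0 is not > 0, filtered out
]

conn = sqlite3.connect(":memory:")
conn.execute("create table test (order_id integer, user_id text, total real)")
conn.executemany("insert into test values (?, ?, ?)", rows)

# Same shape as the query in the question:
# inner query (i2) averages total per (order_id, user_id),
# outer query (i1) sums those averages per user and keeps positive sums.
query = """
select user_id from (
    select
        sum(total),
        user_id
    from
        (select
            avg(total) as total,
            user_id
        from
            test
        group by
            order_id,
            user_id) as a
    group by
        user_id
    having sum(total) > 0
) as b
"""

user_ids = [row[0] for row in conn.execute(query)]
print(user_ids)  # ['u1']
```

The point is that the query needs two aggregation passes over the whole table (one per order and user, one per user), so every row has to be read and grouped before any result can be returned.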
When I do d.count(), the above query takes 30 sec when df is not cached and 17 sec when df is cached in memory.
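The cached versus uncached comparison can be reproduced with a small timing helper. This is only a sketch: `timed` and `fake_count` are hypothetical names introduced here, and `fake_count` is a stand-in workload so the block runs without a cluster.

```python
import time

def timed(action, repeats=3):
    """Call `action` `repeats` times; return (last_result, seconds_per_call)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        durations.append(time.perf_counter() - start)
    return result, durations

# Stand-in for d.count(); replace with the real action in the pyspark shell.
def fake_count():
    return sum(1 for _ in range(100000))

result, durations = timed(fake_count)
print(result, ["%.4f s" % s for s in durations])
```

In the pyspark shell the call would be `timed(d.count)`. Because `df.cache()` is lazy, the first run after caching still includes the JDBC read that fills the cache, so the later runs are the ones that reflect the cached case.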
I'm expecting these timings to be closer to 1-2s.
These are my Spark configurations:
spark.executor.memory 6154m
spark.driver.memory 3g
spark.shuffle.spill false
spark.default.parallelism 8
The rest is set to its default values. Can anyone see what I'm missing here?
Recommended answer
This is normal; don't expect Spark to run in a few milliseconds like MySQL or PostgreSQL do. Spark is low latency compared to other big data solutions like Hive, Impala... You cannot compare it with a classic database; Spark is not a database where data are indexed!
Watch this video: https://www.youtube.com/watch?v=8E0cVWKiuhk
They clearly put Spark here:
Did you try Apache Drill? I found it a bit faster (I use it for small HDFS JSON files, 2/3 GB, and it is much faster than Spark for SQL queries).