Spark query running very slow

Problem description

I have a cluster on AWS with 2 slaves and 1 master. All instances are of type m1.large. I'm running Spark version 1.4. I'm benchmarking the performance of Spark over 4m data coming from Redshift. I fired one query through the pyspark shell:

    df = sqlContext.load(source="jdbc", url="connection_string", dbtable="table_name", user='user', password="pass")
    df.registerTempTable('test')
    d = sqlContext.sql("""
        select user_id from (
            select -- (i1)
                sum(total),
                user_id
            from
                (select -- (i2)
                    avg(total) as total,
                    user_id
                from
                    test
                group by
                    order_id,
                    user_id) as a
            group by
                user_id
            having sum(total) > 0
        ) as b
    """)
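To make the intent of the query concrete, here is a minimal plain-Python sketch of the same two-level aggregation. It does not use Spark. The sample rows and the function name `users_with_positive_total` are hypothetical and only serve as illustration; the column order `(order_id, user_id, total)` is an assumption about the table.

```python
from collections import defaultdict

# Hypothetical sample rows: (order_id, user_id, total)
rows = [
    (1, "u1", 10.0),
    (1, "u1", 20.0),
    (2, "u1", 5.0),
    (3, "u2", -4.0),
    (4, "u2", 4.0),
]

def users_with_positive_total(rows):
    # Inner query (i2): avg(total) per (order_id, user_id)
    per_order = defaultdict(list)
    for order_id, user_id, total in rows:
        per_order[(order_id, user_id)].append(total)
    order_avg = {key: sum(vals) / len(vals) for key, vals in per_order.items()}

    # Outer query (i1): sum of the per-order averages per user_id
    per_user = defaultdict(float)
    for (_, user_id), avg_total in order_avg.items():
        per_user[user_id] += avg_total

    # having sum(total) > 0
    return sorted(user for user, total in per_user.items() if total > 0)

result = users_with_positive_total(rows)
print(result)
```

Both levels group on `user_id`, so in Spark each `group by` generally implies a shuffle stage on top of the JDBC read.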

When I run d.count(), the above query takes 30 seconds when df is not cached and 17 seconds when df is cached in memory.
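One way to take such measurements reproducibly is a small timing helper like the sketch below. The helper name `timed` and its `repeats` parameter are hypothetical, and the workload here is a stand-in so that the snippet runs without a cluster. It assumes Python 3 for `time.perf_counter`.

```python
import time

def timed(action, repeats=3):
    """Call `action` several times; return (last_result, list_of_seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        durations.append(time.perf_counter() - start)
    return result, durations

# Stand-in workload so the sketch runs anywhere;
# in the pyspark shell one would pass `d.count` instead.
result, durations = timed(lambda: sum(range(100000)))
print(result, [round(seconds, 4) for seconds in durations])
```

In the pyspark shell, calling `df.cache()` first and then `timed(d.count)` would presumably show the first run paying for the JDBC read and the later runs reading from memory, which separates load time from query time.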

I'm expecting these timings to be closer to 1-2 seconds.

These are my Spark configurations:

spark.executor.memory 6154m
spark.driver.memory 3g
spark.shuffle.spill false
spark.default.parallelism 8

The rest is set to its default values. Can anyone see what I'm missing here?

Recommended answer