火花马preduce了解组

火花马preduce了解组

本文介绍了火花马preduce了解组(蟒蛇)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想,我正在考虑雇员的数据集,并试图计算分布在各个部门的工资总和的小程序。我有一个重复的例子。任何人都可以请给我解释一下发生了什么。

I am trying a small program where I am considering an employee dataset and trying to calculate sum of salaries distributed in various departments. I have a reproducible example. Can anybody please explain me what is happening.

 emp_list=[(u'ACC', [u'101', u'a', u'ACC', u'1000']),
 (u'SALES', [u'102', u'b', u'SALES', u'2000']),
 (u'IT', [u'103', u'c', u'IT', u'3000']),
 (u'ACC', [u'104', u'd', u'ACC', u'4000']),
 (u'ACC', [u'105', u'e', u'ACC', u'5000']),
 (u'HR', [u'106', u'f', u'HR', u'6000']),
 (u'ACC', [u'107', u'g', u'ACC', u'7000']),
 (u'FIN', [u'108', u'h', u'FIN', u'8000']),
 (u'ACC', [u'109', u'k', u'ACC', u'9000']),
 (u'HR', [u'1010', u'l', u'HR', u'10000']),
 (u'ACC', [u'1011', u'm', u'ACC', u'11000']),
 (u'ACC', [u'1012', u'n', u'ACC', u'12000']),
 (u'FIN', [u'1013', u'o', u'FIN', u'13000']),
 (u'IT', [u'1014', u'p', u'IT', u'14000'])]

emp=sc.parallelize(emp_list)

emp.reduceByKey(lambda x,y : x[3]+y[3]).take(10)

输出我得到的是:

[(u'ACC', u'00'),
 (u'HR', u'600010000'),
 (u'FIN', u'800013000'),
 (u'SALES', [u'102', u'b', u'SALES', u'2000']),
 (u'IT', u'300014000')]

任何人能请解释一下我为什么,我为 ACC 销售部门越来越奇怪值。我希望看到这两个以及串联的工资。

Can anybody please explain me why I am getting strange values for ACC and SALES departments. I want to see the concatenated salaries for these two as well.

推荐答案

您得到奇怪的值,因为你的函数的逻辑是无效的。如果你使用Scala的,而不是Python的这甚至不会编译。当您应用 reduceByKey LHS和RHS和返回类型应该是同一类型的:

You get strange values because logic of your function is invalid. If you use Scala instead of Python this wouldn't even compile. When you apply reduceByKey LHS and RHS and return type should be of the same type:

reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]

FUNC 应该是关联的。

在你的情况类型不匹配(输入列表,返回类型为字符串)和功能并不关联。为了理解正在发生的事情让我们考虑两种不同的情况:

In your case types don't match (input is a list and return type is a string) and function is not associative. To understand what is going on lets consider two different cases:


  1. 只有一个每个键值。由于 FUNC 不适用你得到这个值作为输出。因此,(u'SALES',[u'102',u'b',u'SALES',u'2000'])

  1. Only one value per key. Since func is not applied you get this value as an output. Hence (u'SALES', [u'102', u'b', u'SALES', u'2000'])

每个键多个值。让我们值的子集从 ACC 为例,假设如下的操作顺序定义

Multiple values per key. Lets take a subset of values from ACC as an example and assume order of operations is defined as follows

(
  # 1st partition
  ([u'101', u'a', u'ACC', u'1000'], [u'104', u'd', u'ACC', u'4000']),
  # 2nd partition
  ([u'105', u'e', u'ACC', u'5000'], [u'107', u'g', u'ACC', u'7000'])
)

后的第一个应用程序的 FUNC 我们得到:

(
   u'10004000',
   ([u'105', u'e', u'ACC', u'5000'], [u'107', u'g', u'ACC', u'7000'])
)

后的第二个应用程序的 FUNC 我们得到

(
   u'10004000',
   u'50007000'
)

最后

u'00'

在实践中的圆括号可以根据配置,这样就可以得到不同的输出有所不同。

In practice parenthesizing can vary depending on configuration so you can get different outputs.

要得到你应该使用正确的结果 aggregateByKey / combineByKey 地图 + 减少按@alexs的建议或地图然后按 groupByKey mapValues​​ 。最后一个应当是最有效的方法在这里,因为它不需要中间对象

To get correct results you should use aggregateByKey / combineByKey, map + reduce as suggested by @alexs or map followed by groupByKey and mapValues. The last one should be the most efficient approach here since it doesn't require intermediate objects:

emp.mapValues(lambda x: x[3]).groupByKey().mapValues(lambda xs: "".join(xs))

有关使用引用同样的事情 aggregateByKey

For reference the same thing using aggregateByKey:

from operator import add

rdd.aggregateByKey("", lambda acc, x: acc + x[3], add)

这篇关于火花马preduce了解组(蟒蛇)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-24 03:11