Spark 1.3.1: mapping data to key-value[] pairs in Java

Question

I have a flat file with the following structure:

key1|"value-001"
key2|"value-002"
key2|"value-003"
key3|"value-004"
key2|"value-005"
key1|"value-006"
key3|"value-007"

I need to map this data file to key-value pairs, where each value is the list of all values for that key, such as:

key1:["value-001","value-006"]
key2:["value-002","value-003","value-005"]
key3:["value-004","value-007"]

I need to do this from Java code. As I understand from the Spark Programming Guide, this operation should be implemented with sc.flatMapValues(..), sc.flatMap(..) or sc.groupByKey(..), but I don't know which one. How do I do this?

Recommended answer

I would recommend reduceByKey :) This list imitates your input:

List<String> input = Arrays.asList(
    "key1|value-001",
    "key2|value-002",
    "key2|value-003",
    "key3|value-004",
    "key2|value-005",
    "key1|value-006",
    "key3|value-007");

Converting to an RDD (you would of course just read your file in with sc.textFile()):

JavaRDD<String> rdd = javaSparkContext.parallelize(input);

We now have an RDD of strings. The following maps to key-value pairs (note the value is being added to a list) and then reduceByKey combines all values for each key into a list, yielding the result you want.

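// Needs imports: java.util.ArrayList, java.util.List, scala.Tuple2
JavaPairRDD<String, List<String>> keyValuePairs = rdd.mapToPair(line -> {
        // split() takes a regex, so the pipe must be escaped; limit 2 keeps later pipes in the value
        String[] split = line.split("\\|", 2);
        // Use a growable ArrayList: Arrays.asList returns a fixed-size list, on which addAll throws
        List<String> values = new ArrayList<>();
        values.add(split[1]);
        return new Tuple2<String, List<String>>(split[0], values);
    });

// Concatenate the per-key lists
JavaPairRDD<String, List<String>> result = keyValuePairs.reduceByKey((v1, v2) -> {
        v1.addAll(v2);
        return v1;
    });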
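
Two plain-Java details matter for the mapping step, independent of Spark: String.split interprets its argument as a regular expression, so a bare "|" does not split on the pipe character, and Arrays.asList returns a fixed-size list, so calling addAll on it fails at runtime. The following is a minimal sketch that shows both behaviours without a Spark cluster; the class and method names (PairParsingDemo, splitLine, singletonMutableList) are hypothetical and used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration (no Spark needed); all names here are hypothetical.
class PairParsingDemo {

    // Splits "key|value" on the first literal pipe. The pipe is escaped because
    // String.split treats its argument as a regular expression.
    static String[] splitLine(String line) {
        return line.split("\\|", 2);
    }

    // Wraps a single value in a growable list so that addAll can be called on it later.
    static List<String> singletonMutableList(String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        return values;
    }

    public static void main(String[] args) {
        String line = "key1|value-001";

        // An unescaped "|" is an empty alternation, which matches between characters,
        // so the line is broken into many small pieces instead of two.
        System.out.println(line.split("|").length);
        System.out.println(Arrays.toString(splitLine(line)));

        // Arrays.asList is backed by the array it wraps: its size is fixed.
        List<String> fixed = Arrays.asList("value-001");
        try {
            fixed.addAll(Arrays.asList("value-006"));
        } catch (UnsupportedOperationException e) {
            System.out.println("fixed-size list rejected addAll");
        }

        // A list built with ArrayList accepts further values.
        List<String> growable = singletonMutableList("value-001");
        growable.addAll(Arrays.asList("value-006"));
        System.out.println(growable);
    }
}
```

The same two helpers could be used inside the mapToPair lambda, so that the lists handed to reduceByKey can be merged with addAll.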

I feel I should mention that you could also use a groupByKey. However, you usually want to favor reduceByKey over groupByKey because reduceByKey does a map-side reduce BEFORE shuffling the data around, whereas groupByKey shuffles everything around. In your particular case, you will probably end up shuffling the same amount of data around as with a groupByKey since you want all values to be gathered, but using reduceByKey is just a better habit to be in :)
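
To make the point about shuffle volume concrete, here is a rough plain-Java simulation of a map-side combine followed by a merge. It is not Spark code and not how Spark implements its shuffle; the split of the seven lines into two partitions is an assumption, and the names ShuffleSketch, combinePartition and mergeAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation, not Spark code; all names here are hypothetical.
class ShuffleSketch {

    // Map-side combine: merge the values of each key within one partition,
    // so that the partition emits at most one record per key.
    static Map<String, List<String>> combinePartition(List<String> lines) {
        Map<String, List<String>> combined = new TreeMap<>();
        for (String line : lines) {
            String[] kv = line.split("\\|", 2);
            combined.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return combined;
    }

    // Reduce side: concatenate the per-partition lists for each key.
    static Map<String, List<String>> mergeAll(List<Map<String, List<String>>> partials) {
        Map<String, List<String>> result = new TreeMap<>();
        for (Map<String, List<String>> partial : partials) {
            partial.forEach((k, v) -> result.computeIfAbsent(k, x -> new ArrayList<>()).addAll(v));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two pretend partitions holding the seven input lines (assumed layout).
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("key1|value-001", "key2|value-002", "key2|value-003", "key3|value-004"),
            Arrays.asList("key2|value-005", "key1|value-006", "key3|value-007"));

        List<Map<String, List<String>>> partials = new ArrayList<>();
        int recordsSent = 0;
        int valuesSent = 0;
        for (List<String> partition : partitions) {
            Map<String, List<String>> partial = combinePartition(partition);
            partials.add(partial);
            recordsSent += partial.size();
            for (List<String> values : partial.values()) {
                valuesSent += values.size();
            }
        }
        System.out.println("records sent after combining: " + recordsSent);
        System.out.println("values sent after combining: " + valuesSent);
        System.out.println(mergeAll(partials));
    }
}
```

With this assumed layout, combining reduces the number of records from seven to six, but all seven values still have to travel, because a list of values cannot be shrunk the way a sum or a count can. That is why the saving is small when the goal is to collect every value.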
