python - PySpark inverted index

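I am building an inverted index of documents, where the output should contain each word (from the text files) followed by all the files it appears in. Something like: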

  [word1: file1.txt file2.txt] [word2: file2.txt file3.txt]

I have written the code, but it throws this error:

  for k, v in iterator:
  TypeError: <lambda>() takes exactly 2 arguments (1 given)

Code:

from pyspark import SparkContext
sc = SparkContext("local", "app")

path = '/ebooks'
rdd = sc.wholeTextFiles(path)

output = rdd.flatMap(lambda (file,contents):contents.lower().split())\
            .map(lambda file,word: (word,file))\
            .reduceByKey(lambda a,b: a+b)
print output.take(10)

I can't figure out how to emit both the key and the value (the word and the file name) in the map step. How should I do this?

In MapReduce, you can emit (word, key) pairs (where key is the file name), but how can this be done in Spark?

Best answer

I haven't tested this on dummy data, but looking at your code, I think the following modification should work:

output = rdd.flatMap(lambda (file,contents):[(file, word) for word in contents.lower().split()])\
      .map(lambda (file, word): (word,[file]))\
      .reduceByKey(lambda a,b: a+b)
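
For context on why the original raised the error: `flatMap` with `contents.lower().split()` emits bare words, so the file name is already lost by the time `map` runs, and `map` then calls the two-argument lambda with a single element. Tuple-unpacking lambdas such as `lambda (file, contents): ...` are also Python 2 only syntax. Below is a minimal sketch of the same idea in a form that should also run on Python 3. The helper names `emit_pairs`, `merge_files`, and `build_index` are hypothetical, the sample records are made up for illustration, and the `set()` de-duplication is an addition that is not part of the answer above. The Spark call is wrapped in a function so that the pairing logic can be checked locally without a cluster.

```python
def emit_pairs(record):
    """Turn one (file, contents) record into a list of (word, [file]) pairs."""
    file, contents = record  # unpack inside the body; works on Python 2 and 3
    # set() lists each file at most once per word (an assumption, not in the answer)
    return [(word, [file]) for word in set(contents.lower().split())]


def merge_files(a, b):
    """Combine two file lists that belong to the same word."""
    return a + b


def build_index(rdd):
    """rdd is assumed to hold (file, contents) pairs, e.g. from sc.wholeTextFiles(path)."""
    return rdd.flatMap(emit_pairs).reduceByKey(merge_files)


# Local check of the pairing logic on made-up records, no Spark required.
records = [("file1.txt", "Spark is fast"), ("file2.txt", "spark is fun")]
index = {}
for record in records:
    for word, files in emit_pairs(record):
        index[word] = merge_files(index.get(word, []), files)

print(index["spark"])
```

With a real context this would be called as `build_index(sc.wholeTextFiles(path)).take(10)`. Note that the keys returned by `wholeTextFiles` are full file paths rather than bare file names.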

Regarding python - PySpark inverted index, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47657531/
