Broadcasting an Annoy object in Spark (for nearest neighbors)?


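Problem description

Spark's mllib has no nearest-neighbors functionality, so I'm trying to use Annoy for approximate nearest neighbors. I tried to broadcast the Annoy object and pass it to the workers; however, it does not work as expected.

Below is code to reproduce the problem (run in PySpark). The problem shows up as the difference between the results of using Annoy with Spark and without Spark.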

from annoy import AnnoyIndex
import random
random.seed(42)

f = 40
t = AnnoyIndex(f)  # Length of item vector that will be indexed
allvectors = []
for i in xrange(20):
    v = [random.gauss(0, 1) for z in xrange(f)]
    t.add_item(i, v)
    allvectors.append((i, v))
t.build(10)  # 10 trees

# Use Annoy with Spark
sparkvectors = sc.parallelize(allvectors)
bct = sc.broadcast(t)
x = sparkvectors.map(lambda x: bct.value.get_nns_by_vector(vector=x[1], n=5))
print "Five closest neighbors for first vector with Spark:",
print x.first()

# Use Annoy without Spark
print "Five closest neighbors for first vector without Spark:",
print(t.get_nns_by_vector(vector=allvectors[0][1], n=5))

Output seen:

Solution

I've never used Annoy but I am pretty sure that the package description explains what is going on here:

Since it uses memory-mapped indexes, all data is lost on the way when you serialize it and pass it to the workers.
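As a rough analogy only (this uses the standard library's mmap module, not Annoy's internals, and the file name index.bin is made up for illustration), the sketch below shows that a memory-mapped handle does not survive pickling, while a plain file path does and can be reopened on the receiving side:

```python
import mmap
import os
import pickle
import tempfile

# Write some stand-in "index" bytes to disk (hypothetical file, not a real Annoy index).
path = os.path.join(tempfile.mkdtemp(), "index.bin")
with open(path, "wb") as fh:
    fh.write(b"0123456789")

def open_mapped(p):
    """Open a read-only memory map over the file at path p."""
    with open(p, "rb") as fh:
        return mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)

mapped = open_mapped(path)

# The mapping itself cannot be serialized ...
try:
    pickle.dumps(mapped)
    handle_pickles = True
except (TypeError, pickle.PicklingError):
    handle_pickles = False

# ... but the path can, and the receiver can reopen the file from it.
shipped_path = pickle.loads(pickle.dumps(path))
reopened = open_mapped(shipped_path)
same_bytes = reopened[:] == mapped[:]

mapped.close()
reopened.close()
```

The same idea, shipping the file and reopening it where it is needed, is what the Spark code further down does with the saved index.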

Try something like this instead:

from pyspark import SparkFiles

t.save("index.ann")
sc.addPyFile("index.ann")

def find_neighbors(iter):
    t = AnnoyIndex(f)
    t.load(SparkFiles.get("index.ann"))
    return (t.get_nns_by_vector(vector=x[1], n=5) for x in iter)

sparkvectors.mapPartitions(find_neighbors).first()
## [0, 13, 12, 6, 4]
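
The answer does not say why it uses mapPartitions rather than map. One plausible reason is that the index then gets loaded once per partition instead of once per record. The pure-Python sketch below imitates that difference without Spark or Annoy; load_index, per_record and per_partition are hypothetical stand-ins, and the "query" is just a sum:

```python
load_count = 0

def load_index():
    """Hypothetical stand-in for AnnoyIndex(f) followed by t.load(...).

    Counts how many times it is called and returns a fake query function.
    """
    global load_count
    load_count += 1
    return lambda vec: sum(vec)  # fake "query", not a real neighbor search

def per_record(record):
    # map-style: the index is loaded again for every single record
    index = load_index()
    return index(record)

def per_partition(records):
    # mapPartitions-style: load once, then reuse for all records in the partition
    index = load_index()
    return (index(r) for r in records)

# Two made-up partitions holding three records in total
partitions = [[(1, 2), (3, 4)], [(5, 6)]]

load_count = 0
mapped = [per_record(r) for part in partitions for r in part]
loads_with_map = load_count

load_count = 0
partitioned = [out for part in partitions for out in per_partition(part)]
loads_with_map_partitions = load_count
```

Both variants produce the same results; only the number of loads differs (one per record versus one per partition).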
