This article covers how to pickle Spacy for use with Python/PySpark and should be a useful reference for anyone facing the same problem. If that's you, read on!
Problem description
The documentation for Spacy 2.0 mentions that the developers have added functionality to allow Spacy to be pickled so that it can be used by a Spark cluster interfaced by PySpark; however, they don't give instructions on how to do this.
Can someone explain how I can pickle Spacy's English-language NE parser to be used inside my udf functions?
This doesn't work:
from pyspark import cloudpickle
from spacy.lang.en import English  # import missing from the original snippet

nlp = English()
pickled_nlp = cloudpickle.dumps(nlp)  # this is the call that fails
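For what it's worth, Spacy 2.0 does ship its own serialization API, Language.to_bytes() and Language.from_bytes(), which is presumably the functionality the documentation alludes to. A minimal sketch of shipping the model to executors through a Spark broadcast variable, instead of pickling the nlp object directly, might look like the following; the rebuild-then-from_bytes pattern follows Spacy 2's serialization docs, and sc is assumed to be an existing SparkContext:

import spacy

# On the driver: serialize the loaded pipeline with Spacy's own API.
nlp = spacy.load('en')
nlp_bytes = nlp.to_bytes()
pipe_names = nlp.meta['pipeline']  # e.g. ['tagger', 'parser', 'ner']
bc_nlp = sc.broadcast(nlp_bytes)   # sc: an existing SparkContext (assumption)

def restore_nlp():
    # On an executor: rebuild an empty pipeline with the same components,
    # then load the broadcast bytes into it.
    nlp2 = spacy.blank('en')
    for name in pipe_names:
        nlp2.add_pipe(nlp2.create_pipe(name))
    return nlp2.from_bytes(bc_nlp.value)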
Recommended answer
Not really an answer, but the best workaround I've discovered:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
import spacy

def get_entities_udf():
    def get_entities(text):
        global nlp
        try:
            doc = nlp(unicode(text))  # Python 2; use str(text) on Python 3
        except NameError:
            # nlp doesn't exist yet in this worker process:
            # load the model once, then reuse it for every later call
            nlp = spacy.load('en')
            doc = nlp(unicode(text))
        return [t.label_ for t in doc.ents]
    res_udf = udf(get_entities, ArrayType(StringType()))  # original had StringType(ArrayType()), which is invalid
    return res_udf

documents_df = documents_df.withColumn('entities', get_entities_udf()('text'))
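The point of the try/except is that the model is never shipped from the driver at all: the first row processed in each Python worker raises NameError, which triggers a local spacy.load('en') whose result is cached in a module-level global for all subsequent rows. The same load-once idea can be written without the global by dropping down to the RDD API with mapPartitions; a short sketch (function and variable names here are illustrative, and documents_df is the DataFrame from the answer above):

import spacy

def tag_partition(texts):
    # Load the model once per partition, then reuse it for every row in it.
    nlp = spacy.load('en')
    for text in texts:
        yield [ent.label_ for ent in nlp(text).ents]

entities_rdd = documents_df.rdd.map(lambda row: row['text']).mapPartitions(tag_partition)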
That concludes this article on pickling Spacy with Python/PySpark. We hope the answer above helps, and thank you for your support!