This article covers how to pickle spaCy for use with Python/PySpark; it may be a useful reference if you are facing the same problem.

Problem description

The documentation for spaCy 2.0 mentions that the developers have added functionality to allow spaCy to be pickled so that it can be used by a Spark cluster interfaced through PySpark; however, it gives no instructions on how to do this.

Can someone explain how I can pickle spaCy's English-language NE parser so it can be used inside my UDF functions?

This does not work:

from pyspark import cloudpickle
from spacy.lang.en import English  # spaCy 2.x import path for the blank English pipeline

nlp = English()
pickled_nlp = cloudpickle.dumps(nlp)  # fails: the pipeline does not survive cloudpickle

Recommended answer

Not really an answer, but the best workaround I've discovered:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
import spacy

def get_entities_udf():
    def get_entities(text):
        global nlp
        try:
            # Reuse the model if this worker process has already loaded it
            doc = nlp(str(text))
        except NameError:
            # First call on this worker: load the model once, then parse
            nlp = spacy.load('en')
            doc = nlp(str(text))
        return [t.label_ for t in doc.ents]
    # The UDF returns a list of entity labels, i.e. an array of strings
    res_udf = udf(get_entities, ArrayType(StringType()))
    return res_udf

documents_df = documents_df.withColumn('entities', get_entities_udf()('text'))
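For context, here is a minimal self-contained sketch of how the workaround might be used end to end. It assumes the spaCy 'en' model is installed on every executor; the SparkSession setup, the sample sentence, and the expected labels are illustrative and not part of the original answer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('spacy-ner-example').getOrCreate()

# Illustrative one-column DataFrame matching the 'text' column the UDF expects
documents_df = spark.createDataFrame(
    [('Apple is looking at buying a U.K. startup.',)],
    ['text'],
)

documents_df = documents_df.withColumn('entities', get_entities_udf()('text'))
documents_df.show(truncate=False)
# For this sentence, spaCy would typically report labels like ORG and GPE

The trick is that the model is never serialized at all: each Python worker process hits the NameError once, loads the model into its own module-level nlp global, and reuses it for every subsequent row it processes.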

That concludes this article on pickling spaCy for Python/PySpark; we hope the recommended answer above is helpful.
