Problem description
First off, I apologize if my issue is simple. I did spend a lot of time researching it.
I am trying to follow the example here. Here is my code:
from pyspark import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import SQLContext

sc.install_pypi_package("pandas")
import pandas as pd
sc.install_pypi_package("PyArrow")

df = spark.createDataFrame(
    [("a", 1, 0), ("a", -1, 42), ("b", 3, -1), ("b", 10, -2)],
    ("key", "value1", "value2")
)
df.show()

@F.pandas_udf("double", F.PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return pd.Series(v + 1)

df.select(pandas_plus_one(df.value1)).show()

# These alternatives also fail:
# df.select(pandas_plus_one(df["value1"])).show()
# df.select(pandas_plus_one("value1")).show()
# df.select(pandas_plus_one(F.col("value1"))).show()
The script fails at the last statement:
What am I missing here? I am just following the manual. Thanks for your help.
Recommended answer
PyArrow released a new version, 0.15, on October 5, 2019, which causes pandas UDFs to throw this error. Spark needs to be upgraded to be compatible with it (which might take some time). You can follow the progress here: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-29367?filter=allissues
Solution:
- Install PyArrow 0.14.1 or lower: sc.install_pypi_package("pyarrow==0.14.1"), or
- Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 wherever you are running Python.
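As a minimal sketch of the second workaround (assuming a notebook-style setup where the variable can be set before any pandas UDF runs; the Spark configuration keys mentioned in the comments are my assumption and should be checked against your deployment's docs):

```python
import os

# Workaround for pyarrow >= 0.15 with older Spark pandas UDFs:
# tell PyArrow to emit the legacy (pre-0.15) IPC stream format.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# On a cluster the variable must also reach the executors, e.g. via
# spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1 (and, on YARN,
# spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT=1); setting it
# only on the driver is not enough, since the UDF runs on executors.
```

Pinning the package instead (the first option) avoids the flag entirely, at the cost of staying on an older PyArrow.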
This concludes the article on the pandas scalar UDF failing with IllegalArgumentException; we hope the recommended answer above is helpful.