Error when including a column in a join between Spark dataframes
Problem description
I joined cleanDF and sentiment_df using array_contains, and that works fine. I then added a .select (36132322) to include the Year column from cleanDF:
Result1 = cleanDF.alias('a').join(sentiment_df.alias('b'), expr("""array_contains(a.MeaningfulWords,b.word)"""), how='left')\
.select(col('a.ID'),col('a.Year'),col('a.MeaningfulWords'),col('b.word'),col('b.score'))\
.groupBy("ID")\
.agg(first("a.MeaningfulWords").alias("MeaningfulWords")\
,collect_list("score").alias("ScoreList")\
,mean("score").alias("MeanScore"))
But Result1 ends up with the same columns as Result:
display(Result1)
#DataFrame[ID: string, MeaningfulWords: array<string>, ScoreList: array<double>, MeanScore: double]
When I try to include Year in the .agg function instead:
Result2 = cleanDF.join(sentiment_df, expr("""array_contains(MeaningfulWords,word)"""), how='left')\
.groupBy("ID")\
.agg(first("MeaningfulWords").alias("MeaningfulWords"),first("Year").alias("Year")\
,collect_list("score").alias("ScoreList")\
,mean("score").alias("MeanScore"))
Result2.show()
Py4JJavaError: An error occurred while calling o3205.showString.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:144)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:140)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:140)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doExecute(BroadcastNestedLoopJoinExec.scala:343)
...
...
...
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$1: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1066)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:109)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:107)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1063)
...
...
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 411.0 failed 1 times, most recent failure: Lost task 2.0 in stage 411.0 (TID 9719, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$5: (array<string>) => array<string>)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1066)
at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction$class.eval(higherOrderFunctions.scala:208)
at org.apache.spark.sql.catalyst.expressions.ArrayFilter.eval(higherOrderFunctions.scala:296)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
...
...
... 20 more
Caused by: java.lang.NullPointerException
I'm using PySpark on Spark 2.4.5.
Thanks in advance for your help.
Recommended answer
The Year column might contain null values, which would explain why the job fails with Caused by: java.lang.NullPointerException. Filter out all null values from the Year column.