PySpark: TypeError: condition should be string or Column
Question
I am trying to filter an RDD as shown below:
spark_df = sc.createDataFrame(pandas_df)
spark_df.filter(lambda r: str(r['target']).startswith('good'))
spark_df.take(5)
But I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-8-86cfb363dd8b> in <module>()
1 spark_df = sc.createDataFrame(pandas_df)
----> 2 spark_df.filter(lambda r: str(r['target']).startswith('good'))
3 spark_df.take(5)
/usr/local/spark-latest/python/pyspark/sql/dataframe.py in filter(self, condition)
904 jdf = self._jdf.filter(condition._jc)
905 else:
--> 906 raise TypeError("condition should be string or Column")
907 return DataFrame(jdf, self.sql_ctx)
908
TypeError: condition should be string or Column
Any idea what I missed? Thank you!
Recommended answer
DataFrame.filter, which is an alias for DataFrame.where, expects a SQL expression expressed either as a Column:
from pyspark.sql.functions import col
spark_df.filter(col("target").like("good%"))
or as an equivalent SQL string:
spark_df.filter("target LIKE 'good%'")
I believe you're trying to use RDD.filter here, which is a completely different method:
spark_df.rdd.filter(lambda r: r['target'].startswith('good'))
and does not benefit from SQL optimizations.