Problem description
I am brand new to PySpark and want to translate my existing pandas/Python code to PySpark.
I want to subset my DataFrame so that only the rows containing specific keywords in the 'original_problem' field are returned.
Below is the Python code I tried in PySpark:
def pilot_discrep(input_file):
    df = input_file
    searchfor = ['cat', 'dog', 'frog', 'fleece']
    df = df[df['original_problem'].str.contains('|'.join(searchfor))]
    return df
When I try to run the above, I get the following error:
AnalysisException: u"Can't extract value from original_problem#207: need struct type but got string;"
Recommended answer
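A PySpark Column has no pandas-style .str accessor; df['original_problem'].str is read as a lookup of a struct field named str, which is why the error asks for a struct type. In PySpark, try this instead: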
df = df[df['original_problem'].rlike('|'.join(searchfor))]
Or equivalently:
import pyspark.sql.functions as F
df = df.where(F.col('original_problem').rlike('|'.join(searchfor)))
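Note that rlike treats its argument as a regular expression, so a keyword containing characters such as + or . would change what the '|'.join(searchfor) pattern matches. The following is a minimal sketch, assuming the keywords are meant as literal substrings. The helper names build_keyword_pattern and filter_by_keywords are hypothetical (not part of PySpark), and the sketch assumes the output of Python's re.escape is acceptable to Spark's Java regex engine for typical keywords.

```python
import re


def build_keyword_pattern(keywords):
    """Join keywords into one alternation regex, escaping regex metacharacters."""
    keywords = list(keywords)
    if not keywords:
        # An empty pattern would match every row, so reject it explicitly
        raise ValueError("keywords must not be empty")
    return "|".join(re.escape(k) for k in keywords)


def filter_by_keywords(df, column, keywords):
    """Keep only the rows of a Spark DataFrame whose column contains any keyword."""
    # Imported lazily so the pattern helper can be used without Spark installed
    from pyspark.sql import functions as F

    return df.where(F.col(column).rlike(build_keyword_pattern(keywords)))
```

With the DataFrame from the question, this could be called as filter_by_keywords(df, 'original_problem', ['cat', 'dog', 'frog', 'fleece']).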
Alternatively, you could use a udf:
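import pyspark.sql.functions as F
searchfor = ['cat', 'dog', 'frog', 'fleece']
# Keep the text when any keyword occurs in it as a substring; nulls count as no match
check_udf = F.udf(lambda x: x if x is not None and any(k in x for k in searchfor) else 'Not_present')
df = df.withColumn('check_presence', check_udf(F.col('original_problem')))
# Drop the rows flagged as not matching, then remove the helper column
df = df.filter(df.check_presence != 'Not_present').drop('check_presence')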
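But the DataFrame methods are preferred because they will be faster: they run inside Spark's JVM engine and can be optimized by it, whereas a Python udf has to pass every row to a Python worker process and back.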