This article explains how to split a pyspark dataframe column into just two columns (example below). It may be a useful reference for anyone facing the same problem.
Problem Description
The column contains multiple occurrences of the delimiter in a single row, so split is not straightforward.
Upon splitting, only the first delimiter occurrence has to be considered in this case.
As of now, I am doing this.
However, I feel there is a better solution.
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, expr

spark = SparkSession.builder.getOrCreate()
testdf = spark.createDataFrame([("Dog", "meat,bread,milk"), ("Cat", "mouse,fish")], ["Animal", "Food"])
testdf.show()
+------+---------------+
|Animal| Food|
+------+---------------+
| Dog|meat,bread,milk|
| Cat| mouse,fish|
+------+---------------+
testdf.withColumn("Food1", split(col("Food"), ",").getItem(0))\
.withColumn("Food2",expr("regexp_replace(Food, Food1, '')"))\
.withColumn("Food2",expr("substring(Food2, 2)")).show()
+------+---------------+-----+----------+
|Animal| Food|Food1| Food2|
+------+---------------+-----+----------+
| Dog|meat,bread,milk| meat|bread,milk|
| Cat| mouse,fish|mouse| fish|
+------+---------------+-----+----------+
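The three chained withColumn steps above can be mimicked in plain Python to see what each intermediate column holds (a minimal sketch; plain str methods stand in for the Spark SQL functions, and note that the real regexp_replace treats Food1 as a regex pattern, which this sketch does not reproduce):

```python
def split_first(food):
    # Step 1: first element before the first comma (split + getItem(0))
    food1 = food.split(",")[0]
    # Step 2: remove the first piece from the original string
    # (the Spark version uses regexp_replace; str.replace is used here)
    rest = food.replace(food1, "", 1)
    # Step 3: drop the leading comma (substring(Food2, 2))
    food2 = rest[1:]
    return food1, food2

print(split_first("meat,bread,milk"))  # ('meat', 'bread,milk')
print(split_first("mouse,fish"))       # ('mouse', 'fish')
```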
Recommended Answer
An approach using a regular expression to split only on the first occurrence of the delimiter:
import pyspark.sql.functions as f

testdf.withColumn('Food1', f.split('Food', "(?<=^[^,]*)\\,")[0])\
    .withColumn('Food2', f.split('Food', "(?<=^[^,]*)\\,")[1]).show()
+------+---------------+-----+----------+
|Animal| Food|Food1| Food2|
+------+---------------+-----+----------+
| Dog|meat,bread,milk| meat|bread,milk|
| Cat| mouse,fish|mouse| fish|
+------+---------------+-----+----------+
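The lookbehind pattern `(?<=^[^,]*),` matches only a comma preceded solely by non-comma characters from the start of the string, i.e. the first comma. Spark compiles it with Java's regex engine; Python's built-in re module rejects variable-length lookbehinds, but the same first-occurrence split can be checked in plain Python with maxsplit (a sketch of the expected behavior, not pyspark code):

```python
import re

def split_once(food):
    # break only at the first comma, like the Spark lookbehind split
    return re.split(",", food, maxsplit=1)

print(split_once("meat,bread,milk"))  # ['meat', 'bread,milk']
print(split_once("mouse,fish"))       # ['mouse', 'fish']
```

As an aside, in Spark 3.0+ f.split also accepts a limit argument, so `f.split('Food', ',', 2)` should produce the same two pieces without a lookbehind.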