Problem description
I am trying to split a column of a DataFrame in PySpark. This is the data I have:
df = sc.parallelize([[1, 'Foo|10'], [2, 'Bar|11'], [3,'Car|12']]).toDF(['Key', 'Value'])
df = df.withColumn('Splitted', split(df['Value'], '|')[0])
I am getting
+---+------+--------+
|Key| Value|Splitted|
+---+------+--------+
|  1|Foo|10|       F|
|  2|Bar|11|       B|
|  3|Car|12|       C|
+---+------+--------+
But I want
+---+-----+--------+
|Key|Value|Splitted|
+---+-----+--------+
|  1|   10|     Foo|
|  2|   11|     Bar|
|  3|   12|     Car|
+---+-----+--------+
Can anyone please point out what I am doing wrong?
What if I have a situation like this, where the value contains more than one separator?
df = sc.parallelize([[1, 'Foo|10|we'], [2, 'Bar|11|we'], [3,'Car|12|we']]).toDF(['Key', 'Value'])
+---+---------+
|Key|    Value|
+---+---------+
|  1|Foo|10|we|
|  2|Bar|11|we|
|  3|Car|12|we|
+---+---------+
Recommended answer
You forgot the escape character. The second argument of split is a regular expression, and | is a special character in a regular expression, so it has to be escaped:
from pyspark.sql.functions import split
df = df.withColumn('Splitted', split(df['Value'], r'\|')[0])
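Why the escape matters can be illustrated outside Spark. The sketch below uses Python's own re module purely as a stand-in: Spark's split evaluates a Java regular expression, so this only assumes that the two engines agree on the basic points shown here (an unescaped | is alternation, and both \| and [|] match a literal pipe). The sample string is taken from the question.

```python
import re

DELIMITER = "|"  # the literal separator used in the Value column

# Unescaped, "|" is an alternation between two empty patterns,
# so it matches the empty string and never a pipe character.
assert re.fullmatch("|", "") is not None
assert re.fullmatch("|", DELIMITER) is None

# Two equivalent ways to match a literal pipe.
escaped = r"\|"     # backslash escape, written as a raw string
char_class = "[|]"  # character class, no backslash needed

parts_escaped = re.split(escaped, "Foo|10|we")
parts_class = re.split(char_class, "Foo|10|we")

print(parts_escaped)
print(parts_class)
```

Writing the pattern as a raw string (r'\|') also avoids Python's warning about the invalid escape sequence '\|' in an ordinary string literal.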
If you want the output to be
+---+-----+--------+
|Key|Value|Splitted|
+---+-----+--------+
|1 |10 |Foo |
|2 |11 |Bar |
|3 |12 |Car |
+---+-----+--------+
you should do
from pyspark.sql import functions as F
df = (df.withColumn('Splitted', F.split(df['Value'], r'\|'))
        .withColumn('Value', F.col('Splitted')[1])
        .withColumn('Splitted', F.col('Splitted')[0]))