Problem description
Can someone tell me how to convert a list containing strings to a DataFrame in PySpark? I am using Python 3.6 with Spark 2.2.1. I have just started learning Spark, and my data looks like this:
my_data = [['apple', 'ball', 'ballon'], ['cat', 'camel', 'james'], ['none', 'focus', 'cake']]
Now I want to create a DataFrame as follows:
----------------------------------
| ID | words                     |
----------------------------------
| 1  | ['apple','ball','ballon'] |
| 2  | ['cat','camel','james']   |
----------------------------------
I also want to add an ID column, which is not part of the data itself.
Recommended answer
You can convert the list to a list of Row objects, then use spark.createDataFrame, which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID|               words|
+---+--------------------+
|  0|[apple, ball, bal...|
|  1| [cat, camel, james]|
|  2| [none, focus, cake]|
+---+--------------------+
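Note that enumerate starts counting at 0, while the table in the question starts its IDs at 1. If 1-based IDs matter, a variation along the following lines might help. This is only a sketch, not part of the original answer: the helper name build_dataframe is made up for illustration, it assumes an existing SparkSession is passed in as spark, and it declares the schema explicitly instead of relying on inference.

```python
my_data = [['apple', 'ball', 'ballon'],
           ['cat', 'camel', 'james'],
           ['none', 'focus', 'cake']]

# Pair each inner list with a 1-based ID, matching the table in the question.
rows = [(i, words) for i, words in enumerate(my_data, start=1)]


def build_dataframe(spark, rows):
    """Create a DataFrame with an explicit schema instead of inferring it.

    Hypothetical helper: `spark` is assumed to be an existing SparkSession.
    """
    # Imported here so the row-building step above runs without Spark installed.
    from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                                   StructField, StructType)

    schema = StructType([
        StructField('ID', IntegerType(), nullable=False),
        StructField('words', ArrayType(StringType()), nullable=True),
    ])
    return spark.createDataFrame(rows, schema)
```

With a running session, build_dataframe(spark, rows).show(truncate=False) should print the full word lists rather than cutting them off at 20 characters.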