Problem description
I want to make a new dataframe from a dictionary. The dictionary contains column names as keys and lists of columnar data as values. For example:
col_dict = {'col1': [1, 2, 3],
'col2': [4, 5, 6]}
I need this as a dataframe that looks like this:
+------+------+
| col1 | col2 |
+------+------+
|     1|     4|
|     2|     5|
|     3|     6|
+------+------+
It doesn't seem like there's an easy way to do this.
Recommended answer
The easiest way is to create a pandas DataFrame and convert it to a Spark DataFrame:
import pandas as pd

col_dict = {'col1': [1, 2, 3],
            'col2': [4, 5, 6]}
pandas_df = pd.DataFrame(col_dict)
# sqlCtx is assumed to be an existing SQLContext, as in the PySpark shell
df = sqlCtx.createDataFrame(pandas_df)
df.show()
#+----+----+
#|col1|col2|
#+----+----+
#|   1|   4|
#|   2|   5|
#|   3|   6|
#+----+----+
Without Pandas
If pandas is not available, you'll just have to manipulate your data into a form that works for the createDataFrame() function. Quoting myself from a previous answer:
colnames, data = zip(*col_dict.items())
print(colnames)
#('col2', 'col1')
print(data)
#([4, 5, 6], [1, 2, 3])
Now we need to modify data so that it's a list of tuples, where each tuple is one row holding the value from each corresponding column. Luckily, this is easy using zip:
data = list(zip(*data))  # list() so this also works on Python 3, where zip is lazy
print(data)
#[(4, 1), (5, 2), (6, 3)]
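One caveat with the zip step: zip stops at the shortest input, so columns of unequal length would be silently truncated instead of raising an error. Below is a minimal sketch of a guard against that. The helper name transpose_columns is hypothetical and not part of PySpark.

```python
def transpose_columns(columns):
    """Turn a sequence of equal-length columns into a list of row tuples."""
    columns = list(columns)
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        # zip() would silently drop the extra values, so fail loudly instead
        raise ValueError("columns have different lengths: %s" % sorted(lengths))
    return list(zip(*columns))


rows = transpose_columns([[4, 5, 6], [1, 2, 3]])
print(rows)
# [(4, 1), (5, 2), (6, 3)]
```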
Now call createDataFrame():
df = sqlCtx.createDataFrame(data, colnames)
df.show()
#+----+----+
#|col2|col1|
#+----+----+
#|   4|   1|
#|   5|   2|
#|   6|   3|
#+----+----+
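Note that the column order in the output above follows the dict's iteration order, which is arbitrary before Python 3.7. If a specific order matters, one option is to choose the column names explicitly and build the rows from them. This is only a sketch: dict_to_rows and its order parameter are hypothetical names, and falling back to sorted names is an assumption made here for determinism.

```python
def dict_to_rows(col_dict, order=None):
    """Return (colnames, rows) for a dict of column name -> list of values."""
    # Use the caller's order if given, otherwise fall back to sorted names
    colnames = list(order) if order is not None else sorted(col_dict)
    rows = list(zip(*(col_dict[name] for name in colnames)))
    return colnames, rows


col_dict = {'col1': [1, 2, 3],
            'col2': [4, 5, 6]}
colnames, rows = dict_to_rows(col_dict, order=['col1', 'col2'])
print(colnames)  # ['col1', 'col2']
print(rows)      # [(1, 4), (2, 5), (3, 6)]
```

With the same sqlCtx as in the answer, the result could then be passed on as sqlCtx.createDataFrame(rows, colnames).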