Question
When I began learning PySpark, I used a list to create a DataFrame. Now that inferring the schema from a list has been deprecated, I got a warning suggesting I use pyspark.sql.Row instead. However, when I try to create one using Row, I get a schema-inference error. This is my code:
>>> row = Row(name='Severin', age=33)
>>> df = spark.createDataFrame(row)
This results in the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/spark2-client/python/pyspark/sql/session.py", line 390, in _createFromLocal
    struct = self._inferSchemaFromList(data)
  File "/spark2-client/python/pyspark/sql/session.py", line 322, in _inferSchemaFromList
    schema = reduce(_merge_type, map(_infer_schema, data))
  File "/spark2-client/python/pyspark/sql/types.py", line 992, in _infer_schema
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'int'>
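The root cause: createDataFrame iterates over its first argument expecting a sequence of rows, but a single Row is itself a tuple subclass, so iterating it yields the individual field values rather than rows, and schema inference then fails on a bare int (in the traceback above Spark hit the int first; older Spark versions sort keyword fields alphabetically, so age precedes name). A minimal sketch of this behavior, using collections.namedtuple as a stand-in for pyspark.sql.Row so it runs without a Spark installation:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which is also a tuple subclass.
# (namedtuple is used here only so the sketch runs without pyspark.)
Row = namedtuple('Row', ['name', 'age'])
row = Row(name='Severin', age=33)

# Iterating a single Row yields its field values, not rows.
# This is what createDataFrame sees when handed a bare Row.
print(list(row))  # ['Severin', 33]
```

Passing a list of Rows instead of a single Row avoids the problem, which is exactly what the accepted answer below does.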
So I created a schema:
>>> schema = StructType([StructField('name', StringType()),
... StructField('age',IntegerType())])
>>> df = spark.createDataFrame(row, schema)
However, this error is thrown:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/spark2-client/python/pyspark/sql/session.py", line 387, in _createFromLocal
    data = list(data)
  File "/spark2-client/python/pyspark/sql/session.py", line 509, in prepare
    verify_func(obj, schema)
  File "/spark2-client/python/pyspark/sql/types.py", line 1366, in _verify_type
    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 33 in type <type 'int'>
Answer
The createDataFrame function takes a list of Rows (among other options) plus the schema, so the correct code would be something like:
from pyspark.sql.types import *
from pyspark.sql import Row
schema = StructType([StructField('name', StringType()), StructField('age',IntegerType())])
rows = [Row(name='Severin', age=33), Row(name='John', age=48)]
df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show()
Output:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
+-------+---+
| name|age|
+-------+---+
|Severin| 33|
| John| 48|
+-------+---+
In the pyspark docs (link) you can find more details about the createDataFrame function.