I am manually creating a PySpark DataFrame as follows:

from pyspark.sql.types import StructType, StructField, LongType, StringType

acdata = sc.parallelize([
[('timestamp', 1506340019), ('pk', 111), ('product_pk', 123), ('country_id', 'FR'), ('channel', 'web')]
])
# Convert to tuple
acdata_converted = acdata.map(lambda x: (x[0][1], x[1][1], x[2][1]))

# Define schema
acschema = StructType([
    StructField("timestamp", LongType(), True),
    StructField("pk", LongType(), True),
    StructField("product_pk", LongType(), True),
    StructField("country_id", StringType(), True),
    StructField("channel", StringType(), True)
])

df = sqlContext.createDataFrame(acdata_converted, acschema)


But when I call df.head() and run the script with spark-submit, I get the following error:

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hdfs/appcache/application_1510134261242_0002/container_1510134261242_0002_01_000003/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/mnt/yarn/usercache/hdfs/appcache/application_1510134261242_0002/container_1510134261242_0002_01_000003/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hdfs/appcache/application_1510134261242_0002/container_1510134261242_0002_01_000003/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/mnt/yarn/usercache/hdfs/appcache/application_1510134261242_0002/container_1510134261242_0002_01_000001/pyspark.zip/pyspark/sql/session.py", line 520, in prepare
  File "/mnt/yarn/usercache/hdfs/appcache/application_1510134261242_0002/container_1510134261242_0002_01_000003/pyspark.zip/pyspark/sql/types.py", line 1358, in _verify_type
    "length of fields (%d)" % (len(obj), len(dataType.fields)))
ValueError: Length of object (3) does not match with length of fields (12)

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


What does this mean, and how can I fix it?

Best answer

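The lambda passed to acdata.map emits only 3 values per row, while the schema declares 5 fields, so the row length check fails. You need to map all 5 fields so that each tuple matches the defined schema.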

    acdata_converted = acdata.map(lambda x: (x[0][1], x[1][1], x[2][1], x[3][1], x[4][1]))
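
To avoid hand-indexing each position, the conversion can also be driven by the list of field names. The following is only a minimal sketch in plain Python (no Spark required); `FIELD_NAMES` and `pairs_to_row` are hypothetical names introduced here for illustration, and the sketch assumes every record is a list of (name, value) pairs like the one in the question.

```python
# Field names in the same order as the StructType in the question.
# FIELD_NAMES and pairs_to_row are illustrative names, not part of PySpark.
FIELD_NAMES = ["timestamp", "pk", "product_pk", "country_id", "channel"]


def pairs_to_row(pairs, field_names=FIELD_NAMES):
    """Turn [(name, value), ...] into a tuple ordered like field_names."""
    lookup = dict(pairs)
    # Fail early, with a readable message, if a schema field is absent.
    missing = [name for name in field_names if name not in lookup]
    if missing:
        raise ValueError("record is missing fields: %s" % missing)
    return tuple(lookup[name] for name in field_names)


record = [('timestamp', 1506340019), ('pk', 111), ('product_pk', 123),
          ('country_id', 'FR'), ('channel', 'web')]
row = pairs_to_row(record)
print(row)  # (1506340019, 111, 123, 'FR', 'web')
```

Under those assumptions, `acdata.map(pairs_to_row)` would yield one 5-tuple per record, so the row length stays in step with the schema as long as `FIELD_NAMES` lists the same fields in the same order.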

Regarding "python - ValueError: Length of object (3) does not match with length of fields", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47177112/
