PySpark: How to convert a column with Ljava.lang.Object

Question

I created a data frame in PySpark by reading data from HDFS like this:

df = spark.read.parquet('path/to/parquet')

I expect the data frame to have two columns of strings:

+------------+------------------+
|my_column   |my_other_column   |
+------------+------------------+
|my_string_1 |my_other_string_1 |
|my_string_2 |my_other_string_2 |
|my_string_3 |my_other_string_3 |
|my_string_4 |my_other_string_4 |
|my_string_5 |my_other_string_5 |
|my_string_6 |my_other_string_6 |
|my_string_7 |my_other_string_7 |
|my_string_8 |my_other_string_8 |
+------------+------------------+

However, my_column contains strings starting with [Ljava.lang.Object;, like this:

>> df.show(truncate=False)
+-----------------------------+------------------+
|my_column                    |my_other_column   |
+-----------------------------+------------------+
|[Ljava.lang.Object;@7abeeeb6 |my_other_string_1 |
|[Ljava.lang.Object;@5c1bbb1c |my_other_string_2 |
|[Ljava.lang.Object;@6be335ee |my_other_string_3 |
|[Ljava.lang.Object;@153bdb33 |my_other_string_4 |
|[Ljava.lang.Object;@1a23b57f |my_other_string_5 |
|[Ljava.lang.Object;@3a101a1a |my_other_string_6 |
|[Ljava.lang.Object;@33846636 |my_other_string_7 |
|[Ljava.lang.Object;@521a0a3d |my_other_string_8 |
+-----------------------------+------------------+

>> df.printSchema()
root
 |-- my_column: string (nullable = true)
 |-- my_other_column: string (nullable = true)

As you can see, my_other_column looks as expected. Is there any way to convert the objects in my_column to human-readable strings?
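
For context, a value such as [Ljava.lang.Object;@7abeeeb6 has the shape the JVM produces by default when an Object[] array is printed: [L marks an array of objects, java.lang.Object is the element class, and the part after @ is a hash code in hexadecimal. This suggests that the stored string may be the printed form of an array rather than its contents, in which case the original values probably cannot be recovered from the string itself. The sketch below only illustrates that format in plain Python; parse_java_array_tostring is a hypothetical helper name, not part of PySpark.

```python
import re

# Shape of the JVM's default toString() for an object array:
# "[L" + element class name + ";" + "@" + hash code in hex.
JAVA_ARRAY_TOSTRING = re.compile(
    r"^\[L(?P<element_type>[\w.$]+);@(?P<hash>[0-9a-f]+)$"
)


def parse_java_array_tostring(value):
    """Return (element_type, hash) if value has the default array toString() shape, else None."""
    match = JAVA_ARRAY_TOSTRING.match(value)
    if match is None:
        return None
    return match.group("element_type"), match.group("hash")


print(parse_java_array_tostring("[Ljava.lang.Object;@7abeeeb6"))  # ('java.lang.Object', '7abeeeb6')
print(parse_java_array_tostring("my_other_string_1"))             # None
```

Since the schema reports my_column as a string, any such conversion presumably happened before the Parquet file was written, so the job that produced the file may be the place to look.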

Answer

Jaroslav,

I tried the following code with a sample parquet file from here and was able to get the desired output from the dataframe. Could you please check your code against the snippet below, and also against the sample file referred to above, to see whether there is some other issue:

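from pyspark.sql import SparkSession
# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("Read a Parquet file").getOrCreate()
# Read the sample Parquet file from a local Windows-style path
df = spark.read.parquet('E:\\...\\..\\userdata1.parquet')
# Print the first 10 rows, then the column names and types
df.show(10)
df.printSchema()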

Replace the path with your HDFS location.

Dataframe output for your reference:

