Problem description
SparkSession.createDataset() only accepts a List, RDD, or Seq; it does not support JavaPairRDD.
So if I have a JavaPairRDD<String, User> that I want to create a Dataset from, would a viable workaround for the SparkSession.createDataset() limitation be to create a wrapper UserMap class that contains two fields, a String and a User, and then call spark.createDataset(userMap, Encoders.bean(UserMap.class));?
Recommended answer
If you can convert the JavaPairRDD to a List<Tuple2<K, V>>, then you can use the createDataset method that takes a List. Note that collect() brings every pair to the driver, so this suits data that fits in driver memory. See the sample code below.
JavaPairRDD<String, User> pairRDD = ...;
// collect() returns the pairs to the driver as a List<Tuple2<String, User>>
Dataset<Row> df = spark.createDataset(pairRDD.collect(),
    Encoders.tuple(Encoders.STRING(), Encoders.bean(User.class))).toDF("key", "value");
Alternatively, you can convert the JavaPairRDD to an RDD:
// JavaPairRDD.toRDD exposes the underlying RDD<Tuple2<String, User>>
Dataset<Row> df = spark.createDataset(JavaPairRDD.toRDD(pairRDD), Encoders.tuple(Encoders.STRING(), Encoders.bean(User.class))).toDF("key", "value");
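The wrapper approach raised in the question should also work, provided UserMap follows JavaBean conventions (a public no-argument constructor plus getters and setters), which is what Encoders.bean relies on. Below is a minimal sketch of such a wrapper. The User fields (name, age) and the outer class name UserMapExample are assumptions made for illustration, since the question does not show them. The Spark calls appear only in comments so that the snippet compiles without Spark on the classpath.

```java
import java.io.Serializable;

public class UserMapExample {

    // Hypothetical stand-in for the question's User class; the fields are assumed.
    public static class User implements Serializable {
        private String name;
        private int age;

        public User() {}  // no-arg constructor, needed for bean-style instantiation

        public User(String name, int age) {
            this.name = name;
            this.age = age;
        }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // Wrapper bean holding one (key, user) pair from the JavaPairRDD.
    public static class UserMap implements Serializable {
        private String key;
        private User user;

        public UserMap() {}

        public UserMap(String key, User user) {
            this.key = key;
            this.user = user;
        }

        public String getKey() { return key; }
        public void setKey(String key) { this.key = key; }
        public User getUser() { return user; }
        public void setUser(User user) { this.user = user; }
    }

    public static void main(String[] args) {
        UserMap entry = new UserMap("u1", new User("Alice", 30));
        System.out.println(entry.getKey() + " -> " + entry.getUser().getName());

        // With Spark available, the pair RDD could be mapped and encoded roughly as:
        //   JavaRDD<UserMap> beans = pairRDD.map(t -> new UserMap(t._1(), t._2()));
        //   Dataset<UserMap> ds = spark.createDataset(beans.rdd(), Encoders.bean(UserMap.class));
    }
}
```

Compared with Encoders.tuple, the bean wrapper yields named columns (key, user) without a toDF call, at the cost of one extra class and a map step.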