I have a dataset consisting of 7-8 fields of type String, Int and Float.


I am trying to create the schema programmatically, like this:

val schema = StructType(header.split(",").map(column => StructField(column, StringType, true)))

I then map the data to Row objects like this:

val dataRdd = datafile.filter(x => x!=header).map(x => x.split(",")).map(col => Row(col(0).trim, col(1).toInt, col(2).toFloat, col(3), col(4) ,col(5), col(6), col(7), col(8)))

But after creating the DataFrame, calling DF.show() gives an error for the Integer field.


So how do I create a schema when the dataset contains multiple data types?


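The problem in your code is that the schema declares every field as StringType, while the Row you build contains Int and Float values (col(1).toInt, col(2).toFloat). The mismatch only surfaces when an action such as DF.show() actually evaluates the rows.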


If the header contains only the field names, the types cannot be derived from it, so they have to come from somewhere else.
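
When the header has names only, one option is to declare the types by hand. The following is a minimal sketch for spark-shell, not the asker's actual schema: the field names are hypothetical, and only the first three types (String, Int, Float) are taken from the Row mapping in the question.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical field names; the types follow the first three Row values in the question.
val typedSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("count", IntegerType, nullable = true),
  StructField("score", FloatType, nullable = true)
))

// Each Row value must have the type declared at the same position in the schema.
def toRow(cols: Array[String]): Row =
  Row(cols(0).trim, cols(1).trim.toInt, cols(2).trim.toFloat)

// With an existing RDD[Array[String]] named `parts` (an assumption), the DataFrame would be:
// val df = spark.createDataFrame(parts.map(toRow), typedSchema)
```

The key point is that the declared DataType and the runtime value type agree position by position; the remaining String columns would each get a StringType field.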


Let's assume instead that the header string also carries the type of each field, like this:

val header = "field1:Int,field2:Double,field3:String"


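import org.apache.spark.sql.types._

// Map the type tag after ":" to a Spark DataType; unknown tags fall back to StringType.
def inferType(field: String): DataType = field.split(":")(1) match {
   case "Int" => IntegerType
   case "Double" => DoubleType
   case "String" => StringType
   case _ => StringType
}

val schema = StructType(header.split(",").map(column => StructField(column, inferType(column), true)))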


For the example header string, you get:

 |-- field1:Int: integer (nullable = true)
 |-- field2:Double: double (nullable = true)
 |-- field3:String: string (nullable = true)
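
Note that the column names in this output still include the ":Type" suffix. If cleaner names are wanted, the name and the type tag can be separated before building each field. This is a sketch under the same assumed "name:Type" header format; `toField` and `cleanSchema` are illustrative names, not part of the original answer.

```scala
import org.apache.spark.sql.types._

val header = "field1:Int,field2:Double,field3:String"

// Split "name:Type" into a StructField named after the part before ":".
// A missing or unknown type tag falls back to StringType.
def toField(spec: String): StructField = {
  val parts = spec.split(":", 2).map(_.trim)
  val dataType: DataType = parts.lift(1) match {
    case Some("Int")    => IntegerType
    case Some("Double") => DoubleType
    case _              => StringType
  }
  StructField(parts(0), dataType, nullable = true)
}

val cleanSchema = StructType(header.split(",").map(toField))
```

Using `parts.lift(1)` instead of `parts(1)` also avoids an exception when a header entry has no type tag at all.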

On the other hand, if what you need is a DataFrame built from a text file, I would suggest creating the DataFrame directly from the file itself. There is no need to go through an RDD.

val fileReader = spark.read.format("com.databricks.spark.csv")
  .option("mode", "DROPMALFORMED")
  .option("header", "true")
  .option("inferschema", "true")
  .option("delimiter", ",")

val df = fileReader.load(PATH_TO_FILE)
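
If schema inference guesses a type you do not want, the reader also accepts an explicit schema through `.schema(...)`. The sketch below assumes Spark 2.x or later (where `spark.read.csv` is built in) and writes a small temporary file so that it is self-contained; the file contents and field names are made up for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// In spark-shell this returns the existing session.
val spark = SparkSession.builder().appName("csv-schema-sketch").master("local[*]").getOrCreate()

// Hypothetical sample data standing in for the real file.
val tmp = Files.createTempFile("sample", ".csv")
Files.write(tmp, "field1,field2,field3\n1,2.5,a\n2,3.5,b\n".getBytes("UTF-8"))

val explicitSchema = StructType(Seq(
  StructField("field1", IntegerType, nullable = true),
  StructField("field2", DoubleType, nullable = true),
  StructField("field3", StringType, nullable = true)
))

// Supplying the schema skips inference, so the file is not scanned an extra time.
val typedDf = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .schema(explicitSchema)
  .csv(tmp.toString)
```

This trades the convenience of `inferschema` for predictable column types, which tends to matter when a numeric column contains values that inference would read differently from file to file.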

