How to create a DataFrame from a text file in Spark

Problem Description

I have a text file on HDFS and I want to convert it to a DataFrame in Spark.

I am using the SparkContext to load the file and then try to generate individual columns from it.

val myFile = sc.textFile("file.txt")        // load the file as an RDD[String]
val myFile1 = myFile.map(x => x.split(";")) // split each line into fields

After doing this, I am trying the following operation:

myFile1.toDF()

I am getting an issue since the elements in the myFile1 RDD are now of array type.

How can I solve this issue?

Recommended Answer

Update - as of Spark 2.0, you can simply use the built-in csv data source (note that SparkSession itself was introduced in Spark 2.0):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate() // create the Spark session
val df = spark.read.csv("file.txt")

You can also use various options to control the CSV parsing, e.g.:

val df = spark.read.option("header", "false").csv("file.txt")
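
Since the file in the question is semicolon-delimited, here is a minimal sketch of putting those options together (assuming Spark 2.x's built-in csv options; "file.txt" is the path from the question):

import org.apache.spark.sql.SparkSession

// Minimal sketch: read the question's ';'-delimited file with the built-in
// csv source. "sep" sets the delimiter; "inferSchema" asks Spark to guess
// column types at the cost of an extra pass over the data.
val spark = SparkSession.builder().getOrCreate()
val df = spark.read
  .option("sep", ";")
  .option("inferSchema", "true")
  .csv("file.txt")

df.printSchema() // inspect the inferred columns (_c0, _c1, ... by default)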

For Spark versions before 2.0: the easiest way is to use spark-csv - include it in your dependencies and follow its README. It allows setting a custom delimiter (;), can read CSV headers (if you have them), and can infer the schema types (at the cost of an extra scan of the data).
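
For illustration, a sketch of that approach (assuming the com.databricks:spark-csv package is on the classpath and that sc is an existing SparkContext; option names follow the spark-csv README):

import org.apache.spark.sql.SQLContext

// Sketch only: needs spark-csv on the classpath, e.g. launched with
// --packages com.databricks:spark-csv_2.10:1.5.0
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ";")      // the question's file uses ';'
  .option("header", "false")     // assume no header row
  .option("inferSchema", "true") // infer types (extra data scan)
  .load("file.txt")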

Alternatively, if you know the schema, you can create a case class that represents it and map your RDD elements into instances of this class before converting to a DataFrame, e.g.:

// toDF() on an RDD needs the implicits in scope,
// e.g. import sqlContext.implicits._ (or spark.implicits._ on Spark 2.x)
case class Record(id: Int, name: String)

val myFile1 = myFile.map(x => x.split(";")).map {
  case Array(id, name) => Record(id.toInt, name)
}

myFile1.toDF() // DataFrame will have columns "id" and "name"
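
One caveat with the snippet above: the pattern match is partial, so any line that does not split into exactly two fields throws a scala.MatchError at runtime. A hedged variant that drops such lines instead (illustrative only, using flatMap over Option):

// Drop malformed lines instead of failing the job.
val records = myFile.map(_.split(";")).flatMap {
  case Array(id, name) => Some(Record(id.toInt, name))
  case _               => None // skip lines with the wrong field count
}

val df2 = records.toDF() // same "id" and "name" columns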

