Apache Spark with a pipe-delimited CSV file


I am very new to Apache Spark and am trying to use SchemaRDD with my pipe-delimited text file. I have a standalone installation of Spark 1.5.2 on my Mac, using Scala 2.10. I have a CSV file with the following representative data, and I am trying to split it into 4 different files based on the first value (column) of each record. I would very much appreciate any help with this.

3|36|CSPAN: Cable Satellite Public Affairs Network
3|278|CMT: Country Music Television
4|625363|1852400|Matlock|9212|The Divorce
4|625719|1852400|Matlock|16|The Rat Pack


Note: Your CSV file does not have the same number of fields in each row, so it cannot be parsed as is into a DataFrame. (SchemaRDD has been renamed to DataFrame.) If your CSV file were well-formed, here is what you could do:

Launch spark-shell or spark-submit with --packages com.databricks:spark-csv_2.10:1.3.0 in order to parse CSV files easily. In Scala, assuming your CSV file has a header (which makes it easier to refer to columns), your code would be:

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").option("inferSchema", "true")
  .option("delimiter", "|")  // option values are Strings, so "|" rather than '|'
  .load("/path/to/file.csv")
// assume 1st column has name col1
val df1 = df.filter( df("col1") === 1)  // 1st DataFrame
val df2 = df.filter( df("col1") === 2)  // 2nd DataFrame  etc...
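
Whether the file counts as well-formed can be checked before choosing an approach. The following is a minimal sketch in plain Scala (no Spark needed) that uses the four sample rows from the question; `sampleRows` and `fieldCountsByKey` are hypothetical names introduced here for illustration, not part of any Spark API.

```scala
// Sample rows copied from the question.
val sampleRows = Seq(
  "3|36|CSPAN: Cable Satellite Public Affairs Network",
  "3|278|CMT: Country Music Television",
  "4|625363|1852400|Matlock|9212|The Divorce",
  "4|625719|1852400|Matlock|16|The Rat Pack"
)

// Hypothetical helper: for each record type (the first field), collect the
// distinct field counts seen. The limit -1 keeps trailing empty fields.
def fieldCountsByKey(lines: Seq[String]): Map[String, Set[Int]] =
  lines
    .map(_.split("\\|", -1))
    .groupBy(_.head)
    .map { case (key, rows) => key -> rows.map(_.length).toSet }

val counts = fieldCountsByKey(sampleRows)

// One shared field count across all rows would mean a single table layout.
val wellFormed = counts.values.flatten.toSet.size == 1

println(counts)
println(s"well-formed: $wellFormed")
```

For the sample rows, the records starting with 3 and those starting with 4 have different field counts, which is why a single DataFrame schema does not fit.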


Since your file is not well-formed, you would have to parse each kind of line differently, for example:

import sqlContext.implicits._  // needed for .toDF (spark-shell imports this automatically)

val lines = sc.textFile("/path/to/file.csv")

case class RowRecord1(col1: Int, col2: Double, col3: String, col4: Int)
def parseRowRecord1(arr: Array[String]) = RowRecord1(arr(0).toInt, arr(1).toDouble, arr(2), arr(3).toInt)

case class RowRecord2(col1: Int, col2: String, col3: Int, col4: Int, col5: Int, col6: Double, col7: Int)
// Seven fields, so the valid indices are arr(0) to arr(6)
def parseRowRecord2(arr: Array[String]) = RowRecord2(arr(0).toInt, arr(1), arr(2).toInt, arr(3).toInt, arr(4).toInt, arr(5).toDouble, arr(6).toInt)

// Match the key together with the delimiter so that "1|" does not also match "10|..."
val df1 = lines.filter(_.startsWith("1|")).map(_.split('|')).map(arr => parseRowRecord1(arr)).toDF
val df2 = lines.filter(_.startsWith("2|")).map(_.split('|')).map(arr => parseRowRecord2(arr)).toDF
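
The same filter, split, and parse chain can be tried on the sample rows from the question without a cluster. This is a sketch in plain Scala under stated assumptions: the case classes `Type3` and `Type4` and their generic column names are hypothetical, shaped only after the field counts and apparent types in the sample, since the question does not say what the columns mean.

```scala
// Hypothetical record types, one per leading key in the sample data.
case class Type3(col1: Int, col2: Int, col3: String)
case class Type4(col1: Int, col2: Int, col3: Int, col4: String, col5: Int, col6: String)

def parseType3(arr: Array[String]): Type3 =
  Type3(arr(0).toInt, arr(1).toInt, arr(2))

def parseType4(arr: Array[String]): Type4 =
  Type4(arr(0).toInt, arr(1).toInt, arr(2).toInt, arr(3), arr(4).toInt, arr(5))

// Sample rows copied from the question.
val sampleLines = Seq(
  "3|36|CSPAN: Cable Satellite Public Affairs Network",
  "3|278|CMT: Country Music Television",
  "4|625363|1852400|Matlock|9212|The Divorce",
  "4|625719|1852400|Matlock|16|The Rat Pack"
)

// Filter on key plus delimiter, split on the pipe, then parse per record type.
val rows3 = sampleLines.filter(_.startsWith("3|")).map(_.split('|')).map(parseType3)
val rows4 = sampleLines.filter(_.startsWith("4|")).map(_.split('|')).map(parseType4)

rows3.foreach(println)
rows4.foreach(println)
```

With Spark, the same chain would presumably start from sc.textFile instead of the in-memory Seq, with .toDF appended as in the answer's code.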

