Problem description
I'm totally new to Apache Spark and Scala, and I'm having problems with mapping a .csv file into a key-value (like JSON) structure.
What I want to accomplish is to get the .csv file:
user, timestamp, event
ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:52:56,USER_PURCHASED
ad0e431a69cb3b445ddad7bb97f55665,2015-03-06 13:52:57,USER_SHARED
83b2d8a2c549fbab0713765532b63b54,2015-03-06 13:52:57,USER_SUBSCRIBED
ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:53:01,USER_ADDED_TO_PLAYLIST
...
into a structure like this:
ec79fcac8c76ebe505b76090f03350a2: [(2015-03-06 13:52:56,USER_PURCHASED), (2015-03-06 13:53:01,USER_ADDED_TO_PLAYLIST)]
ad0e431a69cb3b445ddad7bb97f55665: [(2015-03-06 13:52:57,USER_SHARED)]
83b2d8a2c549fbab0713765532b63b54: [(2015-03-06 13:52:57,USER_SUBSCRIBED)]
...
How can this be done if the file is read by:
val csv = sc.textFile("file.csv")
Thanks a lot for your help!
Recommended answer
Something like this:
case class MyClass(user: String, date: String, event: String)

def csvToMyClass(line: String) = {
  val split = line.split(',')
  // This is a good place to do validations
  // and convert strings to numbers, enums, UUIDs, etc.
  MyClass(split(0), split(1), split(2))
}

val csv = sc.textFile("file.csv")
  .map(csvToMyClass)
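As a quick sanity check (assuming the Spark shell, where sc is already defined), you can inspect a few parsed records. Note that the header row user, timestamp, event in the sample file would also be mapped unless it is filtered out first:

// Print a few parsed records; if the file contains a header line, filter it beforehand,
// e.g. sc.textFile("file.csv").filter(!_.startsWith("user,")).map(csvToMyClass)
csv.take(3).foreach(println)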
Of course, do a little more work to have more concrete data types on your class rather than just strings...
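For instance, here is a minimal sketch of that refinement; the case class name Event, the helper csvToEvent, and the date pattern are my own illustration based on the sample timestamps, not part of the original answer:

import java.sql.Timestamp
import java.text.SimpleDateFormat

// Variant of MyClass that parses the timestamp column instead of keeping it as a string.
case class Event(user: String, timestamp: Timestamp, event: String)

def csvToEvent(line: String): Event = {
  val fields = line.split(',')
  // The pattern matches the sample data, e.g. "2015-03-06 13:52:56".
  val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  Event(fields(0), new Timestamp(format.parse(fields(1)).getTime), fields(2))
}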
This is for reading the CSV file into a structure (which seems to be your main question). If you then need to merge all the data for a single user, you can instead map to a key/value tuple (String -> (String, String)) and use .aggregateByKey() to join all tuples for a user. Your aggregation function can then return whatever structure you want.
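A minimal sketch of that second step, assuming sc is the SparkContext from the question (e.g. the Spark shell); the RDD names byUser and grouped and the header filter are illustrative additions:

import org.apache.spark.SparkContext._   // pair-RDD implicits (only needed on Spark < 1.3)

// Map each line to user -> (timestamp, event).
val byUser = sc.textFile("file.csv")
  .filter(line => !line.startsWith("user,"))   // drop the header row from the sample
  .map(_.split(','))
  .map(fields => fields(0) -> (fields(1), fields(2)))

// aggregateByKey: start from an empty Vector per user, append each
// (timestamp, event) pair, and concatenate partial results across partitions.
val grouped = byUser.aggregateByKey(Vector.empty[(String, String)])(
  (acc, pair) => acc :+ pair,
  (left, right) => left ++ right
)

// e.g. ec79fcac8c76ebe505b76090f03350a2: Vector((2015-03-06 13:52:56,USER_PURCHASED), ...)
grouped.collect().foreach { case (user, events) => println(s"$user: $events") }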