I am currently learning Spark and have run into a problem: given two text files, find the books whose review text contains more than 100 words, and filter the results to show only the Horror category. Here is a sample of my two text files.

BookInformation.data:

In this data file I have 4 keys: username, price, categories, title. Each key has a value, and the keys are separated by commas. Some keys use string values while others use numeric values.

```
{"username": "JAMES250", "price": 19.20, "categories": "Horror", "title": "Friday the 13th"}
{"username": "Bro2KXA1", "price": 09.21, "categories": "Fantasy", "title": "Wizard of Oz"}
{"username": "LucyLu1272", "price": 18.69, "categories": "Fiction", "title": "Not Real"}
{"username": "6302049040", "price": 08.86, "categories": "Fantasy", "title": "Fantastic"}
...etc...
```

ReviewerText.data:

In this data file I have 5 keys: reviewerID, username, reviewText, overall, reviewTime. Each key has a value, and the keys are separated by commas. Some keys use string values while others use numeric values.

```
{"reviewerID": "A1R3P8MRFSN4X3", "username": "JAMES250", "reviewText": "Wow what a book blah blah… END", "overall": 4.0, "reviewTime": "08 9, 1997"}
{"reviewerID": "AVM91SKZ9M58T", " username ": " Bro2KXA1 ", "reviewText": "Different Blah Blah Blah Blah… So on… END", "overall": 5.0, "reviewTime": "08 10, 1997"}
{"reviewerID": "A1HC72VDRLANIW", " username ": "DiffUser09", "reviewText": "Another Review Blah Blah Blah Blah… So on… END", "overall": 1.0, "reviewTime": "08 19, 1997"}
{"reviewerID": "A2XBTS97FERY2Q", " username ": "MyNameIs01", "reviewText": "I love books. END", "overall": 5.0, "reviewTime": "08 23, 1997"}
...etc...
```

My goal is simple. First, I want to find every reviewText in ReviewerText.data that has more than 100 words. Once I have found every reviewText with more than 100 words, I want to sort the results by overall rating, from 5 down to 1, and print the corresponding title for each one. After that, I need to start the filtering over and filter categories in BookInformation.data down to just the Horror category, then compute the average number of words appearing in reviewText for the Horror category.

Code:

So far I am creating a Key:Value array for each line entry in each file. The goal here is to build an array from which I can look up any key and get back its value.

```scala
package main.scala

import org.apache.spark.{SparkConf, SparkContext}
import scala.io.StdIn.readLine
import scala.io.Source

object ReviewDataSpark {
  def main(args: Array[String]) {
    //Create a SparkContext to initialize Spark
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("Word Count")
    val sc = new SparkContext(conf)

    val metaDataFile = sc.textFile("/src/main/resources/BookInformation.data")
    val reviewDataFile = sc.textFile("/src/main/resources/ReviewText.data")

    reviewDataFile.flatMap { line => {
      val Array(label, rest) = line split ","
      println(Array)
      val items = rest.trim.split("\\s+")
      println(items)
      items.map(item => (label.trim -> item))
    } }

    metaDataFile.flatMap { line => {
      val Array(label, rest) = line split ","
      println(Array)
      val items = rest.trim.split("\\s+")
      println(items)
      items.map(item => (label.trim -> item))
    } }
  }
}
```

Problem:

The main issue with my code is that I don't think I am using flatMap correctly. I can't seem to spill the keys and values into a key array. All my code prints is:

```
Process finished with exit code 0
```

which doesn't seem right.

EDIT:

So I updated my code to use a JSON library.

```scala
val jsonColName = "json" // intermediate column name where we place each line of source data
val jsonCol = col(jsonColName) // its reusable ref

val metaDataSet = spark.read.textFile("src/main/resources/BookInformation.data")
  .toDF(jsonColName)
  .select(
    get_json_object(jsonCol, "$.username").alias("username"),
    get_json_object(jsonCol, "$.price").alias("price"),
    get_json_object(jsonCol, "$.categories").alias("categories"),
    get_json_object(jsonCol, "$.title").alias("title"))

val reviewDataSet = spark.read.textFile("src/main/resources/reviewText.data")
  .toDF(jsonColName)
  .select(
    get_json_object(jsonCol, "$.reviewerID").alias("reviewerID"),
    get_json_object(jsonCol, "$.username").alias("username"),
    get_json_object(jsonCol, "$.reviewText").alias("reviewText"),
    get_json_object(jsonCol, "$.overall").alias("overall").as[Double],
    get_json_object(jsonCol, "$.reviewTime").alias("reviewTime"))

reviewDataSet.show()
metaDataSet.show()
```
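Worth flagging before the join below: get_json_object extracts every field as a string, so numeric columns like overall and price presumably still need an explicit cast before any numeric sort or arithmetic. A minimal sketch of that (not from the original post), assuming the metaDataSet and reviewDataSet defined just above:

```scala
// Hypothetical follow-up: get_json_object returns string columns, so cast
// the numeric fields before using them in comparisons, sorts, or averages.
val reviewDataTyped = reviewDataSet.withColumn("overall", col("overall").cast("double"))
val metaDataTyped   = metaDataSet.withColumn("price", col("price").cast("double"))
```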
Then, thanks to that information, I was able to join the two:

```scala
val joinedDataSets = metaDataSet.join(reviewDataSet, Seq("username"))
joinedDataSets.show()
```

My next step now is to count the number of words inside the reviewText column of joinedDataSets and keep only the entries with more than 100 words. How do I filter on the JSON key reviewText, count the words of every entry, and remove the entries with fewer than 100 words?

Best answer

First of all, you need to load the data from the files in a structured way. Every line of a source file can be parsed as JSON, and the information should be placed into the corresponding columns. For example, to load and parse BookInformation.data:

```scala
import org.apache.spark.sql.functions._ // necessary for col, get_json_object functions and others below

val session = SparkSession.builder()
  .appName("My app")
  .master("local[*]")
  .getOrCreate()

val bookInfoFilePath = // path to BookInformation.data
val jsonColName = "json" // intermediate column name where we place each line of source data
val jsonCol = col(jsonColName) // its reusable ref

val bookInfoDf = session.read.textFile(bookInfoFilePath)
  .toDF(jsonColName)
  .select(
    get_json_object(jsonCol, "$.username").alias("username"),
    get_json_object(jsonCol, "$.price").alias("price"),
    get_json_object(jsonCol, "$.categories").alias("categories"),
    get_json_object(jsonCol, "$.title").alias("title"))
```

Now we have a book information DataFrame with properly structured data:

```
bookInfoDf.show()

+----------+-----+----------+---------------+
|  username|price|categories|          title|
+----------+-----+----------+---------------+
|  JAMES250| 19.2|    Horror|Friday the 13th|
|  Bro2KXA1| 9.21|   Fantasy|   Wizard of Oz|
|LucyLu1272|18.69|   Fiction|       Not Real|
|6302049040| 8.86|   Fantasy|      Fantastic|
+----------+-----+----------+---------------+
```

The answers to Q3 (the Horror-only filter) and Q4 (the average word count for Horror) now become apparent. For Q3:

```scala
val dfQuestion3 = bookInfoDf.where($"categories" === "Horror")
dfQuestion3.show()

+--------+-----+----------+---------------+
|username|price|categories|          title|
+--------+-----+----------+---------------+
|JAMES250| 19.2|    Horror|Friday the 13th|
+--------+-----+----------+---------------+
```

For Q4, you will have to join the DataFrame loaded from ReviewerText.data with bookInfoDf on the username column, and then aggregate with .agg, applying the avg and length functions to the reviewText column. ReviewerText.data can be loaded exactly by analogy with the loading of bookInfoDf above; after loading, the overall column should be converted to a number.

Update

> I have a question about how to count the number of words in a JSON key's value, for example in the key reviewText. I created DataFrames from BookInformation and ReviewerText and joined them into a single dataset. Now, if I want to walk through every reviewText, count the number of words, and then decide from the word count of the key's value whether to keep or drop the entry, how do I do that? I am trying to learn how to dig into the value.

One possible way of doing that is to count the words and store the count in a dedicated column:

```scala
// reviewerTextDf is the DataFrame with original data from ReviewerText.data
val dfWithReviewWordsCount = reviewerTextDf.withColumn("nb_words_review", size(split($"reviewText", "\\s+")))
dfWithReviewWordsCount.show()
```

which gives the following:

```
+--------------+--------+--------------------+-------+-----------+---------------+
|    reviewerID|username|          reviewText|overall| reviewTime|nb_words_review|
+--------------+--------+--------------------+-------+-----------+---------------+
|A1R3P8MRFSN4X3|JAMES250|Wow what a book b...|    4.0| 08 9, 1997|              7|
| AVM91SKZ9M58T|    null|Different Blah Bl...|    5.0|08 10, 1997|              8|
|A1HC72VDRLANIW|    null|Another Review Bl...|    1.0|08 19, 1997|              9|
|A2XBTS97FERY2Q|    null| I love books. END|    5.0|08 23, 1997|              4|
+--------------+--------+--------------------+-------+-----------+---------------+
```
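With the word counts in place, the remaining steps of the original question (keep reviews over 100 words, sort by overall from 5 down to 1, print each title) can be chained from the pieces already shown. A minimal sketch, assuming the dfWithReviewWordsCount and bookInfoDf DataFrames defined above:

```scala
import session.implicits._ // for the $"..." column syntax

// Keep only reviews with more than 100 words; cast overall (a string
// from get_json_object) to double so the sort is numeric, 5 down to 1.
val longReviews = dfWithReviewWordsCount
  .where($"nb_words_review" > 100)
  .withColumn("overall", $"overall".cast("double"))
  .orderBy($"overall".desc)

// Join on username to show the corresponding book title for each review.
longReviews
  .join(bookInfoDf, Seq("username"))
  .select("title", "overall", "nb_words_review")
  .show()
```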
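For Q4, the answer above suggests avg with length; since the question asks for the average number of words rather than characters, a sketch under the same assumptions could instead average the nb_words_review column after joining against the Horror-only books from dfQuestion3:

```scala
// Average number of review words for books in the Horror category.
// Assumes dfWithReviewWordsCount and dfQuestion3 (Horror-only) from above.
dfWithReviewWordsCount
  .join(dfQuestion3, Seq("username"))
  .agg(avg($"nb_words_review").alias("avg_review_words_horror"))
  .show()
```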