I am working with a Twitter dataset. The data is in JSON format, with the following schema:
root
|-- _id: string (nullable = true)
|-- created_at: timestamp (nullable = true)
|-- lang: string (nullable = true)
|-- place: struct (nullable = true)
| |-- bounding_box: struct (nullable = true)
| | |-- coordinates: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: array (containsNull = true)
| | | | | |-- element: double (containsNull = true)
| | |-- type: string (nullable = true)
| |-- country_code: string (nullable = true)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- place_type: string (nullable = true)
|-- retweeted_status: struct (nullable = true)
| |-- _id: string (nullable = true)
| |-- user: struct (nullable = true)
| | |-- followers_count: long (nullable = true)
| | |-- friends_count: long (nullable = true)
| | |-- id_str: string (nullable = true)
| | |-- lang: string (nullable = true)
| | |-- screen_name: string (nullable = true)
| | |-- statuses_count: long (nullable = true)
|-- text: string (nullable = true)
|-- user: struct (nullable = true)
| |-- followers_count: long (nullable = true)
| |-- friends_count: long (nullable = true)
| |-- id_str: string (nullable = true)
| |-- lang: string (nullable = true)
| |-- screen_name: string (nullable = true)
| |-- statuses_count: long (nullable = true)
The code I use to count hashtags is:
non_retweets = tweets.where("retweeted_status IS NULL")
hashtag = non_retweets.select('text').flatMap(lambda x: x.split(" ").filter(lambda x: x.startWith("#"))
hashtag = hashtag.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
hashtag.collect()
The error I get is:
File "<ipython-input-112-11fd8cbc056d>",line 4
hashtag = hashtag.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
^
SyntaxError: Invalid syntax
I can't figure out what my mistake is. Please help!
Best answer
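You forgot the closing ) after x.split(" "), so the flatMap( call is never closed and .filter ends up inside the lambda. Python only detects the unbalanced parenthesis when it reaches the next statement, which is why Invalid syntax is reported on the following line.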
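As a quick sanity check, the offending line can be parsed on its own with Python's built-in compile, without running Spark at all. This is only an illustrative sketch: the helper name is_valid_syntax and the placeholder variable df are hypothetical and not part of the original question, and the check covers syntax only, so it says nothing about runtime errors such as a misspelled method name.

```python
def is_valid_syntax(source):
    """Return True if `source` parses as Python, False on SyntaxError."""
    try:
        # compile() only parses the text; nothing is executed,
        # so the undefined name `df` is not a problem here.
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

# The shape of the line from the question: flatMap( is opened but never closed.
broken = 'hashtag = df.flatMap(lambda x: x.split(" ").filter(lambda x: x.startWith("#"))'
# The same line with the missing ")" added after split(" ").
fixed = 'hashtag = df.flatMap(lambda x: x.split(" ")).filter(lambda x: x.startWith("#"))'

print(is_valid_syntax(broken))  # False
print(is_valid_syntax(fixed))   # True
```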
The corrected code is:
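# .rdd yields Row objects, so read the text field explicitly; the str method is startswith, not startWith
hashtag = non_retweets.select('text').rdd.flatMap(lambda row: (row.text or "").split(" ")).filter(lambda w: w.startswith("#"))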
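To see what the pipeline should return before running it on a cluster, the same split, filter, and count steps can be mirrored in plain Python. This is a minimal sketch, not the Spark job itself: the sample tweets and the function name count_hashtags are made up for illustration, and collections.Counter stands in for the map and reduceByKey steps.

```python
from collections import Counter

def count_hashtags(texts):
    """Count space-separated tokens that start with '#'."""
    counts = Counter()
    for text in texts:
        if text is None:  # the schema marks `text` as nullable
            continue
        # Equivalent of flatMap + filter: split into words, keep only hashtags.
        counts.update(word for word in text.split(" ") if word.startswith("#"))
    return dict(counts)

# Hypothetical sample data, including a missing text value.
sample = ["loving #spark and #python", "more #spark today", None]
print(count_hashtags(sample))  # {'#spark': 2, '#python': 1}
```

Note that the Python string method is spelled startswith, all lowercase; startWith would pass the syntax check but fail at runtime with an AttributeError.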