Problem Description
First, the imaginary use case. Let's say I have a stream of tuples (user_id, time_stamp, login_ip). I want to maintain the last login IP of each user at a granularity of 5 seconds.

Using Spark Streaming, I can use the updateStateByKey method to maintain this mapping. The problem is that as data keeps streaming in, the state RDD of each time interval grows larger and larger because more user_ids are seen. After some time, the map becomes so large that updating it takes longer than the batch interval, and real-time delivery of the results can no longer be achieved.

Note that this is just a simple example I came up with to show the problem. Real problems could be more complicated and genuinely need real-time delivery.

Any ideas on how to solve this problem? (Solutions in Spark as well as in other systems would all be welcome.)
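For concreteness, here is a minimal sketch of the setup I mean, in Scala. It assumes a socket source emitting comma-separated user_id,time_stamp,login_ip lines; the host, port, and checkpoint path are placeholders, not part of the actual problem.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LastLoginIp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LastLoginIp")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second batches
    ssc.checkpoint("/tmp/last-login-checkpoint")      // required by updateStateByKey

    // Parse each line "user_id,time_stamp,login_ip" into (user_id, (time_stamp, login_ip)).
    val logins = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(f => (f(0), (f(1).toLong, f(2))))

    // Keep the latest (time_stamp, login_ip) per user; state grows with distinct user_ids.
    val lastLogin = logins.updateStateByKey[(Long, String)] {
      (newEvents: Seq[(Long, String)], state: Option[(Long, String)]) =>
        val candidates = state.toSeq ++ newEvents
        if (candidates.isEmpty) None else Some(candidates.maxBy(_._1))
    }

    lastLogin.print()
    ssc.start()
    ssc.awaitTermination()
  }
}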
Answer

You're not quite updating a Map yourself. The function you supply only updates the state associated with one key; Spark does the rest. In particular, it maintains a map-like RDD of key-state pairs for you -- really, a series of them, a DStream. So the storage and updating of the state are distributed like everything else. If the updates aren't fast enough, you can scale out by adding more workers.
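If the real worry is that state grows without bound as new user_ids keep arriving, note that updateStateByKey also lets the update function return None for a key, which removes that key from the state RDD entirely. Here is a sketch of evicting idle users this way, reusing the logins DStream from the question; the 24-hour retention window is a hypothetical knob, not anything Spark prescribes.

// Evict users who have been idle for more than maxIdleMs.
val maxIdleMs = 24 * 60 * 60 * 1000L

val prunedLastLogin = logins.updateStateByKey[(Long, String)] {
  (newEvents: Seq[(Long, String)], state: Option[(Long, String)]) =>
    // Latest event wins, as before.
    val latest = (state.toSeq ++ newEvents).sortBy(_._1).lastOption
    // Keep the entry only while it is recent enough; returning None drops the key.
    latest.filter { case (ts, _) => System.currentTimeMillis() - ts < maxIdleMs }
}

On Spark 1.6 and later, mapWithState with a StateSpec timeout gives you this eviction natively, and it only touches keys that received new data or timed out, so it tends to scale better than rewriting the whole state each batch.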