Question
Take word count as an example: the application starts and runs for a long time, and it receives the word "Spark", so the result table contains a row (Spark,1).
After the application has been running for a day or even a week, it receives "Spark" again, so the result table should contain a row (Spark,2).
I am using this scenario only to raise a question: how does the unbounded table keep the state of the data it has received, given that the state could become huge after the application has run for a long time?
Also, when using the "Complete" output mode, if the result table is very large, writing all of its data out to the sink will be very time-consuming.
Recommended answer
To avoid holding this huge amount of data in memory, Spark Structured Streaming uses watermarks. The main idea is to keep in memory only the data that falls within a specific time window. All data outside this window is stored in the file system. You can read about watermarks here or here.
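The watermark idea can be sketched in a few lines of plain Python. This is a toy model, not Spark's implementation: the class name `WindowedWordCount`, its parameters, and the integer timestamps are all hypothetical, and in this sketch expired state is simply discarded. It only shows why a watermark keeps state bounded: counts are kept per (window, word), and a window's entries are evicted once the watermark (largest event time seen minus an allowed delay) has passed the end of that window.

```python
from collections import defaultdict


class WindowedWordCount:
    """Toy model of windowed word counting with a watermark.

    Hypothetical illustration only; this is not Spark's implementation.
    """

    def __init__(self, window_seconds, delay_seconds):
        self.window_seconds = window_seconds
        self.delay_seconds = delay_seconds
        self.max_event_time = 0
        # (window_start, word) -> running count
        self.state = defaultdict(int)

    def _is_expired(self, window_start, watermark):
        # A window is expired once the watermark has reached its end.
        return window_start + self.window_seconds <= watermark

    def process_batch(self, events):
        """Process (event_time_seconds, word) pairs; return the new watermark."""
        events = list(events)
        # The watermark used for this batch comes from earlier batches.
        watermark = self.max_event_time - self.delay_seconds
        for event_time, word in events:
            window_start = event_time - event_time % self.window_seconds
            if self._is_expired(window_start, watermark):
                continue  # too late: its window was already evicted
            self.state[(window_start, word)] += 1
        # Advance the watermark, then evict every window it has passed.
        for event_time, _ in events:
            self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.delay_seconds
        for key in [k for k in self.state if self._is_expired(k[0], watermark)]:
            del self.state[key]
        return watermark


wc = WindowedWordCount(window_seconds=60, delay_seconds=120)
wc.process_batch([(10, "Spark")])
wc.process_batch([(30, "Spark"), (400, "Flink")])  # watermark moves to 280
wc.process_batch([(20, "Spark")])                  # arrives too late, ignored
print(dict(wc.state))
```

After the second batch the window holding both "Spark" events is evicted, so only the "Flink" entry remains in state, and the late "Spark" event in the third batch is ignored. In Spark itself, a watermark is declared with `withWatermark` on an event-time column before a windowed aggregation. As far as the Spark documentation describes it, the "Complete" output mode has to preserve all aggregate state, so a watermark does not bound state in that mode; the "Update" or "Append" modes are the ones that benefit.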