Question
Is it possible to apply Spark ML regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it's for the older RDD API and I couldn't use it with Structured Streaming sources.
- How should I apply regression on a Structured Streaming source?
- (A little off-topic) If I can't use the streaming API for regression, how can I commit offsets to the source in a batch manner? (Kafka sink)
Answer
Today (Spark 2.2 / 2.3) there is no support for machine learning in Structured Streaming, and there is no ongoing work in this direction. Please follow SPARK-16424 to track future progress.
You can, however:
- Train iterative, non-distributed models using a foreach sink and some form of external state storage. At a high level, a regression model could be implemented like this:
  - Fetch the latest model when ForeachWriter.open is called, and initialize a loss accumulator (not in the Spark sense, just a local variable) for the partition.
  - Compute the loss for each record in ForeachWriter.process and update the accumulator.
  - Push the losses to the external store when ForeachWriter.close is called.
  - This would leave the external store in charge of computing the gradient and updating the model, with the implementation dependent on the store.
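The steps above can be sketched without a running Spark cluster. The class below mimics the open/process/close contract of Spark's ForeachWriter for a squared-loss linear regression; `ModelStore` is a hypothetical stand-in for the external state store (e.g. Redis or a database), which, as described above, owns the gradient step. In a real job you would pass such a writer to `df.writeStream.foreach(...)`.

```python
class ModelStore:
    """Hypothetical external store: holds the weights and applies updates."""
    def __init__(self, dim, lr=0.05):
        self.weights = [0.0] * dim
        self.lr = lr

    def fetch(self):
        # Return a copy of the latest model.
        return list(self.weights)

    def push(self, grad_sum, count):
        # The store computes the update: average gradient, one SGD step.
        if count:
            self.weights = [w - self.lr * g / count
                            for w, g in zip(self.weights, grad_sum)]

class RegressionWriter:
    """Mimics ForeachWriter.open/process/close for one partition."""
    def __init__(self, store):
        self.store = store

    def open(self, partition_id, epoch_id):
        self.w = self.store.fetch()        # fetch latest model on open
        self.grad = [0.0] * len(self.w)    # local accumulator, per partition
        self.count = 0
        return True

    def process(self, row):
        x, y = row                         # row = (features tuple, label)
        err = sum(wi * xi for wi, xi in zip(self.w, x)) - y
        for i, xi in enumerate(x):
            self.grad[i] += 2 * err * xi   # gradient of squared loss
        self.count += 1

    def close(self, error=None):
        self.store.push(self.grad, self.count)  # push to external store

# Simulate a stream: each loop iteration plays one micro-batch.
store = ModelStore(dim=1)
batch = [((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 6.0)]  # y = 2x
for _ in range(30):
    writer = RegressionWriter(store)
    writer.open(partition_id=0, epoch_id=0)
    for row in batch:
        writer.process(row)
    writer.close()
```

With the toy data above the stored weight converges toward 2.0, illustrating that the model lives entirely in the external store while each micro-batch only accumulates and pushes gradients.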
- Try to hack SQL queries (see https://github.com/holdenk/spark-structured-streaming-ml by Holden Karau).