Question
Is it possible to apply Spark ML regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it's for the older RDD API and I couldn't use it with Structured Streaming sources.
- How should I apply regression on a Structured Streaming source?
- (A little off-topic) If I can't use the streaming API for regression, how can I commit offsets to the source in a batch manner? (Kafka sink)
Answer
Today (Spark 2.2 / 2.3) there is no support for machine learning in Structured Streaming, and there is no ongoing work in this direction. Please follow SPARK-16424 to track future progress.
You can, however:
- Train iterative, non-distributed models using a foreach sink and some form of external state storage. At a high level, a regression model could be implemented like this:
  - Fetch the latest model when ForeachWriter.open is called, and initialize a loss accumulator (not in the Spark sense, just a local variable) for the partition.
  - Compute the loss for each record in ForeachWriter.process and update the accumulator.
  - Push the losses to the external store when ForeachWriter.close is called.
  - This would leave the external store in charge of computing the gradient and updating the model, with the implementation dependent on the store.
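The steps above can be sketched without a running Spark cluster. The class below mimics the open/process/close contract of Spark's ForeachWriter for a squared-loss linear regression; `ModelStore` is a hypothetical stand-in for the external state store (e.g. Redis or a database), which, as described above, owns the gradient step. In a real job you would pass such a writer to `df.writeStream.foreach(...)`.

```python
class ModelStore:
    """Hypothetical external store: holds the weights and applies updates."""
    def __init__(self, dim, lr=0.05):
        self.weights = [0.0] * dim
        self.lr = lr

    def fetch(self):
        # Return a copy of the latest model.
        return list(self.weights)

    def push(self, grad_sum, count):
        # The store computes the update: average gradient, one SGD step.
        if count:
            self.weights = [w - self.lr * g / count
                            for w, g in zip(self.weights, grad_sum)]

class RegressionWriter:
    """Mimics ForeachWriter.open/process/close for one partition."""
    def __init__(self, store):
        self.store = store

    def open(self, partition_id, epoch_id):
        self.w = self.store.fetch()        # fetch latest model on open
        self.grad = [0.0] * len(self.w)    # local accumulator, per partition
        self.count = 0
        return True

    def process(self, row):
        x, y = row                         # row = (features tuple, label)
        err = sum(wi * xi for wi, xi in zip(self.w, x)) - y
        for i, xi in enumerate(x):
            self.grad[i] += 2 * err * xi   # gradient of squared loss
        self.count += 1

    def close(self, error=None):
        self.store.push(self.grad, self.count)  # push to external store

# Simulate a stream: each loop iteration plays one micro-batch.
store = ModelStore(dim=1)
batch = [((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 6.0)]  # y = 2x
for _ in range(30):
    writer = RegressionWriter(store)
    writer.open(partition_id=0, epoch_id=0)
    for row in batch:
        writer.process(row)
    writer.close()
```

With the toy data above the stored weight converges toward 2.0, illustrating that the model lives entirely in the external store while each micro-batch only accumulates and pushes gradients.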
- Try to hack SQL queries (see https://github.com/holdenk/spark-structured-streaming-ml by Holden Karau).