Spark Structured Streaming and Spark-ML Regression


Question


Is it possible to apply Spark-ML regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it's for the older RDD API and I couldn't use it with Structured Streaming sources.

  1. How should I apply regression to a Structured Streaming source?
  2. (Slightly off-topic) If I can't use the streaming API for regression, how can I commit offsets to the source in a batch manner? (Kafka sink)

Answer


Today (Spark 2.2 / 2.3) there is no support for machine learning in Structured Streaming and there is no ongoing work in this direction. Please follow SPARK-16424 to track future progress.

You can, however:

  • Train iterative, non-distributed models using a foreach sink and some form of external state storage. At a high level, a regression model could be implemented like this:

  • Fetch the latest model when ForeachWriter.open is called, and initialize a loss accumulator (not in the Spark sense, just a local variable) for the partition.
  • Compute the loss for each record in ForeachWriter.process and update the accumulator.
  • Push the losses to the external store when ForeachWriter.close is called.
  • This would leave the external storage in charge of computing gradients and updating the model, with the implementation dependent on the store.
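The open/process/close cycle described above could be sketched like this. This is a minimal, self-contained Python mock, not Spark code: ExternalStore and PartitionWriter are hypothetical stand-ins (a real implementation would extend Spark's ForeachWriter and talk to an actual external store such as Redis), and logistic-regression gradients are used only as a concrete example of "computing the loss".

```python
# Hypothetical sketch of the foreach-sink training loop described above.
# ExternalStore stands in for the external state store; PartitionWriter
# mimics the ForeachWriter.open/process/close lifecycle for one partition.
import math

class ExternalStore:
    """Owns the model; applies updates pushed by partition writers."""
    def __init__(self, n_features, lr=0.1):
        self.weights = [0.0] * n_features
        self.lr = lr

    def fetch_model(self):
        # what ForeachWriter.open would read
        return list(self.weights)

    def push_gradient(self, grad, count):
        # what ForeachWriter.close would trigger: apply averaged gradient
        for i, g in enumerate(grad):
            self.weights[i] -= self.lr * g / count

class PartitionWriter:
    """Mimics ForeachWriter.open / process / close."""
    def __init__(self, store):
        self.store = store

    def open(self):
        self.model = self.store.fetch_model()
        self.grad = [0.0] * len(self.model)  # local accumulator
        self.count = 0

    def process(self, features, label):
        # logistic-regression loss gradient for a single record
        z = sum(w * x for w, x in zip(self.model, features))
        pred = 1.0 / (1.0 + math.exp(-z))
        for i, x in enumerate(features):
            self.grad[i] += (pred - label) * x
        self.count += 1

    def close(self):
        if self.count:
            self.store.push_gradient(self.grad, self.count)

store = ExternalStore(n_features=2)
for _ in range(200):              # simulate repeated micro-batches
    writer = PartitionWriter(store)
    writer.open()
    writer.process([1.0, 0.0], 1)  # toy training records
    writer.process([0.0, 1.0], 0)
    writer.close()

print(store.weights[0] > 0 and store.weights[1] < 0)  # prints True
```

Note that only gradient accumulation happens inside the writer; the model update itself is the store's responsibility, which is exactly the division of labor the bullet points describe.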


  • Try to hack SQL queries (see https://github.com/holdenk/spark-structured-streaming-ml by Holden Karau).
