Can Spark and the ScalaNLP library Breeze be used together?

Question

I'm developing a Scala-based extreme learning machine in Apache Spark. My model has to be a Spark Estimator and use the Spark framework so that it fits into a machine learning pipeline. Does anyone know whether Breeze can be used in tandem with Spark? All of my data is in Spark DataFrames; conceivably I could import it into Breeze, use Breeze DenseVectors as the data structure, and then convert back to a DataFrame for the Estimator part. The advantage of Breeze is that it has a function pinv for the Moore-Penrose pseudo-inverse, which generalizes the matrix inverse to non-square matrices. As far as I can see, there is no equivalent function in Spark MLlib. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames, so if anyone has experience with this it would be really useful. Thanks!

Recommended answer

    • Breeze can be used with Spark. In fact, it is used internally by many MLlib functions, but the required conversions are not exposed as public API. You can add your own conversions and use Breeze to process individual records.

      For example, the conversion code for Vectors can be found in the Spark source.

      For Matrices, see Matrices.scala.
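Since the internal conversions are not public, one option is to write your own on top of the public API. The following is a minimal sketch, assuming Spark 2.x or later with the `org.apache.spark.ml.linalg` types and Breeze on the classpath. The object `BreezeConversions` and its method names are hypothetical helpers for illustration, not part of Spark or Breeze.

```scala
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.ml.linalg.{DenseMatrix, Matrix, Vector, Vectors}

// Hypothetical helper object: names are illustrative, not a Spark or Breeze API.
object BreezeConversions {

  // Spark Vector -> Breeze DenseVector. Sparse vectors are densified.
  // The array is cloned so the Breeze vector never shares storage with Spark.
  def toBreeze(v: Vector): BDV[Double] = new BDV[Double](v.toArray.clone())

  // Breeze DenseVector -> Spark dense Vector (toArray returns a copy).
  def fromBreeze(bv: BDV[Double]): Vector = Vectors.dense(bv.toArray)

  // Spark Matrix -> Breeze DenseMatrix. Both sides use column-major layout.
  def toBreeze(m: Matrix): BDM[Double] =
    new BDM[Double](m.numRows, m.numCols, m.toArray)

  // Breeze DenseMatrix -> Spark dense Matrix.
  def fromBreeze(bm: BDM[Double]): Matrix =
    new DenseMatrix(bm.rows, bm.cols, bm.toArray)
}
```

These helpers copy data on every call, which is acceptable for per-record processing (for example inside a UDF or `map`) but is not meant for large matrices.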

      It cannot, however, be used on distributed data structures. Breeze objects rely on low-level libraries that cannot be used for distributed processing. Therefore, conversions between DataFrames and Breeze objects are possible only if you collect the data to the driver, and they are limited to scenarios where the data fits in driver memory.
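The driver-side approach described above might look like the following sketch. It assumes the DataFrame has a column of `org.apache.spark.ml.linalg.Vector` values and is small enough to collect. The object `DriverSidePinv`, its method names, and the `featuresCol` parameter are hypothetical names chosen for this example.

```scala
import breeze.linalg.{pinv, DenseMatrix => BDM}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame

// Hypothetical helper object: names are illustrative only.
object DriverSidePinv {

  // Build an n x d Breeze matrix from n local rows of length d.
  def toMatrix(rows: Array[Array[Double]]): BDM[Double] = {
    require(rows.nonEmpty, "no rows to convert")
    BDM.tabulate(rows.length, rows.head.length)((i, j) => rows(i)(j))
  }

  // Collect the vector column to the driver and compute the Moore-Penrose
  // pseudo-inverse locally. Assumption: the whole column fits in driver memory.
  def pseudoInverse(df: DataFrame, featuresCol: String): BDM[Double] = {
    val rows = df.select(featuresCol).collect().map(_.getAs[Vector](0).toArray)
    pinv(toMatrix(rows))
  }
}
```

Because `collect()` moves every row to the driver, this only suits small matrices, such as the hidden-layer output of a small extreme learning machine. Larger inputs would need a distributed formulation instead.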

      Other libraries, such as SystemML, integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.
