Question
I'm developing a Scala-based extreme learning machine in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into a machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark DataFrames, and conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure, then convert back to a DataFrame for the Estimator part. The advantage of Breeze is that it has a function, pinv, for the Moore-Penrose pseudo-inverse, which generalizes the inverse to non-square matrices. There is no equivalent function in Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames, so if anyone has experience of this it would be really useful. Thanks!
Recommended answer
Breeze can be used with Spark. In fact, it is used internally by many MLlib functions, but the required conversions are not exposed as public. You can add your own conversions and use Breeze to process individual records.

For example, the conversion code for Vectors can be found in the Spark source, and for Matrices see Matrices.scala.

It cannot, however, be used on distributed data structures. Breeze objects use low-level libraries, which cannot be used for distributed processing. Therefore, DataFrame-to-Breeze conversions are possible only if you collect the data to the driver, and they are limited to scenarios where the data fits in driver memory.
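As a rough illustration of the points above, here is a minimal sketch of hand-written conversions between Spark ML vectors and Breeze, followed by a driver-side pseudo-inverse. The object BreezeInterop and all of its method names are hypothetical helpers made up for this example; they are not part of Spark or Breeze. The sketch assumes the column holds org.apache.spark.ml.linalg.Vector values of equal length and that the collected data fits in driver memory.

```scala
import breeze.linalg.{pinv, DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame

// Hypothetical helper object: these names are illustrative, not Spark or Breeze APIs.
object BreezeInterop {

  // Spark ML Vector -> Breeze DenseVector.
  // The array is cloned so the Breeze vector never shares storage with the Spark vector.
  def toBreeze(v: Vector): BDV[Double] = new BDV[Double](v.toArray.clone())

  // Breeze DenseVector -> Spark ML dense Vector (toArray returns a fresh copy).
  def fromBreeze(bv: BDV[Double]): Vector = Vectors.dense(bv.toArray)

  // Stack local row vectors into a Breeze matrix, one row per vector.
  // Assumes every vector has the same size as the first one.
  def rowsToMatrix(rows: Array[Vector]): BDM[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val numCols = rows.head.size
    BDM.tabulate(rows.length, numCols)((i, j) => rows(i)(j))
  }

  // Collect one vector column to the driver and build a local Breeze matrix.
  // This pulls the entire column into driver memory, so it only suits small data.
  def collectColumn(df: DataFrame, colName: String): BDM[Double] =
    rowsToMatrix(df.select(colName).collect().map(_.getAs[Vector](0)))

  // Moore-Penrose pseudo-inverse, computed locally by Breeze.
  def pseudoInverse(m: BDM[Double]): BDM[Double] = pinv(m)
}
```

Inside an Estimator's fit method, something like BreezeInterop.pseudoInverse(BreezeInterop.collectColumn(df, "features")) would then run entirely on the driver; the column name "features" is only an assumption here.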
Other libraries, such as SystemML, integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.