Problem Description
I was applying some Machine Learning algorithms like Linear Regression, Logistic Regression, and Naive Bayes to some data, but I was trying to avoid RDDs and start using DataFrames instead, because under PySpark RDDs are slower than DataFrames (see pic 1).
The other reason I am using DataFrames is that the ml library has a very useful class for tuning models, CrossValidator (http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=crossvalidator#pyspark.ml.tuning.CrossValidator). This class returns a model after fitting it; obviously it has to test several scenarios, and after that it returns the fitted model (with the best combination of parameters).
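For reference, a minimal sketch of typical CrossValidator usage; the estimator, parameter grid, evaluator, and train_df are illustrative assumptions:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .build())
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

cv_model = cv.fit(train_df)      # tries every parameter combination
best_model = cv_model.bestModel  # fitted model with the best parameters
```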
The cluster I use isn't that large and the data is pretty big, so some fits take hours. I want to save these models to reuse them later, but I haven't figured out how; is there something I am missing?
Notes:
- The mllib model classes have a save method (e.g. NaiveBayesModel, http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html?highlight=mllib#pyspark.mllib.classification.NaiveBayesModel.save), but mllib has no CrossValidator and uses RDDs, so I am avoiding it deliberately.
- The current version is Spark 1.5.1.
Recommended Answer
Spark >= 1.6

Since Spark 1.6 it's possible to save your models using the save method, because almost every model implements the MLWritable interface. For example, LinearRegressionModel has it, and therefore it's possible to save a model to the desired path.
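A minimal sketch, assuming a fitted LinearRegressionModel named model (the path is illustrative, and in older releases the Python persistence API may not be available):

```python
from pyspark.ml.regression import LinearRegressionModel

model.save("/tmp/linear-regression-model")                           # persist to disk
loaded = LinearRegressionModel.load("/tmp/linear-regression-model")  # reuse later
```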
I believe you're making incorrect assumptions here.
Some operations on DataFrames can be optimized, and that translates to improved performance compared to plain RDDs. DataFrames provide efficient caching, and the SQL-like API is arguably easier to comprehend than the RDD API.
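For example (illustrative only; df and its columns are assumptions):

```python
df.cache()  # DataFrames cache in an efficient columnar format

# declarative DataFrame code the optimizer can rearrange ...
adults = df.select("name", "age").where(df.age >= 18)

# ... versus roughly equivalent RDD code, opaque to the optimizer:
adults_rdd = (df.rdd
              .filter(lambda row: row.age >= 18)
              .map(lambda row: (row.name, row.age)))
```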
ML Pipelines are extremely useful, and tools like the cross-validator or the different evaluators are simply must-haves in any machine learning pipeline. Even if none of the above is particularly hard to implement on top of the low-level MLlib API, it is much better to have a ready-to-use, universal, and relatively well-tested solution.
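As a hedged sketch of what such a pipeline looks like (the stages and column names are assumptions):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

stages = [
    StringIndexer(inputCol="category", outputCol="label"),         # encode target
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"), # feature vector
    LogisticRegression(maxIter=10),                                # estimator
]
model = Pipeline(stages=stages).fit(train_df)  # train_df is illustrative
predictions = model.transform(test_df)
```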
So far so good, but there are a few problems:
- As far as I can tell, simple operations on DataFrames like select or withColumn show performance similar to their RDD equivalents like map.
- In some cases the growing number of columns in a typical pipeline can actually degrade performance compared to well-tuned low-level transformations. You can of course add drop-column transformers on the way to correct for that (a sketch of such a transformer follows this list).
- Many ML algorithms, including ml.classification.NaiveBayes, are simply wrappers around their mllib counterparts.
- PySpark ML/MLlib algorithms delegate the actual processing to their Scala counterparts.
- Last but not least, the RDD is still out there, even if well hidden behind the DataFrame API.
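Such a drop-column transformer can be as small as the following sketch; ColumnDropper is a hypothetical helper, not a built-in class:

```python
from pyspark.ml import Transformer

class ColumnDropper(Transformer):
    """Hypothetical pipeline stage that drops intermediate columns."""
    def __init__(self, cols):
        super(ColumnDropper, self).__init__()
        self.cols = cols

    def _transform(self, df):
        out = df
        for c in self.cols:  # drop one column per call for compatibility
            out = out.drop(c)
        return out

# usage: ColumnDropper(["raw_text", "tokens"]).transform(df)
```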
I believe that at the end of the day what you get by using ML over MLlib is a quite elegant, high-level API. One thing you can do is combine both and create a custom multi-step pipeline:
- use ML to load, clean and transform the data,
- extract the required data (see for example the extractLabeledPoints method: https://github.com/apache/spark/blob/098be27ad53c485ee2fc7f5871c47f899020e87b/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L123) and pass it to the MLlib algorithm,
- add custom cross-validation / evaluation,
- save the MLlib model using a method of your choice (Spark model or PMML); a sketch of the whole flow follows this list.
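A hedged end-to-end sketch of this hybrid approach; prepared_df, the column names, and the save path are all illustrative assumptions:

```python
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.regression import LabeledPoint

# 1. prepared_df is the output of an ML Pipeline (loading/cleaning/features)
labeled = (prepared_df
           .select("label", "features")
           .rdd.map(lambda row: LabeledPoint(row.label, row.features)))

# 2. hand the RDD to the low-level MLlib algorithm
nb_model = NaiveBayes.train(labeled, lambda_=1.0)

# 3. persist with MLlib's save and reload in a later job
nb_model.save(sc, "/tmp/naive-bayes-model")
restored = NaiveBayesModel.load(sc, "/tmp/naive-bayes-model")
```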
It is not an optimal solution, but it is the best one I can think of given the current API.