Problem Description
I have a LIBSVM scaling model (generated with svm-scale) that I would like to port over to PySpark. I've naively tried the following:
scaler_path = "path to model"
a = MinMaxScaler().load(scaler_path)
But I'm thrown an error, expecting a metadata directory:
Py4JJavaErrorTraceback (most recent call last)
<ipython-input-22-1942e7522174> in <module>()
----> 1 a = MinMaxScaler().load(scaler_path)
/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/util.pyc in load(cls, path)
226 def load(cls, path):
227 """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 228 return cls.read().load(path)
229
230
/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/util.pyc in load(self, path)
174 if not isinstance(path, basestring):
175 raise TypeError("path should be a basestring, got type %s" % type(path))
--> 176 java_obj = self._jread.load(path)
177 if not hasattr(self._clazz, "_from_java"):
178 raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.pyc in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/local/lib/python2.7/dist-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o321.load.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:[filename]/metadata
Is there a simple work-around for loading this? The format of the LIBSVM model is
x
0 1
1 -1050 1030
2 0 1
3 0 3
4 0 1
5 0 1
Recommended Answer
First, the file presented isn't in libsvm format. The correct format of a libsvm file is the following:
<label> <index1>:<value1> <index2>:<value2> ... <indexN>:<valueN>
Thus your data preparation is incorrect to start with.
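(As an aside, not part of the original answer: the file shown in the question is an svm-scale range file rather than libsvm data, so if the goal is to reuse those ranges directly, one option is to parse them by hand. Below is a minimal sketch; the function name is illustrative, and it assumes exactly the layout shown in the question: an "x" header, a "lower upper" line, then one "index min max" line per feature.)

# Hypothetical helper (not from the answer): read an svm-scale "-s" range file
def parse_svm_scale_ranges(path):
    with open(path) as f:
        assert f.readline().strip() == "x"               # x-scaling section marker
        lower, upper = map(float, f.readline().split())  # target output range
        ranges = {}                                      # feature index -> (min, max)
        for line in f:
            if not line.strip():
                continue
            idx, fmin, fmax = line.split()
            ranges[int(idx)] = (float(fmin), float(fmax))
    return lower, upper, ranges

Each feature value x with range (fmin, fmax) would then be rescaled as lower + (upper - lower) * (x - fmin) / (fmax - fmin), which is the same linear rescaling svm-scale applies.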
Secondly, the class method load(path) that you are using with MinMaxScaler reads an ML instance from the input path.
Remember that MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel. The model can then transform each feature individually such that it is in the given range.
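Concretely, the rescaling the fitted model applies to each feature is the standard min-max formula; here is a plain-Python sketch of it (the function name is illustrative):

def min_max_rescale(e, e_min, e_max, out_min=0.0, out_max=1.0):
    # Rescaled(e) = (e - E_min) / (E_max - E_min) * (max - min) + min
    if e_max == e_min:
        return 0.5 * (out_max + out_min)  # Spark's convention for constant features
    return (e - e_min) / (e_max - e_min) * (out_max - out_min) + out_min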
For example:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import MinMaxScaler

df = spark.createDataFrame([(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])), (0.0, Vectors.dense([1.01, 2.02, 3.03]))], ['label', 'features'])
df.show(truncate=False)
# +-----+---------------------+
# |label|features |
# +-----+---------------------+
# |1.1 |(3,[0,2],[1.23,4.56])|
# |0.0 |[1.01,2.02,3.03] |
# +-----+---------------------+
mmScaler = MinMaxScaler(inputCol="features", outputCol="scaled")
temp_path = "/tmp/spark/"
minMaxScalerPath = temp_path + "min-max-scaler"
mmScaler.save(minMaxScalerPath)
The snippet above will save the MinMaxScaler feature transformer, so it can be loaded later with the class method load.
Now, let's take a look at what actually happened. The class method save will create the following file structure:
/tmp/spark/
└── min-max-scaler
└── metadata
├── part-00000
└── _SUCCESS
Let's check the content of that part-00000 file:
$ cat /tmp/spark/min-max-scaler/metadata/part-00000 | python -m json.tool
{
"class": "org.apache.spark.ml.feature.MinMaxScaler",
"paramMap": {
"inputCol": "features",
"max": 1.0,
"min": 0.0,
"outputCol": "scaled"
},
"sparkVersion": "2.0.0",
"timestamp": 1480501003244,
"uid": "MinMaxScaler_42e68455a929c67ba66f"
}
So actually, when you load the transformer:
loadedMMScaler = MinMaxScaler.load(minMaxScalerPath)
You are actually loading that file. It won't take a libsvm file!
Now you can apply your transformer to create the model and transform your DataFrame:
model = loadedMMScaler.fit(df)
model.transform(df).show(truncate=False)
# +-----+---------------------+-------------+
# |label|features |scaled |
# +-----+---------------------+-------------+
# |1.1 |(3,[0,2],[1.23,4.56])|[1.0,0.0,1.0]|
# |0.0 |[1.01,2.02,3.03] |[0.0,1.0,0.0]|
# +-----+---------------------+-------------+
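As a side note not covered in the original answer: if you want to persist the fitted scaler itself (so you don't have to re-fit on the data every time), the fitted MinMaxScalerModel should also support save and load in Spark 2.0. A sketch, with an illustrative path:

from pyspark.ml.feature import MinMaxScalerModel

modelPath = temp_path + "min-max-scaler-model"  # illustrative path
model.save(modelPath)
loadedModel = MinMaxScalerModel.load(modelPath)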
Now let's get back to that libsvm file. Let's create some dummy data and save it in libsvm format using MLUtils:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.util import MLUtils
data = sc.parallelize([LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])), LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))])
MLUtils.saveAsLibSVMFile(data, temp_path + "data")
Back to our file structure:
/tmp/spark/
├── data
│ ├── part-00000
│ ├── part-00001
│ ├── part-00002
│ ├── part-00003
│ ├── part-00004
│ ├── part-00005
│ ├── part-00006
│ ├── part-00007
│ └── _SUCCESS
└── min-max-scaler
└── metadata
├── part-00000
└── _SUCCESS
You can check the content of those files, which are now in libsvm format:
$ cat /tmp/spark/data/part-0000*
1.1 1:1.23 3:4.56
0.0 1:1.01 2:2.02 3:3.03
Now let's load that data and apply the model to it:
loadedData = MLUtils.loadLibSVMFile(sc, temp_path + "data")
loadedDataDF = spark.createDataFrame(loadedData.map(lambda lp : (lp.label, lp.features.asML())), ['label','features'])
loadedDataDF.show(truncate=False)
# +-----+----------------------------+
# |label|features |
# +-----+----------------------------+
# |1.1 |(3,[0,2],[1.23,4.56]) |
# |0.0 |(3,[0,1,2],[1.01,2.02,3.03])|
# +-----+----------------------------+
Note that converting MLlib Vectors to ML Vectors is very important. You can read more about it here.
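For reference, here is that conversion in isolation, a minimal sketch using the asML and Vectors.fromML helpers available since Spark 2.0:

from pyspark.mllib.linalg import Vectors as MLlibVectors

mllib_v = MLlibVectors.dense([1.01, 2.02, 3.03])  # old mllib vector
ml_v = mllib_v.asML()             # mllib -> ml (what the map above does)
back = MLlibVectors.fromML(ml_v)  # ml -> mllib, if ever needed

With the data converted to ML vectors, the saved scaler can be applied: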
model.transform(loadedDataDF).show(truncate=False)
# +-----+----------------------------+-------------+
# |label|features |scaled |
# +-----+----------------------------+-------------+
# |1.1 |(3,[0,2],[1.23,4.56]) |[1.0,0.0,1.0]|
# |0.0 |(3,[0,1,2],[1.01,2.02,3.03])|[0.0,1.0,0.0]|
# +-----+----------------------------+-------------+
I hope that this answers your question!