This article describes how to resolve an IOError raised by sklearn's joblib load function when loading a model from AWS S3.

Problem Description

I am trying to load a pkl dump of my classifier from scikit-learn.

The joblib dump compresses much better than the cPickle dump for my object, so I would like to stick with it. However, I am getting an error when trying to read the object from AWS S3.
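
For context, a minimal sketch of how the two dumps being compared here might be produced (hypothetical; `classifier` stands for the fitted scikit-learn model from the question, and the file names are illustrative):

# a sketch of the two dump methods being compared (names are illustrative)
import cPickle
from sklearn.externals import joblib

# joblib serializes numpy arrays efficiently and can compress on the fly
joblib.dump(classifier, 'classifier.pkl', compress=3)

# cPickle writes a single plain pickle stream
with open('classifier_cpickle.pkl', 'wb') as f:
    cPickle.dump(classifier, f, cPickle.HIGHEST_PROTOCOL)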

The situation:

  • Pkl object hosted locally: pickle.load works, joblib.load works
  • Pkl object pushed to Heroku with the app (loaded from a static folder): pickle.load works, joblib.load works
  • Pkl object pushed to S3: pickle.load works, joblib.load returns an IOError (tested both through the Heroku app and through a local script)

Note that the pkl objects for joblib and pickle are different objects dumped with their respective methods (i.e. joblib loads only joblib.dump(obj), and pickle loads only cPickle.dump(obj)).

Joblib vs cPickle code

# imports assumed by these snippets (keys is a local settings module)
import os
import cPickle
import urllib2
from sklearn.externals import joblib
import keys

# case 2, this works for joblib, object pushed to heroku
resources_dir = os.getcwd() + "/static/res/"  # main resource directory
classifier = joblib.load(resources_dir + 'classifier.pkl')

# case 3, this does not work for joblib, object hosted on s3
aws_app_assets = "https://%s.s3.amazonaws.com/static/res/" % keys.AWS_BUCKET_NAME
classifier_url_s3 = aws_app_assets + 'classifier.pkl'

# does not work with raw url, IO Error
classifier = joblib.load(classifier_url_s3)

# urllib2, can't open instance
# TypeError: coercing to Unicode: need string or buffer, instance found
req = urllib2.Request(url=classifier_url_s3)
f = urllib2.urlopen(req)
classifier = joblib.load(urllib2.urlopen(classifier_url_s3))

# but works with a cPickle object hosted on S3
classifier = cPickle.load(urllib2.urlopen(classifier_url_s3))

My app works fine in case 2, but because of very slow loading I wanted to try pushing all static files out to S3, particularly these pickle dumps. Is there something inherently different about the way joblib loads vs. pickle that would cause this error?

Here is my error:

File "/usr/local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 409, in load
with open(filename, 'rb') as file_handle:
IOError: [Errno 2] No such file or directory: classifier url on s3
[Finished in 0.3s with exit code 1]

It is not a permissions issue, as I've made all my objects on S3 public for testing, and the pickle.dump objects load fine. The joblib.dump object also downloads if I enter the URL directly into the browser.

I could be completely missing something.

Thanks.

Recommended Answer

joblib.load() expects the name of a file present on the filesystem.

Signature: joblib.load(filename, mmap_mode=None)
Parameters
-----------
filename: string
    The name of the file from which to load the object
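
That is the whole difference: cPickle.load reads from any file-like object (such as the handle urllib2.urlopen returns), while this version of joblib.load calls open() on whatever you pass it and therefore needs a real path, hence the IOError. A rough workaround sketch along those lines (assuming the object is publicly readable at classifier_url_s3 from the question): write the downloaded bytes to a temporary file first.

import os
import tempfile
import urllib2

from sklearn.externals import joblib

# download the raw bytes, then hand joblib a real file path to open
response = urllib2.urlopen(classifier_url_s3)
with tempfile.NamedTemporaryFile(suffix='.pkl', delete=False) as tmp:
    tmp.write(response.read())
    tmp_path = tmp.name
classifier = joblib.load(tmp_path)
os.remove(tmp_path)  # clean up the local copy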

Moreover, making all your resources public might not be a good idea for your other assets, even if you don't mind the pickled model being accessible to the world.

It is rather simple to first copy the object from S3 to the local filesystem of your worker:

from boto.s3.connection import S3Connection
from sklearn.externals import joblib
import os

# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are placeholders for your credentials
s3_connection = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
s3_bucket = s3_connection.get_bucket(keys.AWS_BUCKET_NAME)
local_file = '/tmp/classifier.pkl'
# note: boto expects the key path inside the bucket, not the full S3 URL
s3_bucket.get_key('static/res/classifier.pkl').get_contents_to_filename(local_file)
clf = joblib.load(local_file)
os.remove(local_file)
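
One caveat about the fixed '/tmp/classifier.pkl' path: concurrent workers would overwrite each other's copy. A sketch of the same download-then-load pattern with a unique temporary file (assuming the same boto setup, placeholder credentials, and bucket layout as above):

import os
import tempfile

from boto.s3.connection import S3Connection
from sklearn.externals import joblib

s3_connection = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
s3_bucket = s3_connection.get_bucket(keys.AWS_BUCKET_NAME)

# mkstemp yields a unique path; close the OS-level handle and let boto
# write the downloaded bytes into the file
fd, local_file = tempfile.mkstemp(suffix='.pkl')
os.close(fd)
try:
    s3_bucket.get_key('static/res/classifier.pkl').get_contents_to_filename(local_file)
    clf = joblib.load(local_file)
finally:
    os.remove(local_file)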

Hope this helps.

P.S. You can use this approach to pickle the entire sklearn pipeline, including feature imputation. Just beware of library version conflicts between training and predicting.
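
A minimal sketch of that idea, assuming a scikit-learn of the same era as the question (where Imputer and sklearn.externals.joblib still exist); the estimator choice is illustrative:

import sklearn
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer

# bundle imputation and the classifier into a single picklable object
pipeline = Pipeline([
    ('impute', Imputer(strategy='median')),
    ('clf', LogisticRegression()),
])
# pipeline.fit(X_train, y_train)  # fit before dumping
joblib.dump(pipeline, 'classifier.pkl', compress=3)

# record the version so the predicting side can install the same one
print(sklearn.__version__)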
