Problem description
When trying to read a file from S3 with joblib.load(), I get the error ValueError: embedded null byte.
The files were created by joblib and load successfully from local copies (made before uploading to S3), so the error is presumably in how the files are stored in or retrieved from S3.
Minimal code:
# Imports (AWS credentials assumed)
import boto3
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
s3 = boto3.resource('s3')
bucket_str = "my-aws-bucket"
bucket_key = "some-pseudo/folder-set/my-filename.joblib"
# Passing the raw bytes straight to joblib.load() raises ValueError: embedded null byte
joblib.load(s3.Bucket(bucket_str).Object(bucket_key).get()['Body'].read())
Recommended answer
The following code reconstructs a copy of the file in memory before feeding it into joblib.load(), enabling a successful load.
from io import BytesIO
import boto3
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
s3 = boto3.resource('s3')
bucket_str = "my-aws-bucket"
bucket_key = "some-pseudo/folder-set/my-filename.joblib"
# Download the object into an in-memory buffer, then hand the buffer to joblib
with BytesIO() as data:
    s3.Bucket(bucket_str).download_fileobj(bucket_key, data)
    data.seek(0)  # move back to the beginning after writing
    df = joblib.load(data)
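I assume, but am not certain, that something in how boto3 chunks files for download creates a null byte that breaks joblib, and that BytesIO fixes this before joblib.load() sees the data stream. Another likely explanation is that joblib.load() expects a file path or a file-like object, so the raw bytes returned by read() are treated as a filename, which cannot contain a null byte; BytesIO supplies the file-like object it needs.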
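One way to check where an "embedded null byte" error can come from is to reproduce it without S3 at all. The sketch below is only an illustration under stated assumptions: it uses the standard-library pickle module as a stand-in for joblib, a locally built payload as a stand-in for the bytes returned by get()['Body'].read(), and two hypothetical helpers (try_open_as_path, load_from_bytes) that are not part of boto3 or joblib. It shows that serialized bytes fail when treated as a file path but load cleanly once wrapped in BytesIO.

```python
import io
import pickle

# Stand-in for the bytes returned by get()['Body'].read() on an S3 object.
# Assumption: pickle replaces joblib here so the sketch runs without AWS or joblib.
payload = pickle.dumps({"weights": [1, 2, 3]}, protocol=4)


def try_open_as_path(raw: bytes) -> str:
    """Hypothetical helper: treat raw bytes as a file path, as a loader
    that expects a filename would, and report what happens."""
    try:
        with open(raw, "rb"):
            pass
    except ValueError as exc:
        # Paths containing a null byte are rejected before any file access.
        return f"ValueError: {exc}"
    except OSError as exc:
        return f"OSError: {exc}"
    return "opened"


def load_from_bytes(raw: bytes):
    """Hypothetical helper: wrap the bytes in a file-like object first."""
    with io.BytesIO(raw) as stream:
        return pickle.load(stream)


path_result = try_open_as_path(payload)  # fails: the bytes are not a valid path
loaded = load_from_bytes(payload)        # succeeds: the loader gets a stream
print(path_result)
print(loaded)
```

If the same reasoning applies to joblib, wrapping the downloaded bytes directly, for example joblib.load(BytesIO(raw_bytes)), should behave like the download_fileobj approach above; that equivalence is an assumption worth verifying against your own files.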
PS. With this method the file never touches the local disk, which is helpful in some circumstances (e.g. a node with plenty of RAM but very little disk space).