Apache Spark reading from S3: can't pickle thread.lock objects
Problem Description
So I want my Spark app to read some text from Amazon's S3. I wrote the following simple script:
import boto3

# sc is the SparkContext (e.g. the one provided by the pyspark shell)
s3_client = boto3.client('s3')
text_keys = ["key1.txt", "key2.txt"]
data = sc.parallelize(text_keys).flatMap(lambda key: s3_client.get_object(Bucket="my_bucket", Key=key)['Body'].read().decode('utf-8'))
When I call data.collect() I get the following error:
TypeError: can't pickle thread.lock objects
and I don't seem to find any help online. Has anyone perhaps managed to solve this?
Recommended Answer
Your s3_client isn't serialisable.
Instead of flatMap, use mapPartitions, and initialise s3_client inside the lambda body to avoid the overhead (see the sketch after this list). That will:
- initialise s3_client on each worker
- reduce the initialisation overhead
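A minimal sketch of that approach, reusing the hypothetical bucket and keys from the question; a named generator function stands in for the lambda, since mapPartitions receives the whole partition as an iterator:

import boto3

def fetch_keys(keys):
    # The client is created here, on the worker, once per partition,
    # so the unpicklable client object is never shipped from the driver.
    s3_client = boto3.client('s3')
    for key in keys:
        yield s3_client.get_object(Bucket="my_bucket", Key=key)['Body'].read().decode('utf-8')

text_keys = ["key1.txt", "key2.txt"]
data = sc.parallelize(text_keys).mapPartitions(fetch_keys)
data.collect()

Each partition now shares a single client, so the per-record connection cost disappears and nothing holding a thread lock ever has to be pickled.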