Problem Description
I tried the sc.addFile option (which works without any issues) and the --files option from the command line (which fails).
Run 1: spark_distro.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
sc.addFile("/local-path/readme.txt")

with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)

int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1).map(lambda x: import_my_special_package(x)))
External package: external_package.py
class external(object):
    def __init__(self):
        pass

    def fun(self, input):
        return input * 2
readme.txt
MY TEXT HERE
spark-submit command
spark-submit \
--master yarn-client \
--py-files /path to local codelib/external_package.py \
/local-pgm-path/spark_distro.py \
1000
Output: works as expected
['MY TEXT HERE']
But if I try to pass the file (readme.txt) from the command line using the --files option (instead of sc.addFile), it fails, as shown below.
Run 2: spark_distro.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)

with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)

int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1).map(lambda x: import_my_special_package(x)))
external_package.py is the same as above
spark-submit command
spark-submit \
--master yarn-client \
--py-files /path to local codelib/external_package.py \
--files /local-path/readme.txt#readme.txt \
/local-pgm-path/spark_distro.py \
1000
Output:
Traceback (most recent call last):
File "/local-pgm-path/spark_distro.py", line 31, in <module>
with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-42dff0d7-c52f-46a8-8323-08bccb412cd6/userFiles-8bd16297-1291-4a37-b080-bbc3836cb512/readme.txt'
Are sc.addFile and --files used for the same purpose? Can someone please share your thoughts?
Answer
I have finally figured out the issue, and it is a very subtle one indeed.
As suspected, the two options (sc.addFile and --files) are not equivalent, and this is (admittedly very subtly) hinted at in the documentation (emphasis added):
--files FILES: Comma-separated list of files to be placed in the working directory of each executor.
In plain English, while files added with sc.addFile are available to both the executors and the driver, files added with --files are available only to the executors; hence, when trying to access them from the driver (as is the case in the OP), we get a No such file or directory error.
Let's confirm this (getting rid of all the irrelevant --py-files and 1000 stuff in the OP):
test_fail.py:
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
Test:
spark-submit --master yarn \
--deploy-mode client \
--files /home/ctsats/readme.txt \
/home/ctsats/scripts/SO/test_fail.py
Result:
[...]
17/11/10 15:05:39 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0047/readme.txt
[...]
Traceback (most recent call last):
File "/home/ctsats/scripts/SO/test_fail.py", line 6, in <module>
with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-8715b4d9-a23b-4002-a1f0-63a1e9d3e00e/userFiles-60053a41-472e-4844-a587-6d10ed769e1a/readme.txt'
In the above script test_fail.py, it is the driver program that requests access to the file readme.txt; let's change the script, so that access is requested for the executors (test_success.py):
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
lines = sc.textFile("readme.txt") # run in the executors
print(lines.collect())
Test:
spark-submit --master yarn \
--deploy-mode client \
--files /home/ctsats/readme.txt \
/home/ctsats/scripts/SO/test_success.py
Result:
[...]
17/11/10 15:16:05 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0049/readme.txt
[...]
[u'MY TEXT HERE']
Notice also that here we don't need SparkFiles.get - the file is readily accessible.
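For what it's worth, SparkFiles.get should also resolve correctly when it is called inside executor-side code (rather than on the driver), since on YARN the --files copies are localized into each executor's working directory; the following is a minimal, untested sketch of that variant, submitted with the same --files /home/ctsats/readme.txt flag as above:
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)

def read_lines(_):
    # Resolved on the executor, where the --files copy of readme.txt lives
    with open(SparkFiles.get('readme.txt')) as f:
        return [line.strip() for line in f]

# A one-element RDD, just to force the read to happen inside a task
print(sc.parallelize([0], 1).flatMap(read_lines).collect())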
As said above, sc.addFile will work in both cases, i.e. when access is requested either by the driver or by the executors (tested but not shown here).
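Such a test could look roughly like the sketch below (the script name test_addfile.py and the local path are placeholders; it simply reuses the sc.addFile call from Run 1 of the OP, reading the file once on the driver and once inside a task):
# test_addfile.py (hypothetical name)
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("Use External File via addFile")
sc = SparkContext(conf=conf)
sc.addFile("/local-path/readme.txt")  # shipped to the driver *and* the executors

# Driver-side access (this is what fails with --files):
with open(SparkFiles.get('readme.txt')) as test_file:
    print([line.strip() for line in test_file])

# Executor-side access:
def read_lines(_):
    with open(SparkFiles.get('readme.txt')) as f:
        return [line.strip() for line in f]

print(sc.parallelize([0], 1).flatMap(read_lines).collect())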
Regarding the order of the command-line options: as I have argued elsewhere, all Spark-related arguments must come before the script to be executed; arguably, the relative order of --files and --py-files is irrelevant (leaving it as an exercise).
Tested with both Spark 1.6.0 & 2.2.0.
UPDATE (after the comments): Seems that my fs.defaultFS setting points to HDFS, too:
$ hdfs getconf -confKey fs.defaultFS
hdfs://host-hd-01.corp.nodalpoint.com:8020
But let me focus on the forest here (instead of the trees, that is), and explain why this whole discussion is of academic interest only:
Passing files to be processed with the --files flag is bad practice; in hindsight, I can now see why I could find almost no usage references online - probably nobody uses it in practice, and with good reason.
(Notice that I am not talking about --py-files, which serves a different, legitimate role.)
Since Spark is a distributed processing framework, running over a cluster and a distributed file system (HDFS), the best thing to do is to have all the files to be processed already in HDFS - period. The "natural" place for files to be processed by Spark is HDFS, not the local FS - although there are some toy examples using the local FS for demonstration purposes only. What's more, if at some point in the future you want to change the deploy mode to cluster, you'll discover that the cluster, by default, knows nothing of local paths and files, and rightfully so...
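As a rough sketch of what that recommendation looks like in practice (the HDFS path below is hypothetical, and the hostname is simply the fs.defaultFS value shown above), the file is uploaded to HDFS once and then read directly, with no --files or sc.addFile involved:
# One-off upload, outside Spark:
#   hdfs dfs -put /home/ctsats/readme.txt /user/ctsats/readme.txt

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Read File from HDFS")
sc = SparkContext(conf=conf)

# With fs.defaultFS pointing to HDFS, "/user/ctsats/readme.txt" would also resolve to HDFS.
lines = sc.textFile("hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/readme.txt")
print(lines.collect())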