Problem description
I am writing data into an S3 bucket and creating Parquet files using PySpark. My bucket structure looks like this:
s3a://rootfolder/subfolder/table/
The two folders subfolder and table should be created at run time if they do not exist; if they already exist, the Parquet files should be written inside the table folder.
When I run the PySpark program from my local machine, it creates an extra object with a _$folder$ suffix (like table_$folder$), but when the same program runs on EMR it creates a _SUCCESS file instead.
Writing into S3 (PySpark program):
data.write.parquet("s3a://rootfolder/sub_folder/table/", mode="overwrite")
Is there a way to create a folder in S3 only if it does not exist, without creating objects like table_$folder$ or _SUCCESS?
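One hedged note on the _SUCCESS part of the question: that file is written by Hadoop's FileOutputCommitter after a successful job, and on setups using that default committer it can usually be suppressed with a standard Hadoop property, for example in spark-defaults.conf (the spark.hadoop. prefix passes the property through to the Hadoop configuration; whether a non-default committer, such as EMR's, honors it is not guaranteed):

```
spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs false
```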
Recommended answer
The s3a connector (org.apache.hadoop.fs.s3a.S3AFileSystem) doesn't create $folder$ files. It generates directory markers as path + "/". For example, mkdir s3a://bucket/a/b creates a zero-byte marker object /a/b/. This differentiates it from a file, which would have the path /a/b.
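The naming convention above can be sketched in a few lines of plain Python (the helper function is illustrative only, not part of any library): under s3a, a "directory" is just a zero-byte object whose key ends with "/", while a file's key has no trailing slash.

```python
def is_s3a_directory_marker(key: str, size: int) -> bool:
    """Illustrative helper: s3a marks a directory with a zero-byte
    object whose key ends in "/" (e.g. "a/b/"), which distinguishes
    it from a file stored at the slash-free path ("a/b")."""
    return size == 0 and key.endswith("/")

# mkdir s3a://bucket/a/b produces the zero-byte marker object "a/b/"
print(is_s3a_directory_marker("a/b/", 0))    # → True
# a real file at "a/b" is not a directory marker
print(is_s3a_directory_marker("a/b", 1024))  # → False
```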
- If, locally, you are using an s3n: URL: stop. Use the s3a connector instead.
- If you have been setting the fs.s3a.impl option: stop. Hadoop knows what to use, and it uses the S3AFileSystem class.
- If you are seeing these markers while running on EMR, that's EMR's own connector. Closed source, out of scope.
This concludes this article on dynamically creating folders in an S3 bucket from a PySpark job. Hopefully the recommended answer helps.