Creating folders dynamically in an S3 bucket from a PySpark job

Question

I am writing data into an S3 bucket and creating parquet files using PySpark. My bucket structure looks like this:

s3a://rootfolder/subfolder/table/

The two folders subfolder and table should be created at run time if they do not exist. If they already exist, the parquet files should be written inside the table folder.

When I run the PySpark program from my local machine, it creates an extra object with a _$folder$ suffix (such as table_$folder$), but when the same program runs on EMR, it creates a _SUCCESS object instead.

Writing to S3 (PySpark program):
 data.write.parquet("s3a://rootfolder/sub_folder/table/", mode="overwrite")

Is there a way to create only the folders in S3 if they do not exist, without also creating objects such as table_$folder$ or _SUCCESS?

Recommended answer

The s3a connector (org.apache.hadoop.fs.s3a.S3AFileSystem) doesn't create $folder$ files. It generates directory markers as path + "/". For example, mkdir s3a://bucket/a/b creates a zero-byte marker object /a/b/. The trailing slash differentiates it from a file, which would have the path /a/b.
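To make the naming rule concrete, here is a minimal sketch in plain Python that derives the two object keys for a given s3a URL. The helper names file_key and directory_marker_key are hypothetical and only illustrate the convention described above; they are not part of Hadoop or PySpark. The keys are shown without the leading slash used in the prose, on the assumption that the bucket name is not part of the key.

```python
from urllib.parse import urlsplit


def file_key(url: str) -> str:
    """Key a plain file would have: the URL path without surrounding slashes."""
    return urlsplit(url).path.strip("/")


def directory_marker_key(url: str) -> str:
    """Key of the zero-byte directory marker: the file key plus a trailing slash."""
    return file_key(url) + "/"


if __name__ == "__main__":
    url = "s3a://bucket/a/b"
    print(file_key(url))              # key for a file at this path
    print(directory_marker_key(url))  # key for a directory marker at this path
```

The only difference between the two keys is the trailing slash, which is how a directory marker can coexist with the idea of a file at the same path.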

  1. If you are using an s3n: URL locally, stop. Use the s3a connector.
  2. If you have been setting the fs.s3a.impl option, stop. Hadoop knows what to use, and it uses the S3AFileSystem class.
  3. If you are seeing $folder$ files while running on EMR, that is EMR's connector. It is closed source and out of scope here.
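The first two points can be sketched roughly as follows. The helper names to_s3a, build_session and write_table are hypothetical. The session builder deliberately leaves fs.s3a.impl unset. The mapreduce.fileoutputcommitter.marksuccessfuljobs setting is not part of the answer above: it is added on the assumption that the job uses the default file output committer and that the _SUCCESS marker is unwanted. Running against a real bucket also assumes the hadoop-aws module and credentials are available to Spark.

```python
# Sketch only: these helpers are illustrative and not part of PySpark.

def to_s3a(url: str) -> str:
    """Rewrite a legacy s3n:// URL to s3a://; any other URL passes through unchanged."""
    legacy = "s3n://"
    if url.startswith(legacy):
        return "s3a://" + url[len(legacy):]
    return url


def build_session(app_name: str = "parquet-writer"):
    """Create a SparkSession without setting fs.s3a.impl."""
    from pyspark.sql import SparkSession  # lazy import; requires pyspark

    return (
        SparkSession.builder
        .appName(app_name)
        # Assumption, not from the answer: ask the committer to skip _SUCCESS.
        .config(
            "spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs",
            "false",
        )
        .getOrCreate()
    )


def write_table(df, url: str) -> str:
    """Write df as parquet to the s3a form of url and return the path used."""
    target = to_s3a(url)
    df.write.parquet(target, mode="overwrite")
    return target
```

With a DataFrame named data, as in the question, the call would look like write_table(data, "s3a://rootfolder/sub_folder/table/"). No explicit folder creation is needed, because writing the parquet files under that prefix is what makes the folders appear in S3.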


07-31 14:48