Question
I need to read a file from a GCS bucket. I know I'll have to use the GCS API or client libraries, but I cannot find any examples of how to do it.
I have been referring to this link in the GCS documentation: GCS Client Libraries. But I couldn't really make a dent. If anybody can provide an example, that would really help. Thanks.
Recommended answer
OK. If you want to simply read files from GCS, not as a PCollection but as regular files, and if you are having trouble with the GCS Java client libraries, you can also use the Apache Beam FileSystems API:
First, you need to make sure that you have a Maven dependency in your pom.xml on beam-sdks-java-extensions-google-cloud-platform-core, which contains the implementation of the gs:// filesystem:
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-extensions-google-cloud-platform-core</artifactId>
    <!-- Add a <version> matching your Beam SDK unless a parent POM or BOM manages it. -->
</dependency>
Then set up the FileSystems API (it is set up by default in all pipelines, but if you're using it outside a pipeline, you need to do it manually).
PipelineOptions options = PipelineOptionsFactory.create();
// ...Optionally fill in options such as GCP credentials...
// (see GcpOptions class)
FileSystems.setDefaultPipelineOptions(options);
Then you can use it:
ReadableByteChannel chan = FileSystems.open(FileSystems.matchNewResource(
    "gs://path/to/your/file", false /* is_directory */));
try (InputStream stream = Channels.newInputStream(chan)) {
    // Use regular Java utilities to work with the input stream.
}
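As a rough illustration of the "regular Java utilities" step, here is a minimal sketch that reads a ReadableByteChannel line by line using only the JDK. The class name ChannelLines and the helper readLines are hypothetical names made up for this example, and it assumes the file is UTF-8 text. To keep it runnable without GCP credentials, main uses an in-memory channel as a stand-in; in real use you would pass in the channel returned by FileSystems.open(...).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Beam or the GCS client libraries.
public class ChannelLines {

    /** Reads every line from the channel as UTF-8 text, then closes the channel. */
    public static List<String> readLines(ReadableByteChannel chan) throws IOException {
        List<String> lines = new ArrayList<>();
        // Closing the reader closes the wrapped stream and the underlying channel.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Channels.newInputStream(chan), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the channel that FileSystems.open(...) would return.
        ReadableByteChannel chan = Channels.newChannel(
                new ByteArrayInputStream("first line\nsecond line\n".getBytes(StandardCharsets.UTF_8)));
        System.out.println(readLines(chan)); // prints [first line, second line]
    }
}
```

Reading everything into a list is fine for small files; for large objects you would probably want to process each line as it is read instead of collecting them all.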