Skipping header rows: is it possible with Cloud DataFlow?
Problem description
I've created a pipeline that reads from a file in GCS, transforms it, and finally writes to a BQ table. The file contains a header row (the field names).
Is there any way to programmatically set the "number of header rows to skip", as you can in BQ when loading data?
Recommended answer
This is not currently possible. It sounds like there are two potential requests here:
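- Specifying the existence of a header row, and the skip behavior for it, for a BigQuery import.
- Specifying that a GCS text source should skip a header row.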
Future work on this is tracked at https://issues.apache.org/jira/browse/BEAM-123.
In the meantime, you could add a simple filter to your ParDo code to skip headers. Something like this:
PCollection<X> rows = ...;
// MatchIfNonHeader is a user-defined predicate that returns true for every row except the header.
PCollection<X> nonHeaders =
    rows.apply(Filter.by(new MatchIfNonHeader()));
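The answer leaves MatchIfNonHeader undefined, so here is a minimal sketch of what such a predicate might look like. It is written in plain Java so that it runs without any pipeline dependencies, and it makes several assumptions that are not in the original: rows are plain String lines, the exact header text is known in advance and passed to the constructor, and the SkipHeaderSketch class and dropHeader helper exist only for illustration. In a real pipeline the class would presumably implement the SDK's SerializableFunction<String, Boolean> (with an apply method) rather than java.util.function.Predicate, so that it can be passed to Filter.by.

```java
import java.io.Serializable;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical predicate: returns true for every line except the known header.
// Serializable because pipeline functions are typically shipped to workers.
class MatchIfNonHeader implements Predicate<String>, Serializable {
    private final String header;

    // Assumption: the caller knows the exact header text up front.
    MatchIfNonHeader(String header) {
        this.header = header;
    }

    @Override
    public boolean test(String line) {
        // Match by content, not by position.
        return !line.equals(header);
    }
}

// Illustration-only driver that applies the predicate to an in-memory list.
class SkipHeaderSketch {
    static List<String> dropHeader(List<String> lines, String header) {
        return lines.stream()
                .filter(new MatchIfNonHeader(header))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of("id,name", "1,alice", "2,bob");
        System.out.println(dropHeader(lines, "id,name"));
    }
}
```

The predicate compares content rather than position because the elements of a PCollection have no guaranteed order, so there is no reliable "first row" to drop. The trade-off is that any data row identical to the header text would be filtered out as well.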