Problem Description
PCollection<String> p1 = {"a","b","c"}
PCollection<KV<Integer,String>> p2 = p1.apply("some operation")
//{(1,"a"),(2,"b"),(3,"c")}
I need to make this scalable to large files (as with Apache Spark), so that it works like:
sc.textFile("./filename").zipWithIndex
My goal is to preserve the order between rows within a large file by assigning row numbers in a scalable way.
How can I achieve this result with Apache Beam?
A related post: zipWithIndex on Apache Flink
Answer
This is interesting. So if I understand your algorithm, it would be something like (pseudocode):
A = ReadWithShardedLineNumbers(myFile) : output K<ShardOffset+LocalLineNumber>, V<Data>
B = A.ExtractShardOffsetKeys() : output K<ShardOffset>, V<LocalLineNumber>
C = B.PerKeySum() : output K<ShardOffset>, V<ShardTotalLines>
D = C.GlobalSortAndPrefixSum() : output K<ShardOffset> V<ShardLineNumberOffset>
E = [A,D].JoinAndCalculateGlobalLineNumbers() : output V<GlobalLineNumber+Data>
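To make the data flow concrete, here is a toy run with two read shards (the byte offsets 0 and 42 are made up, line numbers are 0-based, and B merely re-keys A, so it is omitted):

A = {(0,0)->"a", (0,1)->"b", (0,2)->"c", (42,0)->"d", (42,1)->"e"}
C = {0->3, 42->2}        // lines per shard
D = {0->0, 42->3}        // prefix sums = each shard's starting global line number
E = {(0,"a"), (1,"b"), (2,"c"), (3,"d"), (4,"e")}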
This makes two assumptions:

- ReadWithShardedLineNumbers: Sources can output their shard offset, and the offsets are globally ordered.
- GlobalSortAndPrefixSum: The totals for all read shards can fit in memory to perform a total sort.
Assumption #2 will not hold true for all data sizes, and varies by runner depending on how granular the read shards are. But it seems feasible for some practical subset of file sizes.
Also, I believe the pseudo-code above is representable in Beam, and would not require SplittableDoFn.
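As a rough sketch (not taken from the original answer) of how steps B through E might look in the Beam Java SDK: it assumes step A has already produced a PCollection keyed by (shard byte offset, local line number), which is exactly the part covered by assumption #1, and the transform and variable names below are made up for illustration.

import java.util.Map;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;

// A: assumed to already exist -- each element is keyed by
// (shard byte offset, local line number). Producing this from a file is
// the part that depends on a source exposing its shard offsets.
PCollection<KV<KV<Long, Long>, String>> shardedLines = null; // placeholder for step A

// B + C: reduce each shard offset to its total number of lines.
PCollection<KV<Long, Long>> linesPerShard = shardedLines
    .apply("ExtractShardOffset",
        MapElements.into(TypeDescriptors.longs())
            .via((KV<KV<Long, Long>, String> kv) -> kv.getKey().getKey()))
    .apply("CountLinesPerShard", Count.perElement());

// D: make the per-shard totals available to every worker as a side input.
// This is where assumption #2 bites: the whole map must fit in memory.
final PCollectionView<Map<Long, Long>> shardTotalsView =
    linesPerShard.apply("ShardTotalsAsMap", View.asMap());

// E: a line's global number is the sum of the totals of all shards that
// start before its shard, plus its local line number (0-based here).
PCollection<KV<Long, String>> numberedLines = shardedLines.apply(
    "AssignGlobalLineNumbers",
    ParDo.of(new DoFn<KV<KV<Long, Long>, String>, KV<Long, String>>() {
      @ProcessElement
      public void process(ProcessContext c) {
        Map<Long, Long> shardTotals = c.sideInput(shardTotalsView);
        long shardOffset = c.element().getKey().getKey();
        long localLineNumber = c.element().getKey().getValue();
        long shardStart = shardTotals.entrySet().stream()
            .filter(e -> e.getKey() < shardOffset)
            .mapToLong(Map.Entry::getValue)
            .sum();
        c.output(KV.of(shardStart + localLineNumber, c.element().getValue()));
      }
    }).withSideInputs(shardTotalsView));

The final DoFn recomputes the prefix sum for each element, which keeps the sketch short; precomputing each shard's starting offset once and passing those as the side input would avoid the repeated scan. Add 1 to the output key if you want 1-based numbering as in the question's example.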