Problem Description
I'm trying to run an Apache Beam pipeline on Flink on our test cluster. It has been failing with an EOFException at
org.apache.flink.runtime.io.disk.SimpleCollectingOutputView:79
during the encoding of an object through serialisation. I haven't been able to reproduce the error locally yet. You can find the entire job log here; some values have been replaced with fake data.
Command used to run the pipeline:
bin/flink run \
-m yarn-cluster \
--yarncontainer 1 \
--yarnslots 4 \
--yarnjobManagerMemory 2000 \
--yarntaskManagerMemory 2000 \
--yarnname "EBI" \
pipeline.jar \
--runner=FlinkRunner \
--zookeeperQuorum=hdp-master-001.fake.org:2181
While I think it's not related, the object to be serialised is serialisable and has been given both an implicit and an explicit coder, but neither affects the situation.
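For context, a minimal sketch of how an explicit coder can be attached in Beam; Event is a hypothetical stand-in for the pipeline's actual record class:

import java.io.Serializable;

import org.apache.beam.sdk.coders.DefaultCoder;
import org.apache.beam.sdk.coders.SerializableCoder;

// Hypothetical record type standing in for the pipeline's real element class.
// @DefaultCoder attaches SerializableCoder implicitly at the type level;
// calling setCoder(SerializableCoder.of(Event.class)) on the PCollection
// is the explicit route.
@DefaultCoder(SerializableCoder.class)
public class Event implements Serializable {
  public String id;
  public long timestamp;
}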
What might be causing this situation and what can I do to address it?
For now, increasing the manager's heap memory to somewhere between 4 and 8 GiB seems to prevent the exception. I'm still not sure whether this should be normal Flink behaviour (shouldn't it spill to disk?), and it doesn't seem like a solution that scales.
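For reference, this is the submission command with the memory bumped; it's an assumption on my part that the task manager memory (where the serialisation happens) is the one that was raised, and 8000 is just an example value:
bin/flink run \
-m yarn-cluster \
--yarncontainer 1 \
--yarnslots 4 \
--yarnjobManagerMemory 2000 \
--yarntaskManagerMemory 8000 \
--yarnname "EBI" \
pipeline.jar \
--runner=FlinkRunner \
--zookeeperQuorum=hdp-master-001.fake.org:2181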
Accepted Answer
The EOFException is thrown because Flink ran out of memory buffers. Flink expects an EOFException as a notification to start writing data to disk.
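To make the mechanism concrete, here is an illustrative sketch of that pattern, not Flink's actual code: the in-memory view signals buffer exhaustion with an EOFException, and the caller treats it as the cue to spill to disk.

import java.io.EOFException;
import java.io.IOException;

// Illustrative only: a simplified stand-in for Flink's in-memory output view,
// which (like SimpleCollectingOutputView) throws EOFException once its
// memory segments are used up.
public class SpillOnEofSketch {

  interface OutputView {
    void write(byte[] record) throws IOException;
  }

  static class InMemoryView implements OutputView {
    private final byte[] buffer = new byte[1024];
    private int position = 0;

    @Override
    public void write(byte[] record) throws IOException {
      if (position + record.length > buffer.length) {
        // Out of memory buffers: EOFException is the expected signal.
        throw new EOFException("no memory segments left");
      }
      System.arraycopy(record, 0, buffer, position, record.length);
      position += record.length;
    }
  }

  // The caller catches the EOFException and falls back to disk; if the
  // exception arrives wrapped in something else, this catch never fires.
  static void collect(OutputView inMemory, OutputView onDisk, byte[] record)
      throws IOException {
    try {
      inMemory.write(record);
    } catch (EOFException e) {
      onDisk.write(record);
    }
  }
}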
This problem is caused by Beam's SerializableCoder wrapping the EOFException in a CoderException. Hence, Flink does not catch the expected EOFException and fails.
The problem can be solved by using a custom coder that does not wrap the EOFException but forwards it.
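A minimal sketch of such a coder, assuming Java serialisation like Beam's SerializableCoder; the class name ForwardingSerializableCoder is hypothetical:

import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.coders.CustomCoder;

// Hypothetical coder: serialises values like SerializableCoder, but does
// not catch and wrap IOExceptions, so an EOFException thrown by Flink's
// output view propagates unchanged and Flink can start spilling to disk.
public class ForwardingSerializableCoder<T extends Serializable> extends CustomCoder<T> {

  private final Class<T> type;

  private ForwardingSerializableCoder(Class<T> type) {
    this.type = type;
  }

  public static <T extends Serializable> ForwardingSerializableCoder<T> of(Class<T> type) {
    return new ForwardingSerializableCoder<>(type);
  }

  @Override
  public void encode(T value, OutputStream outStream) throws IOException {
    // No try/catch around the write: the EOFException must reach Flink.
    ObjectOutputStream oos = new ObjectOutputStream(outStream);
    oos.writeObject(value);
    oos.flush();
  }

  @Override
  public T decode(InputStream inStream) throws IOException {
    try {
      ObjectInputStream ois = new ObjectInputStream(inStream);
      return type.cast(ois.readObject());
    } catch (ClassNotFoundException e) {
      // Only non-IO failures get wrapped; IO exceptions still propagate.
      throw new CoderException("unable to deserialize record", e);
    }
  }
}

It would then be attached explicitly, e.g. events.setCoder(ForwardingSerializableCoder.of(Event.class)), so that it takes precedence over the coder the registry would otherwise pick.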