Problem Description
I'm using the AWS Java SDK in an Apache Spark job to populate a DynamoDB table with data extracted from S3. The Spark job writes data with single PutItem requests, at a very intense rate (three m3.xlarge nodes used only for writing), and without any retry policy of its own.
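For context, the write path looks roughly like the minimal sketch below, using the AWS SDK for Java v1. The table name, key, and attribute values are placeholders, since the actual job code isn't shown; the default client keeps the SDK's built-in retry/backoff behavior, and the application adds no retry handling on top.

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

import java.util.HashMap;
import java.util.Map;

public class DynamoDbWriter {
    public static void main(String[] args) {
        // Default client: the SDK's built-in retry/backoff policy stays in effect,
        // but the application code adds no retry handling of its own.
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // One PutItem call per record; table name and attributes are placeholders.
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("id", new AttributeValue().withS("record-1"));
        item.put("payload", new AttributeValue().withS("value extracted from S3"));

        client.putItem(new PutItemRequest().withTableName("my-table").withItem(item));
    }
}
```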
The DynamoDB docs state that the AWS SDK has a backoff policy, but that a ProvisionedThroughputExceededException can eventually be raised if the rate stays too high. My Spark job ran for three days and was constrained only by the DynamoDB provisioned throughput (equal to 500 units), so I expect the request rate was extremely high and the retry queue was extremely long; however, I saw no sign of thrown exceptions or lost data.
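To make the relationship between the SDK's backoff and the exception concrete, here is a hedged sketch: in the v1 Java SDK, the DynamoDB-specific retry policy (10 retries by default) absorbs most throttling internally, and ProvisionedThroughputExceededException only reaches application code once those retries are exhausted. The table name, attribute, and custom retry count below are illustrative, not taken from the original job.

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.retry.PredefinedRetryPolicies;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughputExceededException;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

import java.util.Collections;

public class ThrottleAwareWriter {
    public static void main(String[] args) {
        // Raise the SDK's DynamoDB retry count above the default of 10 before
        // ProvisionedThroughputExceededException is propagated to the caller.
        ClientConfiguration config = new ClientConfiguration().withRetryPolicy(
                PredefinedRetryPolicies.getDynamoDBDefaultRetryPolicyWithCustomMaxRetries(15));

        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
                .withClientConfiguration(config)
                .build();

        PutItemRequest request = new PutItemRequest()
                .withTableName("my-table") // placeholder table name
                .withItem(Collections.singletonMap("id", new AttributeValue().withS("record-1")));

        try {
            client.putItem(request);
        } catch (ProvisionedThroughputExceededException e) {
            // Only thrown after the SDK has exhausted its internal retries with backoff.
            System.err.println("Write throttled after SDK retries were exhausted: " + e.getMessage());
        }
    }
}
```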
So, my question is: when is it possible to get an exception while writing to DynamoDB at a very high rate?
Recommended Answer
You can also get a throughput exception if you have a hot partition. Because throughput is divided between partitions, each partition has a lower limit than the total provisioned throughput, so if you write to the same partition often, you can hit that limit even though you are not using the full provisioned throughput.
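As a hypothetical illustration of the hot-partition point (the table name and key schema below are made up), writes that all share one partition key value concentrate traffic on a single partition, while a high-cardinality key spreads it across partitions:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;

import java.util.HashMap;
import java.util.Map;

public class HotPartitionExample {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        for (int i = 0; i < 1000; i++) {
            Map<String, AttributeValue> item = new HashMap<>();
            // Hot partition: every item shares the same partition key value, so all
            // writes land on one partition and can be throttled even though the
            // table-level provisioned throughput is not exhausted.
            item.put("pk", new AttributeValue().withS("2017-01-01"));
            item.put("sk", new AttributeValue().withS("event-" + i));
            client.putItem("events", item); // "events" table and keys are hypothetical

            // Better: a high-cardinality partition key spreads writes across partitions, e.g.
            // item.put("pk", new AttributeValue().withS(java.util.UUID.randomUUID().toString()));
        }
    }
}
```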
Another thing to consider is that DynamoDB accumulates unused throughput and uses it as burst capacity, so extra throughput is available for a short duration if you go above your limit briefly.
EDIT: DynamoDB now has a new adaptive capacity feature, which somewhat solves the problem of hot partitions by redistributing the total throughput unequally across partitions.