Slow throughput after a group by

Problem description

I noticed that in my jobs the throughput (reported number of records/sec) drops significantly after a "group by" step. When that workflow step executes, I see that some instances have CPU utilization of ~30%, while others seem to be idle.

Is this a Dataflow issue, or should I somehow instruct the workflow to increase the parallelism of this step?

Thanks, G

Recommended answer

It's hard to know for sure what's happening without knowing more specifics about what your pipeline is doing.

In general, throughput (number of records/sec) depends on several factors, such as:

The size of the records.
The amount of processing your ParDos are doing.

In general, a GroupByKey constructs a larger record consisting of a key and all the values with that key; i.e. the input is a collection of KV<K, V> and the output is a collection of KV<K, Iterable<V>>.

As a result, I'd expect the records output by a GroupByKey to be much larger than the input records. Since the records are larger, they take longer to process, so records/sec would tend to decrease.
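To make the record-count effect concrete, here is a minimal sketch in plain Java. It does not use the Dataflow SDK: the class GroupByKeySketch and its groupByKey method are hypothetical stand-ins that only imitate the shape of the transformation (many small key/value pairs in, one record per key out), and the 1000-record, 10-key input is made up for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupByKeySketch {

    // Hypothetical stand-in for a grouping step: gathers every value that
    // shares a key into one list, so each output record holds many values.
    public static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> input) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> kv : input) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Made-up input: 1000 small records spread evenly over 10 keys.
        List<Map.Entry<String, Integer>> input = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            input.add(new AbstractMap.SimpleImmutableEntry<>("key" + (i % 10), i));
        }

        Map<String, List<Integer>> grouped = groupByKey(input);

        System.out.println("input records:            " + input.size());
        System.out.println("output records:           " + grouped.size());
        System.out.println("values per output record: " + grouped.get("key0").size());
    }
}
```

With this input the step emits 10 records instead of 1000, each carrying 100 values. A records/sec counter measured after such a step would therefore be expected to read lower even if the same amount of data is flowing through.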

The low CPU utilization is not unexpected in the Alpha release of Dataflow. Right now, Dataflow does not take full advantage of all of a VM's cores to process work. A number of performance improvements are coming to address this.

Dataflow currently provides two knobs for tuning the amount of parallelism, via the flags:

--numWorkers=<integer>
--workerMachineType=<Name of GCE VM Machine Type>

--numWorkers allows you to increase or decrease the number of workers used to process your data in parallel. In general, increasing the number of workers allows more data to be processed in parallel.

Using --workerMachineType, you can pick a machine with more or less CPU or RAM.

If you notice your VMs' CPUs being underutilized, you can pick a machine type with fewer CPUs (by default Dataflow uses 4 CPUs per VM). If you reduce the CPUs per machine but increase numWorkers so that the total number of CPUs stays about the same, you might be able to increase the amount of parallelism without increasing the cost of your job.
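The trade-off above is simple arithmetic, sketched below in plain Java. The class WorkerSizingSketch and its method are hypothetical helpers, not part of Dataflow. The starting point of 3 workers is a made-up example, and n1-standard-1 is assumed here as a 1-CPU GCE machine type; check the current machine type list before using it.

```java
public class WorkerSizingSketch {

    // Hypothetical helper: how many workers of the new size are needed so that
    // the total CPU count is at least what the job had before (rounds up).
    public static int workersForSameCpuTotal(int currentWorkers,
                                             int currentCpusPerWorker,
                                             int newCpusPerWorker) {
        int totalCpus = currentWorkers * currentCpusPerWorker;
        return (totalCpus + newCpusPerWorker - 1) / newCpusPerWorker;
    }

    public static void main(String[] args) {
        int currentWorkers = 3;       // made-up example value
        int currentCpusPerWorker = 4; // the default mentioned above
        int newCpusPerWorker = 1;     // assumed size of the smaller machine type

        int newWorkers = workersForSameCpuTotal(
                currentWorkers, currentCpusPerWorker, newCpusPerWorker);

        // The machine type name is an assumption used only for illustration.
        System.out.println("--numWorkers=" + newWorkers
                + " --workerMachineType=n1-standard-1");
    }
}
```

In this example, 3 workers with 4 CPUs each become 12 single-CPU workers, so the total stays at 12 CPUs while the number of workers processing data in parallel goes up. Whether that actually improves throughput depends on the pipeline.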

Right now, Dataflow only gives you these very coarse knobs for controlling the amount of parallelism at a global level (as opposed to per stage). This might change in the future. However, in general our goal is to tune the amount of parallelism for you automatically, so you don't have to.
