How to get the total number of records processed by Spark Streaming?

This article explains how to get the total number of records processed by a Spark Streaming application.

Problem description

Does anyone know how Spark computes its number of records (I think it is the same as the number of events in a batch), as displayed here?

I'm trying to figure out how I can get this value remotely (a REST API does not exist for the Streaming tab in the UI).

Basically, what I'm trying to do is get the total number of records processed by my application. I need this information for a web portal.

I tried to count the records for each stage, but it gave me a completely different number from the one in the picture above. Each stage contains information about its records, as shown here.

I'm using this short Python script to sum the "inputRecords" from each stage. This is the source code:

import json
import urllib.request

print("Get stages script started!")

# REST API endpoint listing every stage of the application.
url = 'http://10.16.31.211:4040/api/v1/applications/app-20161104125052-0052/stages/'
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read())

print(len(data))

stages = []
input_counter = 0
for item in data:
    stages.append(item["stageId"])
    # Sum the input records reported by each stage.
    input_counter += item["inputRecords"]

print("Records processed: " + str(input_counter))

If I understood it correctly: each batch has one job, each job has multiple stages, and these stages have multiple tasks.

So it made sense to me to count the input for each stage.
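
For what it's worth, the same REST API also exposes this hierarchy directly: the sketch below queries the /jobs endpoint, where each job reports the IDs of the stages it ran. The host and application ID are the ones from the script above; the jobId, stageIds, and numTasks field names follow the Spark REST API response:

import json
import urllib.request

# Same driver host and application ID as in the stages script above.
url = 'http://10.16.31.211:4040/api/v1/applications/app-20161104125052-0052/jobs'
with urllib.request.urlopen(url) as response:
    jobs = json.loads(response.read())

for job in jobs:
    # Each job lists the stages it consists of, reflecting the
    # batch -> job -> stages -> tasks hierarchy.
    print(job["jobId"], job["stageIds"], job["numTasks"])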

Recommended answer

Spark offers a metrics endpoint on the driver:

<driver-host>:<ui-port>/metrics/json

A Spark Streaming application reports all the metrics available in the UI, and some more. The ones you are potentially looking for are:

<driver-id>.driver.<job-id>.StreamingMetrics.streaming.totalProcessedRecords: {
value: 48574640
},
<driver-id>.driver.<job-id>.StreamingMetrics.streaming.totalReceivedRecords: {
value: 48574640
}
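
Since this servlet returns standard Dropwizard-style JSON with a top-level "gauges" map, the counters can be pulled remotely with a few lines of Python. A minimal sketch, assuming the driver host and port from the question; because the exact key prefix contains the driver and job IDs, it matches on the metric-name suffix instead:

import json
import urllib.request

# Hypothetical driver address -- substitute your driver host and UI port.
url = 'http://10.16.31.211:4040/metrics/json'
with urllib.request.urlopen(url) as response:
    metrics = json.loads(response.read())

# Streaming metrics are registered as gauges; matching on the suffix keeps
# the script independent of the driver and job IDs in the key prefix.
for name, gauge in metrics["gauges"].items():
    if name.endswith("StreamingMetrics.streaming.totalProcessedRecords"):
        print("Records processed:", gauge["value"])

Matching on the suffix also keeps the script independent of the application ID, unlike the hard-coded stages URL in the question.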

This endpoint can be customized. See Spark Metrics for details.
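
For example, instead of polling the servlet, the metrics system can push to one of the built-in sinks configured in conf/metrics.properties. A minimal sketch enabling the CSV sink (the period and output directory are illustrative):

# conf/metrics.properties -- write all metrics, including the streaming
# counters above, to CSV files every 10 seconds.
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics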
