Problem Description
Does anyone know how Spark computes its number of records (I think it is the same as the number of events in a batch), as displayed here?

I'm trying to figure out how to get this value remotely (no REST API exists for the Streaming tab in the UI).

Basically, what I'm trying to do is get the total number of records processed by my application. I need this information for a web portal.
I tried counting the Records for each stage, but it gave me a completely different number than the one in the picture above. Each stage contains information about its records, as shown here.

I'm using this short Python script to sum the "inputRecords" from each stage. This is the source code:
import json
from urllib.request import urlopen

print("Get stages script started!")

# REST API endpoint listing all stages of the application
url = 'http://10.16.31.211:4040/api/v1/applications/app-20161104125052-0052/stages/'

with urlopen(url) as response:
    data = json.loads(response.read())

print(len(data))

# Sum the input records reported by every stage
stages = []
input_counter = 0
for item in data:
    stages.append(item["stageId"])
    input_counter += item["inputRecords"]

print("Records processed: " + str(input_counter))
If I understood it correctly: each Batch has one Job, each Job has multiple Stages, and these Stages have multiple Tasks.

So for me it made sense to count the input for each Stage.
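The batch → job → stage mapping described above can also be inspected through the same REST API: `/api/v1/applications/<app-id>/jobs` returns, for each job, the list of `stageIds` it ran. A minimal sketch (the URL in the comment reuses the host and app id from the script above; the sample payload is made up to show the response shape):

```python
import json
from urllib.request import urlopen

def stage_ids_by_job(jobs):
    """Map each job id to the stage ids it ran, given the JSON list
    returned by /api/v1/applications/<app-id>/jobs."""
    return {job["jobId"]: job["stageIds"] for job in jobs}

# Against a live application you would fetch the list first, e.g.:
#   url = "http://10.16.31.211:4040/api/v1/applications/app-20161104125052-0052/jobs"
#   jobs = json.loads(urlopen(url).read())
# Here, a made-up payload illustrates the shape of the response:
jobs = [{"jobId": 0, "stageIds": [0, 1]}, {"jobId": 1, "stageIds": [2]}]
print(stage_ids_by_job(jobs))  # {0: [0, 1], 1: [2]}
```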
Answer
Spark offers a metrics endpoint on the driver:

<driver-host>:<ui-port>/metrics/json

A Spark Streaming application will report all metrics available in the UI, and some more. The ones you are potentially looking for are:
<driver-id>.driver.<job-id>.StreamingMetrics.streaming.totalProcessedRecords: {
value: 48574640
},
<driver-id>.driver.<job-id>.StreamingMetrics.streaming.totalReceivedRecords: {
value: 48574640
}
This endpoint can be customized; see Spark Metrics for details.
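Since the gauge names embed the driver and job ids, it is easier to match on the name suffix than to hard-code them. A sketch under those assumptions (the sample payload is made up; in a live application you would fetch `http://<driver-host>:<ui-port>/metrics/json` with `urlopen` and parse the JSON the same way):

```python
def total_processed_records(metrics):
    """Sum every gauge whose name ends with the streaming
    totalProcessedRecords suffix (one gauge per streaming job)."""
    suffix = "StreamingMetrics.streaming.totalProcessedRecords"
    return sum(gauge["value"]
               for name, gauge in metrics.get("gauges", {}).items()
               if name.endswith(suffix))

# Sample payload shaped like /metrics/json output (ids are made up):
sample = {
    "gauges": {
        "app-20161104125052-0052.driver.MyJob.StreamingMetrics"
        ".streaming.totalProcessedRecords": {"value": 48574640}
    }
}
print(total_processed_records(sample))  # 48574640
```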