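A few days ago I posted an article on monitoring a storm cluster with ganglia. This post covers monitoring a mongodb cluster with ganglia, because we want ganglia to be the single monitoring system for everything.
1. The ganglia extension mechanism
    To monitor a mongodb cluster with ganglia, you first need to understand how ganglia can be extended. Ganglia plugins give us two ways to extend its monitoring: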

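    1) Adding in-band plugins, mainly through the gmetric command.

    This is the commonly used approach: a cron job calls ganglia's gmetric command to feed data to gmond, which brings that data into the unified monitoring view. It is simple and works well for a small number of metrics, but with large-scale custom monitoring the data becomes hard to manage in one place.

    2) Adding extra scripts that monitor the system, mainly through the C or Python interface.

    Since ganglia 3.1.x there is a C interface and a Python interface for writing custom data-collection modules. These modules are plugged directly into gmond to monitor user-defined applications.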
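
    To make the first approach concrete, here is a minimal sketch of what a cron-driven script could look like. It assumes gmetric is on the PATH; the metric name, value and units are made up for illustration, and a real job would measure the value before publishing it.

```python
import subprocess


def build_gmetric_command(name, value, value_type="uint32", units=""):
    """Build the argv list for one gmetric call (no shell involved)."""
    cmd = [
        "gmetric",
        "--name=%s" % name,
        "--value=%s" % value,
        "--type=%s" % value_type,
    ]
    if units:
        cmd.append("--units=%s" % units)
    return cmd


def publish(name, value, value_type="uint32", units=""):
    """Send one sample to gmond; requires the gmetric binary on PATH."""
    subprocess.check_call(build_gmetric_command(name, value, value_type, units))


if __name__ == "__main__":
    # Hypothetical metric: only print the command instead of running it.
    print(" ".join(build_gmetric_command(
        "mongodb_connections_current", 42, "uint32", "Connections")))
```

    Every metric published this way needs its own cron entry or script, which is why this approach gets hard to manage as the number of metrics grows.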
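2. Monitoring mongodb with a Python script
    We monitor the mongodb cluster with a Python script because extending ganglia through Python is convenient: to collect more data you only add the metrics to the corresponding .py script, and the result is easy to extend and to port.
2.1 Environment setup
    To extend ganglia with a Python script, first check whether the modpython.so file exists. This file is the shared library through which ganglia calls Python, and it must be compiled and installed before you can develop ganglia plugins against the Python interface. modpython.so lives in lib/ganglia/ (or lib64/ganglia/) under the ganglia installation directory. If it exists, you can go on to write the scripts below. If it does not, you need to recompile and reinstall gmond with the "--with-python" option.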
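
    The check can be scripted. The sketch below assumes ganglia is installed under /usr/local/ganglia (the prefix that appears in the configuration files later in this post); find_modpython is a hypothetical helper name, not part of ganglia.

```python
import os


def find_modpython(prefix="/usr/local/ganglia"):
    """Return the path of modpython.so under prefix, or None if it is missing."""
    for libdir in ("lib64", "lib"):
        candidate = os.path.join(prefix, libdir, "ganglia", "modpython.so")
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    path = find_modpython()
    if path:
        print("found %s" % path)
    else:
        print("modpython.so not found; rebuild gmond with --with-python")
```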
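2.2 Writing the monitoring scripts
    Open etc/gmond.conf under the ganglia installation directory. The client-side configuration contains the line include ("/usr/local/ganglia/etc/conf.d/*.conf"), which means gmond scans that directory for monitoring configuration files. Monitoring configuration therefore goes into etc/conf.d/ as XX.conf, and the configuration for monitoring mongodb is named after mongodb. (Python module configurations use the .pyconf suffix, as the include line in modpython.conf shows, so the file created in step 2 is mongodb.pyconf.)
    1) Look at the modpython.conf file
    modpython.conf is in the etc/conf.d/ directory. Its contents are: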

modules {
  module {
    name = "python_module"   # name of the main module
    path = "modpython.so"    # shared library ganglia needs to run Python extension scripts
    params = "/usr/local/ganglia/lib64/ganglia/python_modules"     # directory that holds the Python scripts
  }
}

include ("/usr/local/ganglia/etc/conf.d/*.pyconf")     # path of the extension configuration files
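    So to extend ganglia with Python for mongodb monitoring, we put the configuration file and the .py script in the corresponding directories and restart the ganglia service. The rest of this section shows how to write both files.
    2) Create the mongodb.pyconf file
    Note that you need root privileges to create and edit this file. Save it in the conf.d directory. For the mongodb metrics that can be collected, see https://github.com/ganglia/gmond_python_modules/tree/master/mongodb and add or remove metrics to suit your needs.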

modules {
  module {
    name = "mongodb"   # module name; must match the name of the Python script stored in the python_modules directory (e.g. "/usr/lib64/ganglia/python_modules")
    language = "python"   # the module is written in Python
    # Parameter list; all params are passed as one dict (i.e. a map) to the script's metric_init(params) function.
    param server_status {
        value = "/path/to/mongo --host host --port 27017 --quiet --eval 'printjson(db.serverStatus())'"
    }
    param rs_status {
        value = "/path/to/mongo --host host --port 27017 --quiet --eval 'printjson(rs.status())'"
    }
  }
}

# List of metrics to collect; one module can expose any number of metrics.
collection_group {
  collect_every = 30
  time_threshold = 90 # maximum interval between sends
  metric {
    name = "mongodb_opcounters_insert" # metric name as defined in the module
    title = "Inserts" # title shown in the web interface
  }
  metric {
    name = "mongodb_opcounters_query"
    title = "Queries"
  }
  metric {
    name = "mongodb_opcounters_update"
    title = "Updates"
  }
  metric {
    name = "mongodb_opcounters_delete"
    title = "Deletes"
  }
  metric {
    name = "mongodb_opcounters_getmore"
    title = "Getmores"
  }
  metric {
    name = "mongodb_opcounters_command"
    title = "Commands"
  }
  metric {
    name = "mongodb_backgroundFlushing_flushes"
    title = "Flushes"
  }
  metric {
    name = "mongodb_mem_mapped"
    title = "Memory-mapped Data"
  }
  metric {
    name = "mongodb_mem_virtual"
    title = "Process Virtual Size"
  }
  metric {
    name = "mongodb_mem_resident"
    title = "Process Resident Size"
  }
  metric {
    name = "mongodb_extra_info_page_faults"
    title = "Page Faults"
  }
  metric {
    name = "mongodb_globalLock_ratio"
    title = "Global Write Lock Ratio"
  }
  metric {
    name = "mongodb_indexCounters_btree_miss_ratio"
    title = "BTree Page Miss Ratio"
  }
  metric {
    name = "mongodb_globalLock_currentQueue_total"
    title = "Total Operations Waiting for Lock"
  }
  metric {
    name = "mongodb_globalLock_currentQueue_readers"
    title = "Readers Waiting for Lock"
  }
  metric {
    name = "mongodb_globalLock_currentQueue_writers"
    title = "Writers Waiting for Lock"
  }
  metric {
    name = "mongodb_globalLock_activeClients_total"
    title = "Total Active Clients"
  }
  metric {
    name = "mongodb_globalLock_activeClients_readers"
    title = "Active Readers"
  }
  metric {
    name = "mongodb_globalLock_activeClients_writers"
    title = "Active Writers"
  }
  metric {
    name = "mongodb_connections_current"
    title = "Open Connections"
  }
  metric {
    name = "mongodb_connections_current_ratio"
    title = "Percentage of Connections Used"
  }
  metric {
    name = "mongodb_slave_delay"
    title = "Replica Set Slave Delay"
  }
  metric {
    name = "mongodb_asserts_total"
    title = "Asserts per Second"
  }
}
    As you can see, this configuration file uses the same syntax as gmond.conf, so if anything is unclear you can refer to how gmond.conf is written.
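
    Because the metric entries are so repetitive, the collection_group block can also be generated instead of typed by hand. The following is only a sketch: render_collection_group is a hypothetical helper, and it emits just the name and title fields used in the file above.

```python
def render_collection_group(metrics, collect_every=30, time_threshold=90):
    """Render a collection_group block from (name, title) pairs."""
    lines = [
        "collection_group {",
        "  collect_every = %d" % collect_every,
        "  time_threshold = %d" % time_threshold,
    ]
    for name, title in metrics:
        lines.append("  metric {")
        lines.append('    name = "%s"' % name)
        lines.append('    title = "%s"' % title)
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_collection_group([
        ("mongodb_opcounters_insert", "Inserts"),
        ("mongodb_opcounters_query", "Queries"),
    ]))
```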
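    3) Create the mongodb.py script
    Save mongodb.py in the lib64/ganglia/python_modules directory. That directory already holds many Python scripts, for example for monitoring disk, memory, network, mysql and redis, and we can use them as references when writing mongodb.py. If you open a few of them you will see that each one defines a function metric_init(params). As mentioned above, the parameters from mongodb.pyconf are passed to that function.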
    

#!/usr/bin/env python
import json
import os
import re
import socket
import string
import time
import copy
from functools import reduce  # a built-in on Python 2, needs this import on Python 3

NAME_PREFIX = 'mongodb_'
# Default shell commands; metric_init() overrides them with the values from mongodb.pyconf
PARAMS = {
    'server_status' : '/path/to/bin/mongo --host host --port 27017 --quiet --eval "printjson(db.serverStatus())"',
    'rs_status' : '/path/to/bin/mongo --host host --port 27017 --quiet --eval "printjson(rs.status())"'
}
# Cache of the most recent sample and of the one before it (used to compute rates)
METRICS = {
    'time' : 0,
    'data' : {}
}
LAST_METRICS = copy.deepcopy(METRICS)
METRICS_CACHE_TTL = 3

def flatten(d, pre = '', sep = '_'):
    """Flatten a dict (i.e. dict['a']['b']['c'] => dict['a_b_c'])"""
    new_d = {}
    for k,v in d.items():
        if isinstance(v, dict):
            new_d.update(flatten(d[k], '%s%s%s' % (pre, k, sep)))
        else:
            new_d['%s%s' % (pre, k)] = v
    return new_d

def get_metrics():
    """Return all metrics"""
    global METRICS, LAST_METRICS
    # only query mongo again when the cached sample is older than the TTL
    if (time.time() - METRICS['time']) > METRICS_CACHE_TTL:
        metrics = {}
        for status_type in PARAMS.keys():
            # get raw metric data
            o = os.popen(PARAMS[status_type])
            # clean up
            metrics_str = ''.join(o.readlines()).strip() # convert to string
            metrics_str = re.sub(r'\w+\((.*)\)', r"\1", metrics_str) # remove functions
            # convert to flattened dict
            try:
                if status_type == 'server_status':
                    metrics.update(flatten(json.loads(metrics_str)))
                else:
                    metrics.update(flatten(json.loads(metrics_str), pre='%s_' % status_type))
            except ValueError:
                metrics = {}

        # update cache
        LAST_METRICS = copy.deepcopy(METRICS)
        METRICS = {
            'time': time.time(),
            'data': metrics
        }
    return [METRICS, LAST_METRICS]

def get_value(name):
    """Return a value for the requested metric"""
    # get metrics
    metrics = get_metrics()[0]
    # get value
    name = name[len(NAME_PREFIX):] # remove prefix from name
    try:
        result = metrics['data'][name]
    except Exception:
        result = 0
    return result

def get_rate(name):
    """Return change over time for the requested metric"""
    # get metrics
    [curr_metrics, last_metrics] = get_metrics()
    # get rate
    name = name[len(NAME_PREFIX):] # remove prefix from name
    try:
        rate = float(curr_metrics['data'][name] - last_metrics['data'][name]) / \
        float(curr_metrics['time'] - last_metrics['time'])
        if rate < 0:
            rate = float(0)
    except Exception:
        rate = float(0)
    return rate

def get_opcounter_rate(name):
    """Return change over time for an opcounter metric"""
    master_rate = get_rate(name)
    repl_rate = get_rate(name.replace('opcounters_', 'opcountersRepl_'))
    return master_rate + repl_rate

def get_globalLock_ratio(name):
    """Return the global lock ratio"""
    try:
        result = get_rate(NAME_PREFIX + 'globalLock_lockTime') / \
        get_rate(NAME_PREFIX + 'globalLock_totalTime') * 100
    except ZeroDivisionError:
        result = 0
    return result

def get_indexCounters_btree_miss_ratio(name):
    """Return the btree miss ratio"""
    try:
        result = get_rate(NAME_PREFIX + 'indexCounters_btree_misses') / \
        get_rate(NAME_PREFIX + 'indexCounters_btree_accesses') * 100
    except ZeroDivisionError:
        result = 0
    return result

def get_connections_current_ratio(name):
    """Return the percentage of connections used"""
    try:
        result = float(get_value(NAME_PREFIX + 'connections_current')) / \
        float(get_value(NAME_PREFIX + 'connections_available')) * 100
    except ZeroDivisionError:
        result = 0
    return result

def get_slave_delay(name):
    """Return the replica set slave delay"""
    # get metrics
    metrics = get_metrics()[0]
    # no point checking my optime if i'm not replicating
    if 'rs_status_myState' not in metrics['data'] or metrics['data']['rs_status_myState'] != 2:
        result = 0
    # compare my optime with the master's
    else:
        master = {}
        slave = {}
        # the try block belongs to the else branch, where master and slave are defined
        try:
            for member in metrics['data']['rs_status_members']:
                if member['state'] == 1:
                    master = member
                if member['name'].split(':')[0] == socket.getfqdn():
                    slave = member
            result = max(0, master['optime']['t'] - slave['optime']['t']) / 1000
        except KeyError:
            result = 0
    return result

def get_asserts_total_rate(name):
    """Return the total number of asserts per second"""
    return float(reduce(
        lambda memo, obj: memo + get_rate('%sasserts_%s' % (NAME_PREFIX, obj)),
        ['regular', 'warning', 'msg', 'user', 'rollovers'], 0))

def metric_init(lparams):
    """Initialize metric descriptors"""
    global PARAMS
    # set parameters
    for key in lparams:
        PARAMS[key] = lparams[key]
    # define descriptors
    time_max = 60
    groups = 'mongodb'
    descriptors = [
        {
            'name': NAME_PREFIX + 'opcounters_insert',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Inserts/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Inserts',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_query',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Queries/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Queries',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_update',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Updates/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Updates',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_delete',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Deletes/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Deletes',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_getmore',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Getmores/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Getmores',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_command',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Commands/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Commands',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'backgroundFlushing_flushes',
            'call_back': get_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Flushes/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Flushes',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'mem_mapped',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'MB',
            'slope': 'both',
            'format': '%u',
            'description': 'Memory-mapped Data',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'mem_virtual',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'MB',
            'slope': 'both',
            'format': '%u',
            'description': 'Process Virtual Size',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'mem_resident',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'MB',
            'slope': 'both',
            'format': '%u',
            'description': 'Process Resident Size',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'extra_info_page_faults',
            'call_back': get_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Faults/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Page Faults',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_ratio',
            'call_back': get_globalLock_ratio,
            'time_max': time_max,
            'value_type': 'float',
            'units': '%',
            'slope': 'both',
            'format': '%f',
            'description': 'Global Write Lock Ratio',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'indexCounters_btree_miss_ratio',
            'call_back': get_indexCounters_btree_miss_ratio,
            'time_max': time_max,
            'value_type': 'float',
            'units': '%',
            'slope': 'both',
            'format': '%f',
            'description': 'BTree Page Miss Ratio',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_currentQueue_total',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Operations',
            'slope': 'both',
            'format': '%u',
            'description': 'Total Operations Waiting for Lock',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_currentQueue_readers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Operations',
            'slope': 'both',
            'format': '%u',
            'description': 'Readers Waiting for Lock',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_currentQueue_writers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Operations',
            'slope': 'both',
            'format': '%u',
            'description': 'Writers Waiting for Lock',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_activeClients_total',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Clients',
            'slope': 'both',
            'format': '%u',
            'description': 'Total Active Clients',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_activeClients_readers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Clients',
            'slope': 'both',
            'format': '%u',
            'description': 'Active Readers',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_activeClients_writers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Clients',
            'slope': 'both',
            'format': '%u',
            'description': 'Active Writers',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'connections_current',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Connections',
            'slope': 'both',
            'format': '%u',
            'description': 'Open Connections',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'connections_current_ratio',
            'call_back': get_connections_current_ratio,
            'time_max': time_max,
            'value_type': 'float',
            'units': '%',
            'slope': 'both',
            'format': '%f',
            'description': 'Percentage of Connections Used',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'slave_delay',
            'call_back': get_slave_delay,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Seconds',
            'slope': 'both',
            'format': '%u',
            'description': 'Replica Set Slave Delay',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'asserts_total',
            'call_back': get_asserts_total_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Asserts/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Asserts',
            'groups': groups
        }
    ]
    return descriptors

def metric_cleanup():
    """Cleanup"""
    pass

# the following code is for debugging and testing
if __name__ == '__main__':
    descriptors = metric_init(PARAMS)
    while True:
        for d in descriptors:
            # build e.g. 'mongodb_mem_mapped = %u', then fill in the current value
            print(('%s = %s' % (d['name'], d['format'])) % d['call_back'](d['name']))
        print('')
        time.sleep(METRICS_CACHE_TTL)
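
    The clean-up step in get_metrics() is easy to miss: the mongo shell prints values such as NumberLong(345) or ISODate("..."), which are not valid JSON, so the script strips the wrapping function call before parsing. The sketch below reproduces that step on a made-up sample; SAMPLE and parse_shell_output are names invented here for illustration, and real serverStatus() output has many more fields.

```python
import json
import re


def flatten(d, pre="", sep="_"):
    """Flatten nested dicts: {'a': {'b': 1}} becomes {'a_b': 1}."""
    flat = {}
    for key, value in d.items():
        if isinstance(value, dict):
            flat.update(flatten(value, "%s%s%s" % (pre, key, sep), sep))
        else:
            flat["%s%s" % (pre, key)] = value
    return flat


def parse_shell_output(text):
    """Strip wrappers such as NumberLong(...) and parse the rest as JSON."""
    cleaned = re.sub(r"\w+\((.*)\)", r"\1", text)
    return flatten(json.loads(cleaned))


# Made-up sample in the style of printjson(db.serverStatus()) output.
SAMPLE = """{
    "opcounters" : { "insert" : 12, "query" : NumberLong(345) },
    "mem" : { "resident" : 64 },
    "localTime" : ISODate("2014-10-10T03:52:00Z")
}"""

if __name__ == "__main__":
    for key, value in sorted(parse_shell_output(SAMPLE).items()):
        print("%s = %s" % (key, value))
```

    The flattened keys are what the metric names map to: with the mongodb_ prefix removed, mongodb_opcounters_insert looks up opcounters_insert in this dictionary.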
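    A Python extension script must define two functions: metric_init(params) and metric_cleanup().

    metric_init() is called when the module is initialized and must return either a single metric descriptor dictionary or a list of them; mongodb.py returns a list of dictionaries.

    A metric dictionary is defined as follows:

    d = {'name': '',          # must match the metric name in the pyconf file
         'call_back': callback_function,   # function gmond calls to read the value
         'time_max': int(),
         'value_type': '',
         'units': '',
         'slope': '',
         'format': '',
         'description': ''
        }

    metric_cleanup() is called when the module shuts down and returns nothing.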
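
    Put together, the smallest useful module looks roughly like the sketch below. The metric example_uptime, the group name and the callback are invented for illustration; only the function names metric_init and metric_cleanup and the descriptor keys come from the description above.

```python
import time

PARAMS = {}
START = time.time()


def uptime_callback(name):
    """Callback: receives the metric name and returns the current value."""
    return int(time.time() - START)


def metric_init(params):
    """Called once at load time with the param values from the .pyconf file."""
    PARAMS.update(params)
    return [{
        'name': 'example_uptime',   # must match the metric name in the .pyconf
        'call_back': uptime_callback,
        'time_max': 60,
        'value_type': 'uint',
        'units': 'Seconds',
        'slope': 'both',
        'format': '%u',
        'description': 'Seconds since the module was loaded',
        'groups': 'example',
    }]


def metric_cleanup():
    """Called when the module shuts down; nothing to release here."""
    pass


if __name__ == '__main__':
    # Same idea as the debug block in mongodb.py: run the module by hand.
    for d in metric_init({'greeting': 'hello'}):
        print(('%s = ' + d['format']) % (d['name'], d['call_back'](d['name'])))
```

    Running a module directly like this, before handing it to gmond, is a quick way to catch import errors and broken callbacks.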
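    4) View the monitoring statistics in the web front end
    Once the scripts are written, restart the gmond service. The new mongodb metrics should then show up in the ganglia web interface.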
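
    If the graphs do not appear, it can help to check what gmond itself reports before looking at the web side. The sketch below assumes gmond accepts TCP connections on its default port 8649 and answers with an XML dump containing METRIC elements with NAME and VAL attributes; the host name, the sample XML and the helper names are illustrative assumptions.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump gmond writes to a client that connects to it."""
    chunks = []
    sock = socket.create_connection((host, port), timeout)
    try:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    return b"".join(chunks)


def metrics_with_prefix(xml_bytes, prefix="mongodb_"):
    """Return {metric name: value string} for metrics whose name has prefix."""
    root = ET.fromstring(xml_bytes)
    return dict(
        (m.get("NAME"), m.get("VAL"))
        for m in root.iter("METRIC")
        if m.get("NAME", "").startswith(prefix)
    )


# Made-up, heavily shortened sample of a gmond XML dump.
SAMPLE_XML = b"""<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
<CLUSTER NAME="example">
<HOST NAME="db1">
<METRIC NAME="mongodb_connections_current" VAL="12" TYPE="uint32" UNITS="Connections"/>
<METRIC NAME="load_one" VAL="0.10" TYPE="float" UNITS=""/>
</HOST>
</CLUSTER>
</GANGLIA_XML>"""

if __name__ == "__main__":
    for name, val in sorted(metrics_with_prefix(SAMPLE_XML).items()):
        print("%s = %s" % (name, val))
```

    Against a live gmond, pass read_gmond_xml() instead of SAMPLE_XML. If no mongodb_ metrics come back, the problem is in the module or its pyconf rather than in the web front end.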

10-10 03:52