ZooKeeper Monitor Guide
New Metrics System
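The New Metrics System feature has been available since 3.6.0. It provides a rich set of metrics to help users monitor ZooKeeper across the following topics: znode, network, disk, quorum, leader election, client, security, failures, watch/session, requestProcessor, and so forth.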
Metrics
All the metrics are defined in ServerMetrics.java.
Prometheus
- Running a Prometheus monitoring service is the easiest way to ingest and record ZooKeeper's metrics.
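- Prerequisites:
  - Enable the Prometheus MetricsProvider by setting metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider in zoo.cfg.
  - The port is also configurable by setting metricsProvider.httpPort (default value: 7000).
- Install Prometheus: go to the official website download page and download the latest release.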
- Set Prometheus's scraper to target the ZooKeeper cluster endpoints:

      cat > /tmp/test-zk.yaml <<EOF
      global:
        scrape_interval: 10s
      scrape_configs:
        - job_name: test-zk
          static_configs:
          - targets: ['192.168.10.32:7000','192.168.10.33:7000','192.168.10.34:7000']
      EOF
      cat /tmp/test-zk.yaml

- Set up the Prometheus handler:

      nohup /tmp/prometheus \
          --config.file /tmp/test-zk.yaml \
          --web.listen-address ":9090" \
          --storage.tsdb.path "/tmp/test-zk.data" >> /tmp/test-zk.log 2>&1 &

- Now Prometheus will scrape the ZooKeeper metrics every 10 seconds.
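
Before pointing Prometheus at a server, it can help to confirm that the metrics endpoint actually responds. The following is a minimal Python sketch, not part of ZooKeeper or Prometheus: it assumes the provider serves the Prometheus text format at the /metrics path on metricsProvider.httpPort, and the function names parse_metrics and fetch_metrics are hypothetical helpers introduced only for illustration.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text-format lines into {sample name (with labels): float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Simplification: treat the last space-separated field as the value.
        # This assumes there is no trailing timestamp on the line.
        name, _, value = line.rpartition(" ")
        if not name:
            continue
        try:
            samples[name] = float(value)
        except ValueError:
            continue
    return samples


def fetch_metrics(host, port=7000, timeout=5.0):
    """Fetch and parse metrics; the /metrics path is an assumption of this sketch."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Offline demonstration on a hand-written sample (not real server output).
SAMPLE = "# TYPE znode_count gauge\nznode_count 5.0\nnum_alive_connections 2.0\n"
print(parse_metrics(SAMPLE))
```

Against a live server, calling fetch_metrics("192.168.10.32") should return a dictionary that includes names such as znode_count, provided the prerequisites above are in place.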
Alerting with Prometheus
- We recommend that you read the official Prometheus Alerting page to explore some principles of alerting.
- We recommend that you use Prometheus Alertmanager, which offers a more convenient way to receive alerts by email or instant message (via webhook).
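- We provide an alerting example covering the metrics that deserve special attention. Note: this is for reference only; adjust the rules to your actual situation and resource environment.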
  Use ./promtool check rules rules/zk.yml to check the correctness of the config file. Contents of rules/zk.yml:

      groups:
      - name: zk-alert-example
        rules:
        - alert: ZooKeeper server is down
          expr: up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Instance {{ $labels.instance }} ZooKeeper server is down"
            description: "{{ $labels.instance }} of job {{$labels.job}} ZooKeeper server is down: [{{ $value }}]."

        - alert: create too many znodes
          expr: znode_count > 1000000
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} create too many znodes"
            description: "{{ $labels.instance }} of job {{$labels.job}} create too many znodes: [{{ $value }}]."

        - alert: create too many connections
          expr: num_alive_connections > 50 # suppose we use the default maxClientCnxns: 60
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} create too many connections"
            description: "{{ $labels.instance }} of job {{$labels.job}} create too many connections: [{{ $value }}]."

        - alert: znode total occupied memory is too big
          expr: approximate_data_size /1024 /1024 > 1 * 1024 # more than 1024 MB (1 GB)
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} znode total occupied memory is too big"
            description: "{{ $labels.instance }} of job {{$labels.job}} znode total occupied memory is too big: [{{ $value }}] MB."

        - alert: set too many watch
          expr: watch_count > 10000
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} set too many watch"
            description: "{{ $labels.instance }} of job {{$labels.job}} set too many watch: [{{ $value }}]."

        - alert: a leader election happens
          expr: increase(election_time_count[5m]) > 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} a leader election happens"
            description: "{{ $labels.instance }} of job {{$labels.job}} a leader election happens: [{{ $value }}]."

        - alert: open too many files
          expr: open_file_descriptor_count > 300
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} open too many files"
            description: "{{ $labels.instance }} of job {{$labels.job}} open too many files: [{{ $value }}]."

        - alert: fsync time is too long
          expr: rate(fsynctime_sum[1m]) > 100
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} fsync time is too long"
            description: "{{ $labels.instance }} of job {{$labels.job}} fsync time is too long: [{{ $value }}]."

        - alert: take snapshot time is too long
          expr: rate(snapshottime_sum[5m]) > 100
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} take snapshot time is too long"
            description: "{{ $labels.instance }} of job {{$labels.job}} take snapshot time is too long: [{{ $value }}]."

        - alert: avg latency is too high
          expr: avg_latency > 100
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} avg latency is too high"
            description: "{{ $labels.instance }} of job {{$labels.job}} avg latency is too high: [{{ $value }}]."

        - alert: JvmMemoryFillingUp
          expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "JVM memory filling up (instance {{ $labels.instance }})"
            description: "JVM memory is filling up (> 80%)\n labels: {{ $labels }} value = {{ $value }}\n"

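
To make the thresholds in the example rules concrete, here is a rough Python sketch that applies the plain gauge comparisons to a dictionary of sample values. It is an illustration only, not how Prometheus evaluates rules: it ignores the for: duration, skips the rules built on rate() and increase(), and the names GAUGE_RULES and firing_alerts are hypothetical. The data-size rule shows the unit conversion: dividing bytes by 1024 twice gives MiB, so the alert fires above 1024 MiB.

```python
# (alert name, metric name, predicate) for the simple gauge rules in the example.
GAUGE_RULES = [
    ("create too many znodes", "znode_count", lambda v: v > 1_000_000),
    # Assumes the default maxClientCnxns of 60, as the rule comment does.
    ("create too many connections", "num_alive_connections", lambda v: v > 50),
    # Bytes -> MiB, then compare against 1 * 1024 MiB (1 GiB).
    ("znode total occupied memory is too big", "approximate_data_size",
     lambda v: v / 1024 / 1024 > 1 * 1024),
    ("set too many watch", "watch_count", lambda v: v > 10_000),
    ("open too many files", "open_file_descriptor_count", lambda v: v > 300),
    ("avg latency is too high", "avg_latency", lambda v: v > 100),
]


def firing_alerts(samples):
    """Return names of gauge rules whose threshold is exceeded right now.

    Unlike Prometheus, this does not wait for the condition to hold
    for the rule's "for:" duration.
    """
    return [
        name
        for name, metric, exceeds in GAUGE_RULES
        if metric in samples and exceeds(samples[metric])
    ]


# Hand-written sample values, not real server output.
example = {
    "znode_count": 10,
    "num_alive_connections": 55,
    "approximate_data_size": 2 * 1024 ** 3,  # 2 GiB in bytes
}
print(firing_alerts(example))
```

A check like this can be useful for choosing starting thresholds from a snapshot of current values before committing them to rules/zk.yml.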
Grafana
- Grafana has built-in Prometheus support; just add a Prometheus data source:

      Name:   test-zk
      Type:   Prometheus
      Url:    http://localhost:9090
      Access: proxy

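- Then download and import the default ZooKeeper dashboard template and customize it.
- If you have any good improvements, you can request a Grafana dashboard account by emailing dev@zookeeper.apache.org.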
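
The data source fields listed above can also be expressed as a JSON document, which is convenient when the setup is scripted rather than clicked through. The Python sketch below only builds and prints that document. The function name is hypothetical, and using this payload with Grafana's data source HTTP API (POST /api/datasources) is an assumption that should be checked against the documentation for your Grafana version.

```python
import json


def prometheus_datasource_payload(name="test-zk", url="http://localhost:9090"):
    """Build a data source description mirroring the four fields above."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",
    }


# Print the JSON body; sending it to Grafana is left out of this sketch.
print(json.dumps(prometheus_datasource_payload(), indent=2))
```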
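InfluxDB
InfluxDB is an open source time series database that is often used to store metrics from ZooKeeper. You can download the open source version or create a free account on InfluxDB Cloud. In either case, configure the Apache Zookeeper Telegraf plugin to start collecting metrics from your ZooKeeper clusters and storing them in your InfluxDB instance. There is also an Apache Zookeeper InfluxDB template that includes the Telegraf configuration and a dashboard to get you set up right away.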
JMX
More details can be found in the ZooKeeper JMX documentation.
Four Letter Words
More details can be found in the Four Letter Words section of the ZooKeeper Administrator's Guide.
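
As a rough illustration of how a four letter word can be used for monitoring, the Python sketch below sends mntr to a server over a plain TCP socket and parses the reply into a dictionary. It assumes the client port is 2181, that mntr is permitted by the server's 4lw.commands.whitelist setting, and that the reply consists of tab-separated key/value lines. The function names four_letter_word and parse_mntr are hypothetical helpers, not part of any ZooKeeper client library.

```python
import socket


def four_letter_word(host, command="mntr", port=2181, timeout=5.0):
    """Send a four letter word and return the server's full text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        # The server closes the connection after replying, so read until EOF.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Parse tab-separated "key<TAB>value" lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # ignore lines without a tab, such as error messages
            stats[key.strip()] = value.strip()
    return stats


# Offline demonstration on a hand-written sample (not real server output).
SAMPLE = "zk_avg_latency\t0\nzk_server_state\tleader\n"
print(parse_mntr(SAMPLE))
```

Against a live server, parse_mntr(four_letter_word("192.168.10.32")) would return the current values. Values are kept as strings here because some of them, such as the server state, are not numeric.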