Metrics and Monitoring

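Kafka exposes its metrics through JMX (Java Management Extensions, a technology built into the JVM). Remote JMX access is configured by passing JMX options through the KAFKA_JMX_OPTS environment variable. The metrics can then be forwarded to monitoring and dashboard tools such as Grafana or Splunk.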

KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=localhost"

Kafka also uses Yammer Metrics for metrics reporting in the server.

Common Broker Metrics#

Common broker metrics include:

ACTIVE CONTROLLER COUNT: Is the broker the controller?
REQUEST HANDLER IDLE RATIO: How much load is the broker under?
ALL TOPICS BYTES IN: Do I have enough brokers?
ALL TOPICS BYTES OUT: High consumer traffic?
ALL TOPICS MESSAGES IN: Messages per second?
PARTITION COUNT: Many partitions assigned to a broker?
LEADER COUNT: How many partitions is this broker a leader for?
OFFLINE PARTITIONS: Any partitions without a leader?
REQUEST METRICS: Number of requests to broker?

Monitoring in Java#

In Java, the main monitoring metrics relate to garbage collection (GC). These include:

CollectionCount: Number of GC cycles since the JVM started
CollectionTime: Cumulative time spent in GC cycles, in milliseconds
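
Because both attributes are cumulative, they are usually read as deltas between two samples. The following is a minimal sketch, assuming two readings have already been collected over JMX. The GcSample type and the gc_overhead function are hypothetical names used only for illustration, and CollectionTime is assumed to be in milliseconds.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GcSample:
    """One reading of a garbage collector MBean (hypothetical helper type)."""
    wall_clock_ms: int       # time at which the sample was taken
    collection_count: int    # cumulative CollectionCount
    collection_time_ms: int  # cumulative CollectionTime


def gc_overhead(earlier: GcSample, later: GcSample) -> Tuple[float, float]:
    """Return (fraction of wall-clock time spent in GC, average ms per GC cycle)."""
    elapsed = later.wall_clock_ms - earlier.wall_clock_ms
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    cycles = later.collection_count - earlier.collection_count
    gc_time = later.collection_time_ms - earlier.collection_time_ms
    # No cycles in the interval means there is no average to report.
    avg_per_cycle = gc_time / cycles if cycles > 0 else 0.0
    return gc_time / elapsed, avg_per_cycle
```

A rising GC fraction or a growing average time per cycle between samples is typically more informative than the raw cumulative values.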

Common Producer Metrics#

Common producer metrics include:

response-rate (global and per broker): Responses (acks) received per second. Sudden changes in this value could signal a problem, though what the problem could be depends on your configuration.
request-rate (global and per broker): Average requests sent per second. Requests can contain multiple records, so this is not the number of records. It does give you part of the overall throughput picture.
request-latency-avg (per broker): Average request latency in ms. High latency could be a sign of performance issues, or just large batches.
outgoing-byte-rate (global and per broker): Bytes sent per second. Good picture of your network throughput. Helps with network planning.
io-wait-time-ns-avg (global only): Average time spent waiting for a socket ready for reads/writes in nanoseconds. High wait times might mean your producers are producing more data than the cluster can accept and process.

Common Consumer Metrics#

Some important consumer metrics include:

records-lag-max: Maximum record lag. How far the consumer is behind producers. In a situation where real-time processing is important, high lag might mean you need more consumers.
bytes-consumed-rate: Rate of bytes consumed per second. Gives a good idea of throughput.
records-consumed-rate: Rate of records consumed per second.
fetch-rate: Fetch requests per second. If this falls suddenly or goes to zero, it may be an indication of problems with the consumer.
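
To make records-lag-max concrete: the lag of a partition is the gap between its log end offset and the consumer's current position, and the metric reports the largest such gap. The sketch below works under that assumption. The function name and the offset dictionaries are hypothetical inputs, not a Kafka client API.

```python
from typing import Dict


def max_record_lag(log_end_offsets: Dict[str, int],
                   consumer_positions: Dict[str, int]) -> int:
    """Largest per-partition gap between the log end and the consumer position."""
    lags = []
    for partition, end_offset in log_end_offsets.items():
        # A partition with no recorded position is treated as unread from
        # offset 0 (an assumption made for this sketch).
        position = consumer_positions.get(partition, 0)
        lags.append(max(end_offset - position, 0))
    return max(lags, default=0)
```

For example, with end offsets of 1500, 900, and 40 and consumer positions of 1450, 300, and none, the per-partition lags are 50, 600, and 40, so the maximum lag is 600.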

Diagnosis#

To diagnose an unbalanced load or under-replicated partitions, consider the following metrics:

Partition and leader partition count
For all topics:
- Messages in rate
- Bytes in rate
- Bytes out rate

In a balanced cluster, these numbers should be roughly even across all brokers. If they are not, move partitions with the kafka-reassign-partitions.sh tool.
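
One way to judge whether the numbers are even is to compare each broker's value with the cluster mean. This is a minimal sketch; the function name, the 20% tolerance, and the sample figures are illustrative assumptions, not Kafka defaults.

```python
from typing import Dict, List


def unbalanced_brokers(metric_by_broker: Dict[int, float],
                       tolerance: float = 0.20) -> List[int]:
    """Return IDs of brokers whose value deviates from the mean by more than tolerance."""
    if not metric_by_broker:
        return []
    mean = sum(metric_by_broker.values()) / len(metric_by_broker)
    if mean == 0:
        return []
    return sorted(
        broker_id
        for broker_id, value in metric_by_broker.items()
        if abs(value - mean) / mean > tolerance
    )


# Hypothetical leader counts keyed by broker ID.
leader_counts = {1: 100, 2: 104, 3: 40, 4: 156}
outliers = unbalanced_brokers(leader_counts)  # brokers 3 and 4 are far from the mean of 100
```

Running the same check on partition count, leader count, and the bytes in rate gives a rough picture of which brokers to target when reassigning partitions.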

Other broker problems, such as hardware failures, can reduce a broker's ability to serve requests. Metrics to investigate are:

CPU utilization
Inbound and outbound network throughput
Disk average wait time
Disk percentage utilization

Analyze these as traffic increases. For the cluster as a whole, monitor the all topics bytes in rate metric.