Kafka Storage and Data Handling#

File formats and Indexes#

Data is written to a selected directory, determined by the log.dirs configuration in server.properties. As more messages are produced, this directory grows, and new files (log segments) are rolled over time. For instance, a partition folder named test-0 will typically contain:

$ ls /tmp/kafka-logs-0/test-0 
-rw-r--r--@  1 wheel    10M Mar 15 12:02 00000000000000000000.index
-rw-r--r--@  1 wheel   557B Mar 15 14:20 00000000000000000000.log
-rw-r--r--@  1 wheel    10M Mar 15 12:02 00000000000000000000.timeindex
-rw-r--r--@  1 wheel    56B Mar 15 12:02 00000000000000000004.snapshot
-rw-r--r--   1 wheel    16B Mar 15 14:19 leader-epoch-checkpoint
-rw-r--r--@  1 wheel    43B Mar 13 15:49 partition.metadata
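The contents of a segment file can be inspected with the kafka-dump-log tool that ships with Kafka. A minimal sketch, reusing the example path from the listing above:

```shell
# Dump the records inside a log segment (read-only; safe to run on a live broker).
# The segment path matches the example partition directory shown above.
kafka-dump-log --files /tmp/kafka-logs-0/test-0/00000000000000000000.log \
  --print-data-log
```

The same tool accepts .index and .timeindex files, which is useful for checking how offsets map to physical positions in the segment.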

In order to enable log compaction, pass an additional configuration flag when creating a topic: --config cleanup.policy=compact. The parameters --config min.cleanable.dirty.ratio and --config segment.ms determine how often compaction is performed (compaction frequency and segment roll time, respectively).
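Putting those flags together, a compacted topic can be created as follows. This is a sketch: the topic name, broker address, and parameter values are illustrative assumptions, not values from the text.

```shell
# Create a compacted topic (topic name and broker address are assumptions).
# min.cleanable.dirty.ratio=0.5: compact once half the log is "dirty".
# segment.ms=60000: roll a new segment at least every 60 seconds.
kafka-topics --bootstrap-server localhost:9092 --create \
  --topic user-profiles --partitions 3 --replication-factor 1 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.5 \
  --config segment.ms=60000
```

Note that compaction never touches the active segment, so a small segment.ms is often used in demos to see compaction kick in quickly.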

File/Storage Administration#

The default retention period is one week (log.retention.hours=168); size-based retention can be configured per partition with log.retention.bytes, and segments roll at 1 GB by default (log.segment.bytes). As more messages are produced, segments accumulate, each covering a contiguous range of offsets. Only one segment is written to at a time, known as the **active segment**. An active segment is never deleted. In addition, partitions are distributed to brokers through a **round-robin** distribution. Example:

  • Partition 0 -> Broker 1

  • Partition 1 -> Broker 2

  • Partition 2 -> Broker 3
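Retention can also be tuned per topic at runtime with the kafka-configs tool. A minimal sketch, where the broker address and the one-week/1 GB values are illustrative assumptions:

```shell
# Override retention for topic "test" (values are assumptions for illustration):
# retention.ms=604800000  -> 7 days in milliseconds
# retention.bytes=1073741824 -> 1 GB per partition
kafka-configs --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name test \
  --add-config retention.ms=604800000,retention.bytes=1073741824
```

Topic-level overrides take precedence over the broker-wide defaults in server.properties.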

Storage Structures#

Kafka supports several storage structures, including:

  • Lambda architecture: Batch processing and Operational workloads – Real-time and historical view

  • Kappa architecture: V1 and V2 in parallel – Helps with migration or transition to continuous jobs

  • Multiple consumption: Multi-cluster – Replicates clusters for scale or ingesting from different Kafka clusters

Multi-cluster architectures#

Several architectures involving multiple clusters exist. N.B.: Brokers of the same cluster must be in the same region!

  • Multiple clusters may be located in different geographical locations, connecting to a central Kafka cluster (*Hub-and-Spoke*) (many to one). Relevant for apps that need data from multiple sources/locations.

    • Data that can be completely separated should be kept locally

  • Multiple Kafka clusters may be communicating with each other (Active-active).

    • Produce and consume events across multiple clusters

    • Most scalable and cost-efficient architecture.

    • May result in a replication loop, with events going back and forth endlessly unless cycles are prevented

  • A backup architecture (Active-Standby), with a “main” production cluster and one (or multiple) failover clusters.

    • Inactive copy is a cold copy.

Example: MirrorMaker#

MirrorMaker is a framework for replicating data between two datacenters/clusters. It uses a single shared producer to send all events to the target cluster, but runs several consumers depending on the source topics (3 topics -> 3 consumers).

To perform the process:

kafka-mirror-maker --consumer.config config/consumer.properties --producer.config config/producer.properties --new.consumer --num.streams=2 --whitelist ".*"

MirrorMaker can also be run as a Docker container. It is generally recommended to run it at the destination (target) cluster, so that it consumes over the wide-area link and produces locally.
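Newer Kafka releases ship MirrorMaker 2, built on Kafka Connect, which is configured through a properties file rather than separate consumer/producer configs. A minimal sketch, where the cluster aliases and broker addresses are assumptions:

```properties
# mm2.properties — minimal sketch; aliases and addresses are assumptions
clusters = source, target
source.bootstrap.servers = source-broker:9092
target.bootstrap.servers = target-broker:9092

# Enable one-way replication of all topics from source to target
source->target.enabled = true
source->target.topics = .*
```

This file is then passed to the connect-mirror-maker launcher script bundled with Kafka.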