Basics of Kafka and its terminology
What is Kafka?
Kafka is a Distributed Streaming Platform or a Distributed Commit Log
Let’s unpack this jargon.
Distributed
Kafka operates as a cluster of one or more servers (called nodes) that can be spread across multiple data centers. This setup allows data and workload to be distributed across different nodes in the cluster, making Kafka scalable, highly available, and fault-tolerant.
Example: Imagine a team of workers (nodes) sharing tasks. If one worker fails, others can take over, ensuring the work continues without interruption.
Streaming Platform
Kafka stores data as a continuous stream of records, which can be processed in various ways.
Example: Think of a river (stream) where water (data) flows continuously. We can analyse or use the water at any point along the river.
Commit Log
When data is sent to Kafka, it is appended to a stream of records, much like adding entries to a log file. If we’re familiar with databases, this is similar to a Write-Ahead Log (WAL). This stream can be replayed or read from any point in time.
Example: Imagine writing in a diary. Each new entry is added to the end, and we can go back and read any previous entry whenever we want.
[Q 1.1] Is Kafka a message queue?
Kafka can function as a message queue, but it's much more than that. It can act as:
1. A FIFO (First-In-First-Out) queue
2. A Pub/Sub (Publish-Subscribe) messaging system
3. A real-time streaming platform
4. Even a database (due to its durable storage capabilities)
However, Kafka is most commonly used for:
1. Building real-time data pipelines (transferring data between systems)
2. Processing continuously flowing data
3. Creating event-driven systems
[Q 1.2] How is Kafka different from RabbitMQ?
Kafka and RabbitMQ are both message brokers, but they are designed for different use cases.
Kafka:
1. Uses a log-based architecture where messages are stored persistently and consumed at different times.
2. Consumers pull messages at their own pace.
3. Supports replaying messages (by reading old offsets).
4. Optimized for high throughput (millions of messages/sec).
5. Persists all messages to disk for a configurable retention period, which supports at-least-once delivery.
RabbitMQ:
1. Messages go into queues, and a consumer removes them after processing.
2. Uses push-based delivery (messages are immediately sent to consumers).
3. Messages are not stored long-term by default.
4. Optimized for low-latency delivery, but with lower throughput than Kafka.
5. Deletes messages after acknowledgment; durability across restarts requires explicitly enabling persistence.
Core Kafka Concepts
1. Message
A message is the smallest unit of data in Kafka.
Example: If we’re building a log monitoring system, each log entry (like a JSON object) is a message. Kafka stores this message as a byte array.
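As a quick sketch of that last point: a record's payload is ultimately just a byte array (the JSON content and topic name here are illustrative):

import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.charset.StandardCharsets;

public class MessageAsBytes {
    public static void main(String[] args) {
        // A log entry serialized to the byte array Kafka actually stores.
        byte[] value = "{\"level\":\"ERROR\",\"msg\":\"disk full\"}"
                .getBytes(StandardCharsets.UTF_8);
        // Topic "appLogs", no key, byte-array value.
        ProducerRecord<byte[], byte[]> record =
                new ProducerRecord<>("appLogs", null, value);
        System.out.println(record.value().length + " bytes");
    }
}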
2. Topic
A topic is a logical category or stream of similar messages. Think of it as a folder where related files (messages) are stored.
Example:
If we have a logging system, we might create topics like:
appLogs: for application logs
ingressLogs: for network logs
dbLogs: for database logs
This way, messages are logically organised, similar to having different tables in a database.
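As a sketch, assuming a broker running on localhost:9092, these topics could be created with Kafka's Java AdminClient (the partition and replication counts are illustrative):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

public class CreateLogTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        try (AdminClient admin = AdminClient.create(props)) {
            // name, number of partitions, replication factor (illustrative values)
            admin.createTopics(List.of(
                    new NewTopic("appLogs", 3, (short) 1),
                    new NewTopic("ingressLogs", 3, (short) 1),
                    new NewTopic("dbLogs", 3, (short) 1)
            )).all().get(); // block until the broker confirms
        }
    }
}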
3. Partitions
A partition is like a shard in a database. It’s how Kafka splits data within a topic to enable scalability.
Example:
Imagine a topic as a large bookshelf. If the shelf gets too full, we split it into smaller sections (partitions). Each section holds a portion of the data.
Partitions allow Kafka to handle large amounts of data by distributing it across multiple nodes. Each message in a partition has an offset, which is like an index number indicating its position in the partition.
[Q 1.3] Do all topics share a single message queue, or does each topic have its own queue?
In Kafka, each topic has its own separate log (queue). Messages from different topics are not mixed into a single queue.
How Kafka Stores Messages:
1. Each topic is independent and maintains its own log files.
2. A topic is divided into partitions, and each partition acts like a separate queue.
3. Kafka does not have a single global message queue for all topics.
Example: Assume we have two topics:
orders-topic (Partition 0) --> [Order1, Order2, Order3, ...]
orders-topic (Partition 1) --> [Order4, Order5, Order6, ...]
payments-topic (Partition 0) --> [Payment1, Payment2, Payment3, ...]
payments-topic (Partition 1) --> [Payment4, Payment5, Payment6, ...]
[Q 1.4] Suppose a topic has 4 partitions. When a message is pushed to Kafka, which partition does it go to?
If a Key is Provided in the Message:
1. Kafka applies a hashing function on the key to determine the partition.
Partition = hash(key) % number_of_partitions
2. This ensures that messages with the same key always go to the same partition (illustrated in the sketch below).
If No Key is Provided:
1. Kafka spreads messages evenly across partitions: round-robin in older client versions, a sticky (batch-at-a-time) strategy since Kafka 2.4.
2. The producer’s partitioner chooses partitions in sequence to balance the load.
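A minimal sketch of the keyed rule, using Java's hashCode() purely for illustration (Kafka's real default partitioner hashes the serialized key bytes with murmur2, but the modulo step is the same):

public class PartitionChooser {
    // Illustration only: Kafka's default partitioner applies murmur2 to the
    // serialized key bytes; plain hashCode() stands in for it here.
    static int choosePartition(String key, int numPartitions) {
        // Mask the sign bit so the result is always non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // Same key, same partition, every time: that is what preserves order.
        System.out.println(choosePartition("server-42", 4));
    }
}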
4. Producer
A producer is a Kafka client that sends messages to a topic. It also decides which partition to send the message to, based on:
- No key: Messages are spread evenly across partitions (round-robin or sticky, as described above).
- Key specified: Messages with the same key go to the same partition (useful for maintaining order).
- Custom logic: We can define rules for partitioning.
Example:
If we’re logging data from different servers, we might use the server ID as the key. This ensures all logs from the same server go to the same partition, maintaining order.
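A minimal producer sketch for that logging scenario, assuming a local broker and the appLogs topic from earlier (the server ID is the message key, so each server's logs land on one partition):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = server ID, so every log line from server-42
            // hashes to the same partition and stays in order.
            producer.send(new ProducerRecord<>(
                    "appLogs", "server-42", "{\"level\":\"INFO\",\"msg\":\"started\"}"));
        } // close() flushes any buffered messages
    }
}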
5. Consumer
A consumer reads messages from partitions in the order they were written. It keeps track of the last message it read using an offset.
Example:
Imagine reading a book. We bookmark the last page we read (offset). If we stop and come back later, we can resume from where we left off.
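A matching consumer sketch (group and topic names carried over from the earlier examples; the poll loop comes up again under Consumer Configurations). The offset printed for each record is the "bookmark" from the analogy:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class LogConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "log-readers");             // illustrative group name
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("appLogs"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(),
                            record.key(), record.value());
                }
            }
        }
    }
}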
[Q 1.5] Suppose I have 2 consumers, one awake and one sleeping. When a message is pushed to the topic, what happens? Does the sleeping consumer miss the message?
Scenario 1: Both Consumers Are in the Same Consumer Group
1. With two consumers in the same consumer group, Kafka splits the topic's partitions between the active consumers.
2. If one consumer is asleep (inactive), Kafka won't assign partitions to it; all messages are handled by the active consumer.
3. If the sleeping consumer wakes up later, it resumes from the last committed offset (assuming offsets were being committed).
Scenario 2: Consumers Belong to Different Consumer Groups
1. If the two consumers are in separate consumer groups, Kafka delivers the message to both (assuming each group has subscribed to the topic).
2. The sleeping consumer does not miss the message; when it wakes up, it can start reading from where it last left off.
Final Outcome:
If both consumers are in the same consumer group, only the active one processes the message.
If they are in different consumer groups, both can receive the message when they become active.
6. Consumer Group
A consumer group is a set of consumers working together to read messages from a topic.
Key Points:
- Fan-out: Multiple consumer groups can read from the same topic.
- Order guarantee: Each partition is read by only one consumer in a group, ensuring messages are processed in order.
Example:
If we’re building an OTP service, one consumer group can send SMS, and another can send emails, both reading from the same topic.
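In code, the only thing separating the two groups is the group.id setting; a sketch (the otp topic and group names are illustrative):

import java.util.Properties;

public class OtpGroups {
    // Consumer properties for a given group; everything but group.id is shared.
    static Properties consumerProps(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", groupId);
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        // Two groups subscribed to the same "otp" topic each receive
        // every message: one group sends SMS, the other sends emails.
        Properties sms = consumerProps("otp-sms-senders");
        Properties email = consumerProps("otp-email-senders");
        // Build one KafkaConsumer per group and subscribe both to "otp",
        // exactly as in the consumer sketch above.
    }
}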
Example:
Suppose the producer produces 6 messages, each a key-value pair: key “A” value “1”, key “C” value “1”, key “B” value “1”, key “C” value “2”, … , key “B” value “2”. (By key I mean the message key we discussed earlier, not a JSON or map key.) Our topic has 3 partitions, and because of key hashing, messages with the same key always land on the same partition: all messages keyed “A” are grouped together, and likewise for “B” and “C”. Since each partition has only one consumer, that consumer receives the messages in order: A1 before A2, and B1 before B2, so ordering is maintained. Going back to our logging-system example, with source node IDs as keys, all logs from node1 always go to the same partition, and so their order is preserved.
This would not be possible if the same partition had multiple consumers from the same group. If consumers in different consumer groups read the same partition, each group still receives the messages in order.
So for 3 partitions, you can have at most 3 consumers in a group; with 4 consumers, one would sit idle. With only 2 consumers, one reads from one partition and the other reads from two. If one of those consumers goes down, the surviving consumer ends up reading from all three partitions, and when consumers are added back the partitions are split between them again. This is called re-balancing (see the sketch below).
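Re-balancing can be observed from inside a consumer by attaching a ConsumerRebalanceListener when subscribing; a fragment, assuming the consumer was built as in the consumer sketch above:

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.util.Collection;
import java.util.List;

public class RebalanceWatcher {
    static void watch(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("appLogs"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Fired before a rebalance takes partitions away from this consumer.
                System.out.println("Revoked: " + partitions);
            }
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Fired after the rebalance with this consumer's new share.
                System.out.println("Assigned: " + partitions);
            }
        });
    }
}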
7. Broker
A broker is a single Kafka server. It receives messages from producers, assigns offsets, and stores them in partitions.
Example:
Think of a broker as a post office that receives letters (messages), sorts them, and stores them in the correct mailbox (partition).
8. Cluster
A Kafka cluster is a group of brokers working together to provide scalability, availability, and fault tolerance.
In a cluster, partitions are replicated across multiple brokers, according to the topic's replication factor, to provide failover. For a topic with replication factor 3, each of its partitions lives on 3 different brokers: one broker acts as the leader for that partition and the other two are followers. Data is always written to the leader and then replicated to the followers. This way we lose neither data nor cluster availability, and if the leader goes down another leader is elected (a creation sketch follows the example below).
Example:
Imagine a team of post offices (brokers) working together. If one post office fails, others can handle its workload.
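Reusing the AdminClient setup from the topic example, and assuming a cluster of at least 3 brokers, such a replication-factor-3 topic would be declared as:

// 3 partitions, replication factor 3: each partition gets one leader
// and two follower replicas spread across three brokers.
admin.createTopics(List.of(new NewTopic("orders", 3, (short) 3))).all().get();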
9. Zookeeper
Zookeeper is a centralized service that manages Kafka’s metadata, such as broker status, topic configurations, and leader elections.
Example:
Think of Zookeeper as a manager who keeps track of all the post offices (brokers) and ensures everything runs smoothly.
Beyond Core Concepts
Producer Configurations
- Fire and forget: Send messages without waiting for acknowledgment (fast but less reliable).
- Synchronous send: Wait for acknowledgment after sending each message (slower but reliable).
- Asynchronous send: Send messages in the background and handle the result in a callback (balanced approach).
Acknowledgment Levels (see the sketch after this list):
- acks=0: No acknowledgment (fastest, messages can be lost).
- acks=1: The leader broker acknowledges (good balance of speed and safety).
- acks=all: All in-sync replicas acknowledge (slowest, most reliable).
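These correspond to the producer's acks setting; a fragment extending the producer properties from the earlier example:

// acks=all: the leader waits for all in-sync replicas before acknowledging.
props.put("acks", "all");   // "0" = no ack, "1" = leader only, "all" = all in-sync replicas
props.put("retries", 3);    // retry transient send failures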
Consumer Configurations
- Poll loop: Consumers constantly poll brokers for new messages.
- Partition assignment strategies: Decide how partitions are assigned to consumers (e.g., round-robin, sticky).
- Offset commit: Consumers track their progress by committing offsets (see the sketch below).
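A sketch of the last point, reusing the consumer from earlier with auto-commit turned off, so the offset "bookmark" is saved only after processing succeeds (process() is a hypothetical application method):

// Disable auto-commit so offsets are saved only when we say so.
props.put("enable.auto.commit", "false");

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record);   // hypothetical application logic
    }
    consumer.commitSync(); // commit offsets for everything polled above
}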