Apache Kafka Terminologies & Concepts

Image for post
Image for post

1. Publish / Subscribe Messaging

2. What is Kafka?

One of the best features of Kafka is, it is highly available and resilient to node failures and supports automatic recovery. This feature makes Kafka ideal for communication and integration between components of large-scale data systems in real-world data systems.

3. Messages, Keys, and Batches

For efficiency, messages are written into Kafka in batches. A batch is just a collection of messages, all of which are being produced to the same topic and partition. An individual roundtrip across the network for each message would result in excessive overhead, and collecting messages together into a batch reduces this. Of course, this is a tradeoff between latency and throughput: the larger the batches, the more messages that can be handled per unit of time, but the longer it takes an individual message to propagate.

4. Topics and Partitions

Image for post
Image for post

Messages in Kafka are categorized into topics that is an Entity in Kafka with a name. The closest analogies for a topic are a database table or a folder in a filesystem. Topics are additionally broken down into a number of partitions.

Image for post
Image for post

A partition is a single log and each topic in general, can have one or more partitions. Messages are written to it in an append-only fashion, and are read in order from beginning to end. Each partition is an ordered, immutable sequence of records. Kafka will assign a sequential number called offset to each message that is written. Note that as a topic typically has multiple partitions that is independent with each other, so there is no guarantee of message time-ordering across the entire topic, just within a single partition. Partitions are also the way that Kafka provides redundancy and scalability. Each partition can be hosted on a different server, which means that a single topic can be scaled horizontally across multiple servers to provide performance far beyond the ability of a single server.

5. Producers and Consumers

Image for post
Image for post

Producers behavior is to create new messages. In general, a message will be produced to a specific topic. By default, the producer doesn’t care which partition a specific message is written to and will balance messages over all partitions of a topic evenly. In some cases, the producer will direct messages to specific partitions. This is typically done using the message key and a partitioner that will generate a hash of the key and map it to a specific partition.

Image for post
Image for post

Consumers behavior is to read messages. The consumer subscribes to one or more topics and reads the messages in the order in which they were produced. The consumer keeps track of which messages it has already consumed by keeping track of the offset of messages. The offset is another bit of metadata that is an integer value that continually increases that Kafka adds to each message as it is produced. By storing the offset of the last consumed message for each partition, either in Zookeeper or in Kafka(in a topic called __consumer_offsets) itself, a consumer can stop and restart without losing its place.

Image for post
Image for post

Consumers work as part of a consumer group, which is one or more consumers that work together to consume a topic. In this way, consumers can horizontally scale to consume topics with a large number of messages. Additionally, if a single consumer fails, the remaining members of the group will rebalance the partitions being consumed to take over for the missing member. Consumer groups are used for scalable message consumption. Each different application will have a unique consumer group. Kafka Broker manages the consumer group.

Image for post
Image for post

6. Brokers and Clusters

Image for post
Image for post

Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one broker will also function as the cluster controller(elected automatically from the live members of the cluster). The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures. A partition is owned by a single broker in the cluster, and that broker is called the leader of the partition. A partition may be assigned to multiple brokers, which will result in the partition being replicated. This provides redundancy of messages in the partition, such that another broker can take over leadership if there is a broker failure. However, all consumers and producers operating on that partition must connect to the leader.

A key feature of Kafka is that of retention, which is the durable storage of messages for some period of time. Kafka brokers are configured with a default retention setting for topics, either retaining messages for some period of time(7 days) or until the topic reaches a certain size in bytes(1 GB). Once these limits are reached, messages are expired and deleted so that the retention configuration is in a minimum amount of data available at any time. Individual topics can also be configured with their own retention settings so that messages are stored for only as long as they are useful.

7. Kafka as a Distributed Streaming System

What is a Distributed System ?

Image for post
Image for post

Key characteristics of distributed system

  • Reliable Work Distribution: The requests that is received from the clients in general are equally distributed between the available systems.
  • Easily Scalable: Adding new systems to the existing setup is easy.

How Kafka Cluster distributes the client requests

Image for post
Image for post

First create request to create a new topic to Apache Zookeeper. Then Zookeeper takes care of redirecting the request to the controller. The role of this controller is to distribute the ownership of the partitions to the available broker. This concept of distributing partitions to the broker is called leader assignment. So the topic is distributed across the Kafka Cluster.

2. How Kafka distributes client requests: Kafka Producer

Kafka Producer’s produce messages and send it to the partitioner. The partitioner checks which partition this message should be sent, and sends it to the leader of the determined partition. After that the message is persisted into the file system of the Broker.

Image for post
Image for post

The client request from the producer are distributed between the brokers based on the partition. Which indirectly means that the load is distributed between the brokers.

3. How Kafka distributes client requests: Kafka Consumer

Kafka Consumers’s pull the messages from its configured topic. The request goes to all of the partitions and retrieve the messages from them. The retrieved messages are handed over to the consumer, and the consumer process the messages successfully.

Image for post
Image for post

Even from the consumer perspective, the request to retrieve data are distributed between brokers. Basically the client will only go to the partition leader of the topic and retrieve the data.

Summary

  • Clients will only invoke the leader of the partition to produce and consume data.
  • Load is evenly distributed between the brokers.

8. How Kafka handles data loss ?

Image for post
Image for post

For example, when the Kafka Producer sends a message to Partition 0, then Broker 1 will receive the message and Broker 1 will be the leader replica of Partition 0. Now there is one copy of the actual message, but since the Replication Factor(number of copies of the message)is three, there should be two more copies of the same message. So rest of the Brokers will replicate the Partitions to themselves, and it is called the follower replica.

Lets say that Broker 1 is down. But still the data of the partition is available in broker 2 and broker 3, so there will be no data loss. Zookeeper gets notified about the failure and it assigns the new leader to the controller.

Backend Developer

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store