1. Publish / Subscribe Messaging
Publish/subscribe messaging is a pattern in which the sender (publisher) of a piece of data (message) does not direct it to a specific receiver. Instead, the publisher classifies the message somehow, and the receiver (subscriber) subscribes to receive certain classes of messages. Pub/sub systems often have a broker, a central point where messages are published, to facilitate this.
2. What is Kafka?
Apache Kafka is a publish/subscribe messaging system designed as a single centralized system for publishing generic kinds of data. It is often described as a distributed streaming system. Data within Kafka is stored durably, in order, and can be read deterministically. In addition, the data can be distributed within the system to provide additional protection against failures, as well as significant opportunities for scaling performance.
One of Kafka's best features is that it is highly available, resilient to node failures, and supports automatic recovery. This makes Kafka ideal for communication and integration between the components of large-scale, real-world data systems.
3. Messages, Keys, and Batches
The unit of data within Kafka is called a message. A message is simply an array of bytes. A message can carry an optional bit of metadata, referred to as a key. Keys are used when messages need to be written to partitions in a more controlled manner. The simplest such scheme is to generate a consistent hash of the key and then select the partition number for that message by taking the result of the hash modulo the total number of partitions in the topic. This ensures that messages with the same key are always written to the same partition.
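The key-to-partition scheme can be sketched in a few lines. This is a simplified stand-in: Kafka's built-in partitioner actually uses a murmur2 hash, while MD5 is used here only to keep the sketch dependency-free.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Consistent hash of the key, modulo the partition count.
    # (Kafka's default partitioner uses murmur2; MD5 is a stand-in here.)
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Messages with the same key always land in the same partition.
assert partition_for(b"user-42", 6) == partition_for(b"user-42", 6)
```

Note that the mapping only stays stable while the partition count is fixed; adding partitions to a topic changes the modulus and thus where keyed messages land.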
For efficiency, messages are written into Kafka in batches. A batch is just a collection of messages, all of which are being produced to the same topic and partition. An individual roundtrip across the network for each message would result in excessive overhead, and collecting messages together into a batch reduces this. Of course, this is a tradeoff between latency and throughput: the larger the batches, the more messages that can be handled per unit of time, but the longer it takes an individual message to propagate.
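A minimal sketch of this batching tradeoff follows. The `Batcher` class and its size threshold are illustrative, not Kafka's actual producer internals; the real producer batches per topic-partition and is tuned with settings such as `batch.size` and `linger.ms`.

```python
class Batcher:
    """Collects messages for one topic-partition and flushes them together."""
    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.buffer = []
        self.flushed = []  # each entry stands for one network round trip

    def send(self, message: bytes):
        self.buffer.append(message)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            # One round trip carries many messages.
            self.flushed.append(list(self.buffer))
            self.buffer.clear()

b = Batcher(max_batch_size=3)
for i in range(7):
    b.send(f"msg-{i}".encode())
b.flush()  # flush the remainder, as a linger timeout would
print(len(b.flushed))  # 3 round trips instead of 7
```

Raising `max_batch_size` cuts round trips further (throughput) but makes the first message in a batch wait longer before it is sent (latency).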
4. Topics and Partitions
Messages in Kafka are categorized into topics; a topic is simply a named entity within Kafka. The closest analogies for a topic are a database table or a folder in a filesystem. Topics are additionally broken down into a number of partitions.
A partition is a single log, and each topic can have one or more partitions. Messages are written to a partition in an append-only fashion and are read in order from beginning to end. Each partition is an ordered, immutable sequence of records, and Kafka assigns a sequential number called an offset to each message as it is written. Note that because a topic typically has multiple partitions, each independent of the others, there is no guarantee of message time-ordering across the entire topic, only within a single partition. Partitions are also the way Kafka provides redundancy and scalability. Each partition can be hosted on a different server, which means a single topic can be scaled horizontally across multiple servers to provide performance far beyond the ability of a single server.
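The partition-as-log idea can be sketched in a few lines; `Partition` here is a toy in-memory stand-in, not Kafka's on-disk log.

```python
class Partition:
    """An append-only log; each record gets a sequential offset."""
    def __init__(self):
        self.log = []

    def append(self, message: bytes) -> int:
        self.log.append(message)
        return len(self.log) - 1  # offset of the message just written

    def read_from(self, offset: int):
        return self.log[offset:]  # read in order, beginning to end

p = Partition()
for msg in (b"a", b"b", b"c"):
    p.append(msg)          # assigns offsets 0, 1, 2
print(p.read_from(1))      # [b'b', b'c']
```

Ordering holds only within this one log; a topic with several such partitions gives no ordering guarantee across them.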
5. Producers and Consumers
Kafka clients are users of the system, and there are two basic types: producers and consumers. There are also advanced client APIs: the Kafka Connect API for data integration, which pulls data from an external source (a database, file system, Elasticsearch, etc.) and pushes it into a Kafka topic, and Kafka Streams for stream processing, which takes data from Kafka, performs simple to complex transformations on it, and puts it back into Kafka. The advanced clients use producers and consumers as building blocks and provide higher-level functionality on top.
Producers create new messages. In general, a message is produced to a specific topic. By default, the producer doesn’t care which partition a specific message is written to and will balance messages over all partitions of a topic evenly. In some cases, the producer will direct messages to specific partitions. This is typically done using the message key and a partitioner that generates a hash of the key and maps it to a specific partition.
Consumers read messages. A consumer subscribes to one or more topics and reads the messages in the order in which they were produced. The consumer keeps track of which messages it has already consumed by tracking the offset of messages: the offset is another bit of metadata, an integer value that continually increases, which Kafka adds to each message as it is produced. By storing the offset of the last consumed message for each partition, either in ZooKeeper or in Kafka itself (in a topic called __consumer_offsets), a consumer can stop and restart without losing its place.
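A toy sketch of offset-based resumption follows; the `log` list and `committed` dict stand in for a partition and the __consumer_offsets topic, and the function names are illustrative, not Kafka API calls.

```python
# Simulated partition log and committed-offset store.
log = [b"m0", b"m1", b"m2", b"m3", b"m4"]
committed = {}  # (group, partition) -> next offset to read

def poll(group, partition, committed, log, max_records=2):
    start = committed.get((group, partition), 0)
    records = log[start:start + max_records]
    # Commit the new position after processing the records.
    committed[(group, partition)] = start + len(records)
    return records

first = poll("app", 0, committed, log)   # reads m0, m1
# ... the consumer stops and restarts; committed offsets survive ...
second = poll("app", 0, committed, log)  # resumes at m2, not back at m0
print(first, second)
```

Because the committed offset outlives the consumer process, a restart picks up exactly where the previous run left off.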
Consumers work as part of a consumer group: one or more consumers that work together to consume a topic. In this way, consumers can scale horizontally to consume topics with a large number of messages. Additionally, if a single consumer fails, the remaining members of the group will rebalance the partitions being consumed to take over for the missing member. Consumer groups are used for scalable message consumption, and each application typically has its own consumer group. A Kafka broker, acting as the group coordinator, manages the consumer group.
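The rebalance behavior can be sketched with a simple round-robin assignment. This is a simplification: Kafka's actual assignors (such as range and round-robin) are negotiated through the group coordinator rather than computed by a single function.

```python
def assign(partitions, consumers):
    """Round-robin assignment of partitions to the live members of a group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3, 4, 5]
print(assign(partitions, ["c1", "c2", "c3"]))
# If c2 fails, a rebalance reassigns its partitions to the survivors:
print(assign(partitions, ["c1", "c3"]))
```

Each partition is consumed by exactly one member of the group, which is why adding consumers beyond the partition count yields idle members.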
6. Brokers and Clusters
A single Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and commits them to storage on disk. It also services consumers, responding to fetch requests for partitions with the messages that have been committed to disk.
Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one broker also functions as the cluster controller (elected automatically from the live members of the cluster). The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures. A partition is owned by a single broker in the cluster, called the leader of the partition. A partition may be assigned to multiple brokers, which results in the partition being replicated. This provides redundancy of messages in the partition, so that another broker can take over leadership if there is a broker failure. However, all consumers and producers operating on that partition must connect to the leader.
A key feature of Kafka is retention: the durable storage of messages for some period of time. Kafka brokers are configured with a default retention setting for topics, either retaining messages for some period of time (e.g., 7 days) or until the topic reaches a certain size in bytes (e.g., 1 GB). Once these limits are reached, messages are expired and deleted, so the retention configuration defines a minimum amount of data available at any time. Individual topics can also be configured with their own retention settings, so that messages are stored only as long as they are useful.
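As a sketch, the broker-wide defaults above correspond to settings like these in the broker configuration (`server.properties`); note that the size limit applies per partition rather than per topic:

```properties
# Time-based retention: keep messages for 7 days (168 hours).
log.retention.hours=168
# Size-based retention: expire once a partition log reaches 1 GB.
log.retention.bytes=1073741824
```

Per-topic overrides use the topic-level properties `retention.ms` and `retention.bytes`, which can be set at topic creation or changed later, for example with `kafka-configs.sh --alter --entity-type topics --entity-name my-topic --add-config retention.ms=86400000` (the topic name here is hypothetical).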
7. Kafka as a Distributed Streaming System
What is a Distributed System?
A distributed system is a collection of systems working together to deliver a value.
Key characteristics of a distributed system
- Availability and Fault Tolerance: If one of the systems goes down, this does not impact the overall availability of the system; client requests are still handled gracefully.
- Reliable Work Distribution: Requests received from clients are, in general, distributed evenly among the available systems.
- Easily Scalable: Adding new systems to the existing setup is easy.
How a Kafka cluster distributes client requests
1. How topics are distributed
First, a request to create a new topic is sent to Apache ZooKeeper. ZooKeeper then redirects the request to the controller. The role of the controller is to distribute ownership of the partitions among the available brokers. This concept of distributing partitions to brokers is called leader assignment. In this way, the topic is distributed across the Kafka cluster.
2. How Kafka distributes client requests: Kafka Producer
Kafka producers create messages and hand them to the partitioner. The partitioner determines which partition each message should be sent to and sends it to the leader of that partition. The message is then persisted to the broker’s file system.
Client requests from producers are thus distributed among the brokers based on the partition, which indirectly means the load is distributed across the brokers.
3. How Kafka distributes client requests: Kafka Consumer
Kafka consumers pull messages from their configured topics. Requests go to all of the topic’s partitions and retrieve messages from them. The retrieved messages are handed over to the consumer, which processes them.
From the consumer’s perspective as well, requests to retrieve data are distributed among the brokers: the client only goes to the leader of each partition to retrieve data.
- Partition leaders are assigned during topic creation.
- Clients will only invoke the leader of the partition to produce and consume data.
- Load is evenly distributed between the brokers.
8. How Kafka handles data loss
For example, suppose a Kafka producer sends a message to Partition 0, and Broker 1, the leader replica of Partition 0, receives it. At this point there is one copy of the message, but since the replication factor (the number of copies of the message) is three, there should be two more copies. The remaining brokers therefore replicate the partition to themselves; these copies are called follower replicas.
Let’s say Broker 1 goes down. The partition’s data is still available on Broker 2 and Broker 3, so there is no data loss. ZooKeeper notifies the controller of the failure, and the controller assigns a new leader for the partition.
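This failover scenario can be sketched as a toy simulation. The broker names, the election rule, and the collapsed write-plus-replication step are all simplifications; real Kafka elects the new leader from the in-sync replica set via the controller.

```python
# Toy replicated partition: replication factor 3 across three brokers.
replicas = {"broker1": [], "broker2": [], "broker3": []}
leader = "broker1"
alive = {"broker1", "broker2", "broker3"}

def produce(message):
    # Leader write and follower replication, collapsed into one step.
    for broker in alive:
        replicas[broker].append(message)

produce(b"m0")
produce(b"m1")

# Broker 1 (the leader) fails; a new leader is elected from the
# surviving replicas.
alive.discard("broker1")
leader = sorted(alive)[0]  # simplistic election rule for the sketch

print(leader, replicas[leader])  # the survivors still hold m0 and m1
```

Because every live replica holds the full log, any surviving broker can take over leadership without losing messages.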