Apache Kafka Core Concepts
In this article, I will introduce you to Apache Kafka and the core concepts associated with it. I am assuming that you have at least heard about Apache Kafka and already know that it is an open source project. Kafka was initially developed at LinkedIn and later open sourced in 2011. Since then it has evolved and established itself as a standard tool for building real-time data pipelines. Now it is securing its share in real-time streaming applications as well.
The Kafka documentation says it is a distributed streaming platform. That is good as a definition. But I want to know what it can do for me, or what I can do with Apache Kafka.
A messaging system
The official documentation says it is similar to an enterprise messaging system. I guess you already understand messaging systems. In a typical messaging system, there are three components.
- Producer
- Broker
- Consumer
The producers are the client applications that send messages. The brokers receive messages from producers and store them. The consumers read the message records from the brokers.
Apache Kafka Use Case
A typical messaging system appears very simple. Now let us look at the data integration problem in a large organization. I borrowed the figure below from Jay Kreps' blog.
The first part of the diagram shows the data integration requirement in a large enterprise. Does it look like a mess? There are many source systems and multiple destination systems, and you are given the task of creating data pipelines to move data among these systems. For a growing company, the number of source and destination systems keeps getting bigger. Eventually, your data pipeline becomes a mess, and some part of it breaks every day.
However, if we use a messaging system to solve this kind of integration problem, the solution is neater and cleaner, as shown in the second part of the figure.
That is the idea the team at LinkedIn discovered. They started evaluating existing messaging systems, but none of them met their criteria for the desired throughput and scale. So they ended up creating Kafka.
What is Apache Kafka?
So at the core, Kafka is a highly scalable and fault-tolerant enterprise messaging system. Let us look at the Kafka diagram from the official documentation.
The first thing is the producer. I hope you already understand producers. They send messages to the Kafka cluster.
The second thing is the cluster. A Kafka cluster is nothing but a bunch of brokers running on a group of computers. They take message records from producers and store them in the Kafka message log.
At the bottom, we have consumer applications. The consumers read messages from the Kafka cluster, process them, and do whatever they want with them. They might send the messages to Hadoop, Cassandra, or HBase, or push them back into Kafka for someone else to read the modified or transformed records.
Now let us turn our focus to the other two things in the above diagram.
Kafka offers such fantastic throughput and scalability that you can easily handle a continuous stream of messages. So, if you plug a stream processing framework into Kafka, it can be the backbone infrastructure for a real-time stream processing application. Hence the diagram shows some stream processing applications. They read a continuous stream of data from Kafka, process it, and then either store it back in Kafka or send it directly to other systems. Kafka provides stream processing APIs as well, so you can do a lot of things using the Kafka Streams API, or you can use other stream processing frameworks like Spark Streaming or Storm.
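To make this concrete, here is a minimal sketch of a Kafka Streams application. The broker address localhost:9092 and the topic names input-topic and output-topic are assumptions for the demo; the application simply reads a stream, uppercases each value, and writes the result back to Kafka.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed local broker and an illustrative application id
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read a continuous stream, transform each record, write it back to Kafka
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```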
The next thing is the Kafka connector. Connectors are another compelling feature. They are ready-to-use connectors to import data from databases into Kafka or export data from Kafka to databases. They are not just out-of-the-box connectors but also a framework to build specialized connectors for any other application.
Kafka is a distributed streaming platform. You can use it as an enterprise messaging system, and not just a traditional one: you can use it to simplify complex data pipelines made up of a vast number of producers and consumers. You can use it as a stream processing platform. It also provides connectors to import and export bulk data from databases and other systems.
However, implementing Apache Kafka is not that simple. There is no plug-and-play component. You need to use APIs and write a bunch of code. You need to understand some configuration parameters and tune or customize Kafka's behavior according to your requirements and use case.
Let's talk about some basic concepts associated with Kafka. I will try to introduce you to the following terminologies.
- Producer
- Consumer
- Broker
- Cluster
- Topic
- Partitions
- Offset
- Consumer groups
What is a Kafka producer?
The producer is an application that sends data. Some people call it data, but we will call it a message or a message record. These messages can be anything from a simple string to a complex object. Ultimately it is a small or medium-size piece of data. The message may have a different meaning or schema for us, but for Kafka, it is a simple array of bytes. For example, if I want to push a file to Kafka, I will create a producer application and send each line of the file as a message. In that case, a message is a line of text, but for Kafka, it is just an array of bytes. Similarly, if I want to send all the records from a table, I will send each row as a message. Or if I want to send the result of a query, I will create a producer application, fire the query against my database, collect the result, and start sending each row as a message. So, while working with Kafka, if you want to send some data, you have to create a producer application.
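As a rough illustration of the file example, here is a minimal producer sketch using Kafka's Java client. The broker address localhost:9092 and the topic name file-lines are assumptions for the demo, and the file is stubbed as an in-memory array of lines.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LineProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumes a broker is reachable at localhost:9092
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each line of the "file" becomes one message record on the topic
            String[] lines = {"first line", "second line", "third line"};
            for (String line : lines) {
                producer.send(new ProducerRecord<>("file-lines", line));
            }
        }
    }
}
```

For Kafka, each of those lines is just an array of bytes; the StringSerializer is what turns our text into bytes before it goes out.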
What is a Kafka consumer?
So the consumer is an application that receives data. If producers are sending data, they must be sending it to someone, right? The consumers are the recipients. But remember that the producers don't send data to a recipient address. They just send it to the Kafka server, and anyone who is interested in that data can come forward and take it from the Kafka server. So, any application that requests data from a Kafka server is a consumer, and it can ask for data sent by any producer, provided it has permission to read it.
So, continuing the file example, if I want to read the file sent by a producer, I will create a consumer application and request Kafka for the data. The Kafka server will send me some messages. The client application will receive some lines from the Kafka server, process them, and request some more messages. The client keeps asking for data, and Kafka keeps giving out message records as long as new messages are coming from the producer.
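Here is a matching consumer sketch, again assuming a broker at localhost:9092 and the hypothetical topic file-lines. It subscribes to the topic and keeps polling for new records, mirroring the request-process-repeat loop described above.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LineConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "file-readers");  // illustrative consumer group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("file-lines"));
            // Keep polling; Kafka returns records as long as the producer keeps sending
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
            }
        }
    }
}
```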
What is a Kafka broker?
The broker is the Kafka server. It's just a meaningful name given to the Kafka server. And this name makes sense as well because all that Kafka does is act as a message broker between producer and consumer. The producer and consumer don't interact directly. They use Kafka server as an agent or a broker to exchange messages.
What is a Kafka cluster?
If you have any background in distributed systems, you already know that a cluster is a group of computers acting together for a common purpose. Since Kafka is a distributed system, the cluster has the same meaning for Apache Kafka. It is merely a group of computers, each executing one instance of a Kafka broker.
What is a Kafka topic?
We learned that a producer sends data to the Kafka broker, and a consumer can then ask the broker for data. But the question is, which data? We need some identification mechanism to request data from a broker. There comes the notion of the topic. A topic is an arbitrary name given to a data set; better, it is a unique name for a data stream. For example, we create a topic called Global Orders, and every point of sale may have a producer. They all send their order details as messages to the single topic named Global Orders, and a subscriber interested in orders can subscribe to the same topic.
What is a Kafka partition?
By now, you have learned that the broker stores data for a topic. This data can be enormous; it may be larger than the storage capacity of a single computer. In that case, the broker may have a challenge storing that data. One of the prominent solutions is to break it into two or more parts and distribute it across multiple computers.
Kafka is a distributed system that runs on a cluster of machines. So it is evident that Kafka can break a topic into partitions and store each partition on a different computer. And that is what a partition means.
You might be wondering how Kafka decides on the number of partitions. I mean, some topics may be large, but others may be relatively small. So how does Kafka know whether it should create 100 partitions or whether just ten would be enough?
The answer is simple. Kafka doesn't make that decision; we do. When we create a topic, we make that decision, and the Kafka broker creates that many partitions for the topic. But remember that every partition sits on a single machine; you can't split it further. So do some estimation and simple math to calculate the number of partitions.
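Here is a sketch of making that decision in code, using Kafka's Java AdminClient. The topic name global-orders, the partition count of ten, and the replication factor of one are illustrative choices for a local setup, not recommendations.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // We, not Kafka, choose the partition count at creation time:
            // ten partitions, replication factor 1 (numbers chosen for illustration)
            NewTopic topic = new NewTopic("global-orders", 10, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```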
What is a Kafka offset?
The offset is simple. It is a sequence number of a message in a partition. This number is assigned as the messages arrive at the partition, and once assigned, these numbers never change. They are immutable.
This sequencing means that Kafka stores messages in the order of arrival within a partition. The first message gets offset zero, the next message gets offset one, and so on for every following message. But remember that there is no global offset across partitions. Offsets are local to the partition. So if you want to locate a message, you need to know three things: the topic name, the partition number, and the offset number. If you have these three things, you can locate a message directly.
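The following sketch shows how a consumer can use those three coordinates to jump straight to a message. The topic name global-orders, partition number 3, and offset 42 are made-up values for illustration.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class LocateMessage {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Topic name, partition number, offset number: the three coordinates of a message
            TopicPartition partition = new TopicPartition("global-orders", 3);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 42L);  // jump straight to offset 42 in that partition

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```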
What is a consumer group?
It is a group of consumers. Many consumers form a group to share the work. You can think of it as one large task that you want to divide among multiple people. So you create a group, and the members of the same group share the work. Let me give you an example.
Let's assume that we have a retail chain. In every store, there are a few billing counters, and you want to bring all of the invoices from every billing counter to your data center. Since you have learned Kafka, and you find Kafka an excellent solution to transport data from the billing locations to the data center, you decide to implement it.
The first thing you might want to do is to create a producer at every billing site. These producers send the bills as messages to a Kafka topic. The next thing you might want to do is create a consumer. The consumer reads data from the Kafka topic and writes it to your data center. It sounds like a perfect solution, right?
But there is a small problem. Think of the scale. You have hundreds of producers pushing data into a single topic. How will you handle that volume and velocity? You learned Kafka exceptionally well, so you decide to create a large Kafka cluster and partition your topic. Now your topic is partitioned and distributed across the cluster, and several brokers share the workload of receiving and storing data. On the source side, you have many producers and several brokers to share the workload. What about the destination side?
You have a single unfortunate consumer. There comes the consumer group. You create a consumer group and start executing many consumers in the same group, telling them to share the workload. So far so good. But how do we split the work?
That's not a difficult question. Say I have 600 partitions and I am starting 100 consumers. Then each of the consumers takes six partitions. We will see: if they can't handle six partitions each, we will start some more consumers in the same group. We can go up to 600 consumers, so that each of them has just one partition to read.
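As a sketch of that scaling story, here is a consumer that belongs to a group. You would run many copies of this same program; because they all share the same group.id (the hypothetical invoice-loaders), Kafka's group coordinator divides the topic's partitions among the running copies.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class InvoiceWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Every copy of this program uses the same group.id, so Kafka assigns
        // each copy a share of the topic's partitions
        props.put("group.id", "invoice-loaders");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("global-orders"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Write the invoice record to the data center store (stubbed here)
                    System.out.println(record.value());
                }
            }
        }
    }
}
```

Adding or removing copies triggers a rebalance, and Kafka reassigns the partitions across whatever consumers are currently in the group.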
If you followed this example correctly, you understand that partitioning and consumer groups are tools for scalability. Also realize that the maximum number of consumers in a group is equal to the total number of partitions on a topic. Kafka doesn't allow more than one consumer in a group to read from the same partition simultaneously. This restriction is necessary to avoid double reading of records.