Your Guide to How Apache Kafka Works

Royal Cyber Inc.
6 min read · Jul 21, 2022

Apache Kafka is an open-source platform for handling data in motion: it takes in streaming data and relays it to its ultimate destination. In other words, Kafka is a messaging system that makes communication between systems easier and the management of data pipelines simpler. Modern organizations use Apache Kafka to manage notably high throughputs. The platform was originally developed at LinkedIn and later donated to the Apache Software Foundation.

Kafka Architecture

It is essential to understand the overall structure of Apache Kafka to grasp how it functions. Every process in Kafka starts with the writing of an event, a message containing relevant data. A producer is the entity (for instance, a server, an application, or an IoT device) that creates this message and sends it to Kafka. The cluster then delivers the message to a consumer, which reads the data. The consumer can be another application, a data lake, or a database.

Kafka has a distributed architecture: a Kafka cluster consists of numerous servers known as brokers, and each broker can host one or more topics. You can imagine topics as folders that keep events in an ordered sequence; these categories make data easy to fetch.

A topic is further divided into many partitions. Within a partition, each record is assigned a sequential id known as an offset, and offsets make the storage and retrieval of data fast. Partitions are the smallest functional units of Apache Kafka. It is important to determine the data rate of a partition beforehand, because it tells you how much retention space is needed: the data rate is simply how much data arrives in a given amount of time.
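To make that concrete with illustrative numbers (not from any specific deployment): a partition that ingests 5 MB of data per second and must retain events for 24 hours needs roughly 5 MB/s × 86,400 s ≈ 432 GB of disk, before replication is taken into account.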

Kafka is fault-tolerant and reliable because it keeps multiple copies of a data set within a cluster. Therefore, there is no data loss even if a node (Kafka broker) goes down; the user can still access the data via other brokers. To enable this property, you configure the Kafka replication factor.

The replication factor determines how many copies of the data will be stored on different brokers within the cluster. For example, if it is set to 3, then three copies of each partition will be kept.
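For instance, with a replication factor of 3 on a cluster of three brokers, each partition has one leader replica and two followers on separate brokers, so the data survives the failure of up to two brokers.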

Kafka Architecture

Practical Example

Let’s suppose you have been assigned a project: you are supposed to help a company decongest a national highway by analyzing road traffic with the help of data collected from different toll plazas.

As a data engineer, your job will be to extract the specifics, i.e., vehicle ids, vehicle types, toll plaza ids, and timestamps, and feed this data into a database.

For this purpose, you will develop a data pipeline and stream the data into Kafka. To get the job done, you need access to a Kafka setup that comes with the following elements:

· Zookeeper Server

· Broker Server

· Producer (e.g., cameras) and Consumer

To simulate the environment, you can use Python scripts, and a MySQL database can be used for saving the harvested data.

Follow the steps below.

First off, start MySQL, then use your username and password to log in to the MySQL shell.

Create a database “tolldata” and a table “livetolldata” with the relevant fields, as sketched below.
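Here is a minimal sketch of this step using the mysql-connector-python client. The exact schema is not shown in the original walkthrough, so the column names and types below are assumptions based on the fields listed earlier (timestamp, vehicle id, vehicle type, toll plaza id), and the credentials are placeholders:

```python
import mysql.connector

# Connect to the local MySQL server (placeholder credentials).
conn = mysql.connector.connect(host="localhost", user="root", password="password")
cursor = conn.cursor()

# Create the database and the table that will hold the streamed toll events.
cursor.execute("CREATE DATABASE IF NOT EXISTS tolldata")
cursor.execute("USE tolldata")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS livetolldata (
        timestamp DATETIME,
        vehicle_id INT,
        vehicle_type VARCHAR(15),
        toll_plaza_id SMALLINT
    )
""")

conn.close()
```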

Install the two modules that let Python communicate with Kafka and the SQL server; the sketches here use kafka-python and mysql-connector-python, both installable with pip.

Start the Zookeeper server with its config file. In the file, you just have to specify the data directory and the client port (2181 by default). In a standard Kafka distribution, the command is bin/zookeeper-server-start.sh config/zookeeper.properties.

Now, start the Kafka broker server (bin/kafka-server-start.sh config/server.properties in a standard distribution).

Create the topic “toll” into which data will be streamed.
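Topics are usually created with Kafka’s kafka-topics.sh CLI, but since this pipeline is in Python anyway, one way to do it is with kafka-python’s admin client; the single partition and replication factor of 1 are assumptions that fit a single-broker setup:

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the local broker and create the "toll" topic.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="toll", num_partitions=1, replication_factor=1)])
admin.close()
```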

Run a Python script as the producer. Values, for instance a vehicle id, are generated here and broadcast to the Kafka cluster; in effect, the script simulates the environment of a toll plaza.
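The original scripts are not reproduced in this post, so the following producer is only a minimal sketch of what they might look like: it fabricates random toll events (placeholder values, not real traffic data) and streams them to the “toll” topic with kafka-python:

```python
import random
import time
from datetime import datetime

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

VEHICLE_TYPES = ("car", "truck", "van")

# Simulate one toll event per second and stream it to the "toll" topic
# as a comma-separated record: timestamp, vehicle id, type, plaza id.
while True:
    event = "{},{},{},{}".format(
        datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        random.randint(10000, 99999),
        random.choice(VEHICLE_TYPES),
        random.randint(1, 50),
    )
    producer.send("toll", event.encode("utf-8"))
    time.sleep(1)
```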

Until now, the producer was producing the data being fed to the cluster. Now you have to start a consumer that will consume the data and save it in the SQL database. To achieve this, you execute a second Python script that reads data from Kafka into the database. Again, you will import some libraries with the relevant specifics: you have to indicate the topic from which the data will be streamed and specify the database that will be the destination of that streamed data.

Once the database is connected, a loop will run for every vehicle passing the toll plaza, as in the sketch below.
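This is a minimal consumer sketch under the same assumptions as above (kafka-python, mysql-connector-python, placeholder credentials, and the comma-separated record format used by the simulated producer):

```python
import mysql.connector
from kafka import KafkaConsumer

# Subscribe to the "toll" topic, starting from the earliest record.
consumer = KafkaConsumer(
    "toll",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)

# Destination database (placeholder credentials).
conn = mysql.connector.connect(
    host="localhost", database="tolldata", user="root", password="password"
)
cursor = conn.cursor()

# One iteration per vehicle passing the toll plaza.
for message in consumer:
    timestamp, vehicle_id, vehicle_type, plaza_id = (
        message.value.decode("utf-8").split(",")
    )
    cursor.execute(
        "INSERT INTO livetolldata VALUES (%s, %s, %s, %s)",
        (timestamp, vehicle_id, vehicle_type, plaza_id),
    )
    conn.commit()
```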

Let’s query the database to check for toll data and confirm that records are arriving.
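For example, again with the placeholder credentials used above:

```python
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", database="tolldata", user="root", password="password"
)
cursor = conn.cursor()

# Count the events captured so far and show the five most recent ones.
cursor.execute("SELECT COUNT(*) FROM livetolldata")
print("rows so far:", cursor.fetchone()[0])
cursor.execute("SELECT * FROM livetolldata ORDER BY timestamp DESC LIMIT 5")
for row in cursor.fetchall():
    print(row)

conn.close()
```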

The whole process can be summarized in a simple diagram given below.

Single Broker Cluster of Kafka Architecture

What Gives Kafka an Edge?

Apache Kafka has become one of the most widely used tools for handling streaming data. It provides the following benefits to its users:

· High scalability

· Support for high throughput

· Fault-tolerance

· Zero data loss

· Resilience and durability

When to Use Kafka?

Apache Kafka finds its use cases in:

· Stream processing

· Activity tracking

· Log aggregation

· Data processing

· Generating metrics

Conclusion

In a nutshell, Apache Kafka can be defined as a stream-processing platform that brings simplicity and scalability to data management. Kafka is capable of processing large volumes of data in real time, and you can add as many machines to the system as needed. Since the platform acts as a middleman between servers, deploying Kafka makes working with data in motion much easier.

Data engineers and scientists at Royal Cyber Inc. are experienced in using Apache Kafka to process, store, and analyze streaming data. Feel free to contact the experts to get your queries answered.

Author bio:

Hassan Sherwani is the Head of Data Analytics and Data Science at Royal Cyber. He holds a PhD in IT and data analytics and has acquired a decade’s worth of experience in the IT industry, startups, and academia. Hassan is also gaining hands-on experience in machine (deep) learning for the energy, retail, banking, law, telecom, and automotive sectors as part of his professional development endeavors.

