Today, real-time content is continuously generated by a large number of systems, and it needs to be routed quickly to many different types of consumers. Wiring producers directly to consumers doesn't scale: it can overload the producers, and every change requires code rewrites on both sides. What's needed is a mechanism that decouples producers from consumers without forcing either side to rewrite its code. A further challenge in this scenario is collecting the huge amount of data to be analyzed. Here's where Kafka comes into its own. But what is Kafka? Well, in a nutshell, it's an open source message broker written in Scala. It was originally developed at LinkedIn, released as open source in 2011, and is currently maintained by the Apache Software Foundation.

Why might one prefer Kafka to a traditional JMS message broker? Here are my two cents:

  • It’s fast: a single Kafka broker running on commodity hardware can handle hundreds of megabytes of reads and writes per second from thousands of clients.
  • Great scalability: it can be easily and transparently expanded without downtime.
  • Durability and replication: messages are persisted on disk and replicated within the cluster to prevent data loss (by tuning the many available configuration parameters you can even achieve zero data loss; see the producer sketch after this list).
  • Performance: each broker can handle terabytes of messages without performance degradation.
  • It supports real-time stream processing.
  • It can be easily integrated with other popular open source systems for Big Data architectures like Hadoop, Spark and Storm.
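To make the durability point concrete, here is a minimal sketch of producer-side settings commonly tuned toward zero data loss, written in Java against the standard kafka-clients producer API (the broker address, topic name, and retry count are illustrative assumptions, not values from this article):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DurableProducerConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            // Durability-oriented settings: wait for all in-sync replicas to
            // acknowledge each write, and retry transient send failures.
            props.put("acks", "all");
            props.put("retries", "3");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "key", "value"));
            } // close() flushes any buffered messages
        }
    }

Note that acks=all only covers the producer side: zero data loss also depends on server-side settings such as the topic's replication factor and min.insync.replicas.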

These are the core concepts of Kafka you need to become familiar with:

  • Topics: categories or feed names to which incoming messages are published (a topic-creation sketch follows this list).
  • Producers: any entity that publishes messages to a topic.
  • Consumers: any entity that subscribes to topics and consumes messages from them.
  • Brokers: the services that persist published messages and serve read and write requests.
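To see topics, brokers, and replication working together, here is a minimal sketch that creates a topic, assuming a reasonably recent Java client (which ships an AdminClient) and a locally reachable broker; the topic name, partition count, and replication factor are illustrative:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // Topic "page-views": 3 partitions, each replicated on 2 brokers
                NewTopic topic = new NewTopic("page-views", 3, (short) 2);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }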

The following diagram shows a typical Kafka cluster architecture:
[Diagram: typical Kafka cluster architecture]

Kafka uses ZooKeeper behind the scenes to keep its nodes in sync. The Kafka distribution bundles ZooKeeper, so if the hosting machines don't already have it on board you can use the one that comes with Kafka.

The communication between clients and servers happens over a high-performance, language-agnostic TCP protocol.

Although Kafka is implemented in Scala, don't worry if you are not familiar with this programming language: APIs to build producers and consumers are available in Java and other languages.
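As an example of those APIs, here is a bare-bones producer and consumer round trip in Java. It's a sketch assuming a recent kafka-clients library and a local broker; the topic name and consumer group are illustrative:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class HelloKafka {
        public static void main(String[] args) {
            // Publish a single message to the "greetings" topic
            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("greetings", "hello", "Hello, Kafka!"));
            }

            // Subscribe to the same topic and read the message back
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "demo-group");
            consumerProps.put("auto.offset.reset", "earliest");
            consumerProps.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(Collections.singletonList("greetings"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }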

There are several use cases for Kafka. Here's a short list, though there are many more scenarios you could add:

  • Messaging
  • Stream processing
  • Log aggregation
  • Metrics
  • Web activity tracking
  • Event sourcing

This was just a quick introduction to Kafka for newbies. In a future article we will go deep into its design details.
