In this blog post, I will give a high-level overview of an open-source system called Apache Kafka, which was originally developed at LinkedIn.
- Why Apache Kafka & what it is
- Use Cases of Apache Kafka
- Who uses Kafka?
- Kafka Ecosystem
- Key Terminology
- Kafka Ecosystem: Extended API
- Kafka Ecosystem: Confluent Components Schema Registry and REST proxy
- Kafka in the Enterprise Architecture
- Kafka Ecosystem: Administration and Monitoring Tools
What is Kafka?
Briefly, Apache Kafka is a distributed streaming platform.
Why Apache Kafka?
You might be wondering, why Apache Kafka? Let's take a very simple example.
At first, you have one source system and one target system. These could be your website and your database. It's very simple at first.
Then it becomes complicated, because you may have many source systems, such as your websites, your email clients, or whatever else, and you may have many target systems as well, such as several databases and various other data tools. It gets quite complicated when you have to integrate every source system with every target system. The implementation is tedious, and each integration may involve a different protocol.
So Apache Kafka sits in the middle and helps you decouple your data streams. As an example, on one side we have all the sources: website events, pricing data, financial transactions, and user interactions. They all publish data to Apache Kafka. Your databases, analytics, email systems, and audits then just take the data from Kafka and do whatever they want with it.
The point is that Apache Kafka is a middle layer that allows you to decouple your data streams. Your website events only need to worry about pushing data to Kafka, and that's it.
And your databases can receive data from any source just by reading from Apache Kafka. So it's a very, very strong concept.
On top of decoupling, Apache Kafka does a little more:
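To put a rough number on why the middle layer helps, here is a small back-of-the-envelope sketch in Python. It assumes each direct source-to-target pairing needs its own integration, while with Kafka each system only needs one integration with Kafka itself. The function names are made up for illustration.

```python
def point_to_point_links(num_sources: int, num_targets: int) -> int:
    """Every source is wired directly to every target."""
    return num_sources * num_targets


def links_through_kafka(num_sources: int, num_targets: int) -> int:
    """Every system only talks to the middle layer."""
    return num_sources + num_targets


sources = ["website events", "pricing data", "financial transactions", "user interactions"]
targets = ["database", "analytics", "email system", "audit"]

direct = point_to_point_links(len(sources), len(targets))
via_kafka = links_through_kafka(len(sources), len(targets))

print(f"direct integrations: {direct}")        # 4 * 4 = 16
print(f"integrations via Kafka: {via_kafka}")  # 4 + 4 = 8
```

The gap widens quickly: adding a fifth target costs four new integrations in the direct setup, but only one when everything goes through Kafka.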
• Distributed, resilient architecture, fault tolerant
• Horizontal scalability
• High performance — real-time
It is distributed, resilient, and fault tolerant because it runs on many servers. Kafka provides high performance and horizontal scalability, which means we can add servers and scale out to as many servers as required.
Use Cases of Apache Kafka
• Messaging System
• Activity Tracking
• Gather metrics from many different locations
• Application Logs gathering
• Stream processing (with the Kafka Streams API or Spark for example)
• De-coupling of system dependencies
• Integration with Spark, Flink, Storm, Hadoop, and many other Big Data technologies
Key Terminology
• Topics – Kafka maintains feeds of messages in categories called topics.
• Producers – Processes that publish messages to a Kafka topic are called producers.
• Consumers – Processes that subscribe to topics and process the feed of published messages are called consumers.
• Broker – Kafka is run as a cluster comprised of one or more servers, each of which is called a broker.
• Communication between all components is done via a simple, high-performance binary API over TCP.
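To make these terms a bit more concrete, here is a toy in-memory model in Python. It is not the Kafka client API and does not talk to a real cluster; the class names, the topic name "page-views", and the idea of each consumer remembering its own read position are simplifications I am assuming for the sake of illustration.

```python
from collections import defaultdict


class Broker:
    """Toy stand-in for a broker: keeps one append-only list per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def append(self, topic, message):
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1  # position of the new message

    def read(self, topic, position):
        return self.topics[topic][position:]


class Producer:
    """Publishes messages to a topic."""

    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, message):
        return self.broker.append(topic, message)


class Consumer:
    """Subscribes to one topic and remembers how far it has read."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.position = 0

    def poll(self):
        messages = self.broker.read(self.topic, self.position)
        self.position += len(messages)
        return messages


broker = Broker()
producer = Producer(broker)
analytics = Consumer(broker, "page-views")
audit = Consumer(broker, "page-views")

producer.send("page-views", "/home")
producer.send("page-views", "/pricing")
first_batch = analytics.poll()   # analytics reads the first two messages
producer.send("page-views", "/signup")
second_batch = analytics.poll()  # analytics only sees what is new
audit_batch = audit.poll()       # audit reads everything at its own pace
```

The producer never needs to know who reads the topic, and the two consumers do not affect each other, which is the decoupling idea from earlier in miniature.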
Here is a brief introduction to the Kafka ecosystem, which is really important to understand. The Kafka core APIs have been around from the start. At the centre we have the Kafka cluster, and usually there is a source system, which could be an HTTP endpoint, a database, or anything else.
A producer takes data from that source system and writes it straight into Kafka. The Producer API is the first very important API to learn.
Ideally, you then want to move that data from Kafka to a target system. For that purpose we create consumers, and the Consumer API is the other core Kafka API. Finally, the Kafka cluster has traditionally been managed by ZooKeeper.
Kafka Ecosystem: Extended API
Kafka later extended the core APIs. Again we have the Kafka cluster and a source system. The developers noticed that everyone had the same or very similar source systems and kept writing the same code over and over, so they created an API called Kafka Connect.
Kafka Connect basically allows people to publish connectors, which are reusable pieces of code. You just have to configure a connector to get ETL from the source system into Kafka. That means you don't have to write as much code; you reuse someone else's code and configure it for your purpose.
A source could be a database, Elasticsearch, Twitter, or whatever. In the same way, we have a target system, and the Kafka Connect sink API moves data from Kafka into that target system.
Again, the target system can be whatever you want. It could be a database, Amazon S3, or anything else.
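To show what "configure instead of code" can look like, here is a hedged sketch that builds connector definitions as plain Python dictionaries and prints one as JSON, in the shape Kafka Connect accepts when a connector is created over its REST interface. It uses the simple file connectors that ship with Apache Kafka. The connector names, file paths, and topic name are hypothetical, and nothing is sent to a real Connect worker here.

```python
import json

# Source connector: reads lines from a file and publishes them to a topic.
source_connector = {
    "name": "demo-file-source",  # hypothetical connector name
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",  # hypothetical input path
        "topic": "demo-lines",     # hypothetical topic
    },
}

# Sink connector: reads from the same topic and writes the lines to a file.
sink_connector = {
    "name": "demo-file-sink",  # hypothetical connector name
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
        "tasks.max": "1",
        "topics": "demo-lines",
        "file": "/tmp/output.txt",  # hypothetical output path
    },
}

payload = json.dumps(source_connector, indent=2)
print(payload)
```

Notice that there is no data-moving logic anywhere: the reusable code lives in the connector class, and all you supply is configuration.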
Also, you may want to transform your data before you sink it with Kafka Connect. The transformations could be simple mappings, trimming, or filtering, or something more complex such as aggregations and counts. For this, there is the Kafka Streams API.
It can do a lot of these transformations as real-time stream processing, which makes it a big competitor to Spark Streaming.
Finally, when you get to multi-data-centre Kafka, you would have a second Kafka cluster with something called MirrorMaker in between to replicate one cluster to the other. So that's the Kafka extended API.
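To illustrate the kinds of transformations mentioned above (mapping, trimming, filtering, and counting), here is a small plain-Python sketch. It is only a stand-in for the idea: it is not the Kafka Streams API (which is a Java library), it works on a fixed list rather than a live stream, and the event fields are invented for the example.

```python
from collections import Counter

# Hypothetical raw events as they might arrive on a topic.
raw_events = [
    {"user": " alice ", "action": "click"},
    {"user": "bob", "action": "view"},
    {"user": "alice", "action": "click"},
    {"user": "carol ", "action": "click"},
]


def trim(event):
    """Mapping step: strip stray whitespace from the user field."""
    return {**event, "user": event["user"].strip()}


def is_click(event):
    """Filter predicate: keep only click events."""
    return event["action"] == "click"


cleaned = (trim(e) for e in raw_events)                  # map / trim
clicks = (e for e in cleaned if is_click(e))             # filter
clicks_per_user = Counter(e["user"] for e in clicks)     # aggregate: count per user

print(dict(clicks_per_user))
```

In a real streaming job the same three steps would run continuously as new messages arrive, and the counts would be updated incrementally instead of computed once.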
Kafka Ecosystem: Confluent Components Schema Registry and REST proxy
Next we have the Confluent ecosystem. Confluent is a company founded by the original creators of Apache Kafka. Although Apache Kafka is open source, Confluent has created additional components, some proprietary and some open source, to enhance Kafka's features.
So we have a Kafka cluster and a Schema Registry. Your Java producers, same as before, send Avro data to the Kafka cluster, and the schema itself is stored in the Schema Registry. Basically, the Schema Registry allows your messages to be standardized and much more compact, and it strongly enforces the schema.
Likewise, Java consumers consume Avro data from the Kafka cluster, read the schema from the Schema Registry, deserialize the data, and get a strong guarantee about the schema as well.
Producers do not send the schema with every message; they send only the Avro-encoded content and register the schema with the Schema Registry. It's a really good component to know, but it is primarily aimed at Java clients.
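Here is a toy sketch of that idea in Python: the schema is stored once in a registry, and each message carries only a small schema id plus the content. The framing used below (one zero byte, then a 4-byte big-endian schema id, then the payload) follows Confluent's wire format as I understand it, but everything else is an assumption for illustration: the registry is an in-memory class I made up, and JSON stands in for real Avro binary encoding, since Avro is not in the Python standard library.

```python
import json
import struct


class SchemaRegistry:
    """Toy in-memory registry: stores each distinct schema once and returns its id."""

    def __init__(self):
        self._ids = {}
        self._schemas = {}

    def register(self, schema: dict) -> int:
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            schema_id = len(self._ids) + 1
            self._ids[key] = schema_id
            self._schemas[schema_id] = schema
        return self._ids[key]

    def get(self, schema_id: int) -> dict:
        return self._schemas[schema_id]


def serialize(registry, schema, record):
    """Producer side: register the schema, then send only id + content."""
    schema_id = registry.register(schema)
    missing = [f["name"] for f in schema["fields"] if f["name"] not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    body = json.dumps(record).encode("utf-8")  # stand-in for Avro binary encoding
    return struct.pack(">bI", 0, schema_id) + body


def deserialize(registry, message):
    """Consumer side: read the id, fetch the schema, decode the content."""
    _magic, schema_id = struct.unpack(">bI", message[:5])
    schema = registry.get(schema_id)
    return schema, json.loads(message[5:].decode("utf-8"))


registry = SchemaRegistry()
user_schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"}, {"name": "age", "type": "int"}],
}

message = serialize(registry, user_schema, {"name": "alice", "age": 30})
schema, record = deserialize(registry, message)
print(record)
```

However large the schema grows, the per-message overhead stays at five bytes, which is where the "much more compact" benefit comes from.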
Since many languages do not implement that serialization, Confluent also provides a REST Proxy.
The Kafka REST Proxy allows non-Java producers to push data to Kafka with a plain HTTP POST request, working with the Schema Registry as needed. Likewise, non-Java consumers can use HTTP GET requests to consume data from Kafka through the REST Proxy.
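As a rough sketch of what such a POST could look like from a non-Java client, the Python snippet below builds (but does not send) a produce request using only the standard library. The URL path and content type follow the REST Proxy's v2 JSON format as I understand it; the host, port, topic name, and helper function are assumptions for illustration, so check them against your own setup.

```python
import json
import urllib.request

REST_PROXY_URL = "http://localhost:8082"  # assumption: a locally running REST Proxy
TOPIC = "demo-events"                      # hypothetical topic name


def build_produce_request(topic, values):
    """Wrap each value in a record and prepare an HTTP POST for the given topic."""
    body = json.dumps({"records": [{"value": v} for v in values]}).encode("utf-8")
    return urllib.request.Request(
        url=f"{REST_PROXY_URL}/topics/{topic}",
        data=body,
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        method="POST",
    )


request = build_produce_request(TOPIC, [{"page": "/home"}, {"page": "/pricing"}])
print(request.get_method(), request.full_url)

# To actually send it (requires a running REST Proxy):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The client needs nothing beyond an HTTP library, which is the whole appeal for languages without a native Kafka client.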
Kafka in the Enterprise Architecture
Finally, here is what Kafka looks like in a typical enterprise. It may differ from one enterprise to another, but the architecture described here is pretty common.
On the left side we have the data producers. These could be anything: apps, websites, financial systems, email, databases. They all push data to Kafka.
From there we have two pipelines of data:
The first is the real-time pipeline, which usually has components like Spark, Storm, or Flink. They read the data from Kafka in real time and produce real-time analytics, dashboards, and alerts. The possibilities are endless, but it is all real time.
The second is the batch pipeline, in which all the data in Kafka is pushed to the data layer. This could be Amazon S3 or even an RDBMS. From there you can build analytical reports, alerts, and so on.
To understand the whole architecture better, see the infographic created by Nishant.
Who uses Kafka?
Kafka is used by 2000+ companies that handle a lot of data:
❖ LinkedIn: Activity data and operational metrics
❖ Twitter: Uses it as part of Storm – stream processing
❖ Square: Kafka as the bus to move all system events to various Square data centres (logs, custom events, metrics, and so on). Outputs to Splunk, Graphite, Esper-like alerting systems
❖ Netflix, Uber, Spotify, Tumblr, Box, Cisco, PayPal, etc.
Kafka: Administration and Monitoring Tools
• Topics UI (Landoop) : View the content of the topic
• Schema UI (Landoop) : Explore the Schema registry
• Connect UI (Landoop) : Create and monitor Connect tasks
• Kafka Manager (Yahoo) : Overall Kafka Cluster Management
• Burrow (LinkedIn) : Kafka Consumer Lag Checking
• Exhibitor (Netflix) : ZooKeeper Configuration, Monitoring, Backup
• Kafka Monitor (LinkedIn) : Cluster health monitoring
• Kafka Tools (LinkedIn) : Broker and topics administration tasks simplified
• Kafkat (Airbnb) : More broker and topics administration tasks simplified
• JMX Dump : Dump JMX metrics from Brokers
• Control Center / Auto Data Balancer / Replicator (Confluent) : Paid tools
That was all about Apache Kafka. I hope you found the post useful. If you have any questions, feel free to comment and I will try to answer them. See you in the next post!