Kafka

Kafka is a streaming platform (it can also be used for messaging and message storage) that provides the best of both traditional messaging models: message queues and publish-subscribe. At the same time, it serves as durable, distributed storage, guarantees message ordering within a partition, and, above all, is scalable.

Kafka APIs: Producer, Consumer, Streams (allows creating data processing pipelines), and Connect (connectors).

For more details, see the Kafka Introduction.

Kafka clusters can span multiple data centers (durable, distributed storage). Think of each topic as a folder in a file system and each event as a file in that folder. Because each Kafka topic (a stream of records) is stored in partitioned logs (the number of partitions is configurable) for a configurable retention period, multiple clients can read the records in whatever way they want. Hence Kafka topics are multi-subscriber. There is also very little overhead per consumer: the only metadata that needs to be kept is the consumer's offset in each partition. This offset is controlled entirely by the consumer, so a consumer can read records in any order it wants, for example by rewinding to an older offset to reprocess data.
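The offset idea can be illustrated with a small in-memory model. This is only a sketch in plain Python, not Kafka's client API: the names `PartitionLog` and `Reader` are made up for illustration, and a real partition lives on a broker rather than in a list.

```python
class PartitionLog:
    """Toy append-only log standing in for one topic partition (hypothetical)."""

    def __init__(self):
        self._records = []

    def append(self, value):
        # Records are only ever added at the end; the index is the offset.
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        # Reading does not remove anything, so many readers can share the log.
        return self._records[offset:offset + max_records]


class Reader:
    """Toy consumer: the only state it keeps is its own offset (hypothetical)."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        # The reader, not the log, decides where to read next.
        self.offset = offset


log = PartitionLog()
for event in ["e0", "e1", "e2", "e3"]:
    log.append(event)

a = Reader(log)             # starts from the beginning
b = Reader(log, offset=2)   # independent reader starting later

first_a = a.poll(2)
first_b = b.poll()
a.seek(0)                   # rewind and reprocess everything
replay_a = a.poll()
```

Both readers see the same records without affecting each other, which is why adding another subscriber costs little more than one extra offset.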

A partition in a topic is much like a partition in a table (HDFS). It is meant for parallelism and storage scalability: different partitions of a topic can be stored on different servers, but each individual partition must fit on a single server, and partitions are replicated. Events with the same event key (e.g., a customer or vehicle ID) are written to the same partition, and Kafka guarantees that any consumer of a given topic-partition will always read that partition's events in exactly the same order as they were written. Kafka also has the concept of consumer groups, where each consumer instance takes a fair share of the partitions in a topic and processes them independently and in parallel.
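Key-based partitioning and consumer groups can be sketched the same way. The code below is a simplified model, not Kafka's implementation: it assumes a CRC32 hash as a stand-in for the partitioner's real hash function and a plain round-robin split as a stand-in for the group assignment strategy. The names `partition_for` and `assign` are hypothetical.

```python
import zlib

NUM_PARTITIONS = 3  # assumed partition count for the example


def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Same key -> same hash -> same partition, so per-key order is preserved.
    # CRC32 is only a stand-in for the hash a real partitioner would use.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions, consumers):
    # Round-robin split: each partition is owned by exactly one group member.
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


# One list per partition, appended to in arrival order.
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [
    ("car-1", "start"),
    ("car-2", "start"),
    ("car-1", "stop"),
    ("car-2", "stop"),
]
for key, value in events:
    partitions[partition_for(key)].append((key, value))

group = assign(list(partitions), ["consumer-a", "consumer-b"])
```

Because all events for `car-1` land in one partition and that partition has a single owner in the group, one consumer sees them in the order they were written.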

Kafka has a Streams API for writing more involved streaming applications. The Streams API is built on top of the producer and consumer APIs and uses Kafka itself as stateful storage.
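To make "stateful storage" concrete, here is a rough sketch of a stateful processor that counts events per key and records every state change in a changelog, so the state can be rebuilt after a restart. It is plain Python rather than the Streams API; `process`, `restore`, and the in-memory `changelog` list are assumptions for illustration, with the list standing in for a Kafka topic.

```python
from collections import defaultdict


def process(records, state, changelog):
    # Consume records, update local state, and log each update.
    for key, _value in records:
        state[key] += 1
        changelog.append((key, state[key]))


def restore(changelog):
    # Replay the changelog; the last entry per key wins.
    state = defaultdict(int)
    for key, count in changelog:
        state[key] = count
    return state


changelog = []            # stand-in for a topic that stores state updates
state = defaultdict(int)  # local, in-memory state of the processor

process(
    [("car-1", "start"), ("car-2", "start"), ("car-1", "stop")],
    state,
    changelog,
)

# Simulate a restart: local state is lost and rebuilt from the changelog.
recovered = restore(changelog)
```

The point of the design is that the processor's local state is disposable: as long as the log of updates is durable, replaying it reproduces the same counts.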
