Notes on Kafka
What is Kafka anyway? History of Kafka
Many new products and analysis just came from putting together multiple pieces of data that had previously been locked up in specialized systems.
This (Kafka 'log') architecture also raises a set of different options for where a particular cleanup or transformation can reside:
1. It can be done by the data producer prior to adding the data to the company-wide log.
2. It can be done as a real-time transformation on the log (which in turn produces a new, transformed log).
3. It can be done as part of the load process into some destination data system.
The best model is to have cleanup done prior to publishing the data to the log by the publisher of the data. This means ensuring the data is in a canonical form and doesn't retain any hold-overs from the particular code that produced it or the storage system in which it may have been maintained. These details are best handled by the team that creates the data since they know the most about their own data. Any logic applied in this stage should be lossless and reversible.
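A minimal sketch of what "canonical, lossless and reversible" cleanup on the publisher side could look like. The field names (`usr_id`, `ts_ms`, `evt`) and the mapping are hypothetical, invented for illustration; the point is only that nothing is dropped and the original record can be rebuilt.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical mapping from one producer's internal field names to canonical ones.
FIELD_MAP = {"usr_id": "user_id", "ts_ms": "occurred_at", "evt": "event_type"}
REVERSE_MAP = {canonical: internal for internal, canonical in FIELD_MAP.items()}


def to_canonical(raw: dict) -> dict:
    """Rename fields and express the timestamp as ISO 8601 UTC; nothing is dropped."""
    out = {FIELD_MAP.get(key, key): value for key, value in raw.items()}
    moment = EPOCH + timedelta(milliseconds=out["occurred_at"])
    out["occurred_at"] = moment.isoformat(timespec="milliseconds")
    return out


def from_canonical(canonical: dict) -> dict:
    """Invert to_canonical, showing the cleanup step is reversible."""
    out = {REVERSE_MAP.get(key, key): value for key, value in canonical.items()}
    moment = datetime.fromisoformat(out["ts_ms"])
    out["ts_ms"] = (moment - EPOCH) // timedelta(milliseconds=1)
    return out
```

Unknown fields pass through untouched, so the publisher can add data without breaking the round trip.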
- tl;dr: The log is the source of truth, build off that. Even consensus algorithms lean on the notion of a log to work.
From: https://content.pivotal.io/blog/messaging-patterns-for-event-driven-microservices
When to use Kafka, contrasting Kafka with RabbitMQ
https://content.pivotal.io/blog/understanding-when-to-use-rabbitmq-or-apache-kafka
Deduplicating messages
- To deduplicate: https://segment.com/blog/exactly-once-delivery/
- Or never allow duplicate data to enter at all (e.g. put a 'have I seen this ID before?' check in front of Kafka): https://dzone.com/articles/interpreting-kafkas-exactly-once-semantics
- https://jack-vanlightly.com/blog/2018/10/25/testing-producer-deduplication-in-apache-kafka-and-apache-pulsar
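A rough sketch of the 'have I seen this ID before?' check placed in front of the producer. `SeenIdGate` and the `publish` callable are hypothetical names, and the in-memory, bounded set of IDs is an assumption made to keep the example small; a real deployment would likely need a durable store shared across processes.

```python
from collections import OrderedDict
from typing import Callable


class SeenIdGate:
    """Forward a message only the first time its ID is seen (bounded memory)."""

    def __init__(self, publish: Callable[[str, bytes], None], max_ids: int = 100_000):
        self._publish = publish          # hypothetical hook, e.g. wraps a Kafka producer
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._max_ids = max_ids

    def offer(self, message_id: str, payload: bytes) -> bool:
        """Return True if the message was published, False if it was a duplicate."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)   # keep recently seen IDs longest
            return False
        self._publish(message_id, payload)
        self._seen[message_id] = None
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)       # evict the least recently seen ID
        return True
```

Because old IDs are evicted, a duplicate that arrives after its ID has left the window gets through; the window size is the trade-off between memory and how late a duplicate can be caught.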
Re-ordering messages within topics
You can't, but you can put them somewhere else (extract, load) and order them there, e.g.: https://developer.ibm.com/hadoop/2018/07/01/use-redis-as-a-cache-mechanism-to-handle-out-of-order-kafka-messages/
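A small sketch of the "put them somewhere else and order them" idea, assuming each message carries a gap-free integer sequence number (an assumption, not something Kafka provides). The `pending` dict stands in for an external cache such as Redis, and `reorder` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Iterator, Tuple


def reorder(messages: Iterable[Tuple[int, str]], first_seq: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (seq, payload) pairs in sequence order, holding back early arrivals."""
    pending: Dict[int, str] = {}   # stand-in for an external cache such as Redis
    next_seq = first_seq
    for seq, payload in messages:
        if seq < next_seq:
            continue               # already emitted: treat as a late duplicate
        pending[seq] = payload
        while next_seq in pending:
            # Release the contiguous run starting at the next expected number.
            yield next_seq, pending.pop(next_seq)
            next_seq += 1
```

A missing sequence number holds back everything after it, so in practice this would probably need a timeout or a gap-skipping policy.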
Idempotent producers
Kafka + Python:
With the right settings, you can enable idempotence at the producer (PID) layer: e.g. using aiokafka, set enable_idempotence=True when configuring the producer.
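A sketch of that aiokafka setup, assuming a broker at `localhost:9092`. The helper names `producer_settings` and `send_once` are made up for illustration, while `AIOKafkaProducer`, `enable_idempotence`, `start`, `send_and_wait` and `stop` come from aiokafka itself.

```python
def producer_settings(bootstrap_servers: str = "localhost:9092") -> dict:
    """Keyword arguments for AIOKafkaProducer with idempotence switched on."""
    return {
        "bootstrap_servers": bootstrap_servers,  # assumed broker address
        "enable_idempotence": True,
        "acks": "all",  # idempotence requires acks="all"; stated explicitly here
    }


async def send_once(topic: str, value: bytes) -> None:
    """Start a producer, send one message and wait for the broker's ack."""
    # Imported lazily so the settings helper works without aiokafka installed.
    from aiokafka import AIOKafkaProducer

    producer = AIOKafkaProducer(**producer_settings())
    await producer.start()
    try:
        await producer.send_and_wait(topic, value)
    finally:
        await producer.stop()  # always release connections
```

It could be run with something like `asyncio.run(send_once("example-topic", b"hello"))`. Note that this only guards against duplicates caused by producer retries; duplicates created upstream by the application still need one of the deduplication approaches above.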
-- misc