What is Kafka anyway? History of Kafka
Many new products and analyses just came from putting together multiple pieces of data that had previously been locked up in specialized systems.
This (Kafka 'log') architecture also raises a set of different options for where a particular cleanup or transformation can reside:
1. It can be done by the data producer prior to adding the data to the company-wide log.
2. It can be done as a real-time transformation on the log (which in turn produces a new, transformed log).
3. It can be done as part of the load process into some destination data system.
The best model is to have cleanup done prior to publishing the data to the log by the publisher of the data. This means ensuring the data is in a canonical form and doesn't retain any hold-overs from the particular code that produced it or the storage system in which it may have been maintained. These details are best handled by the team that creates the data since they know the most about their own data. Any logic applied in this stage should be lossless and reversible.
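A minimal sketch of option 1, cleanup by the producer before publishing. The legacy field names (`usr_id`, `ts`) and the canonical ones (`user_id`, `event_time`) are hypothetical and only illustrate what "lossless and reversible" could look like: every step has an exact inverse, so nothing is lost on the way into the log.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical mapping from a legacy storage schema to canonical field names.
FIELD_RENAMES = {"usr_id": "user_id", "ts": "event_time"}
INVERSE_RENAMES = {new: old for old, new in FIELD_RENAMES.items()}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MS = timedelta(milliseconds=1)


def to_canonical(raw: dict) -> dict:
    """Rename legacy fields and express epoch milliseconds as ISO 8601 UTC."""
    record = {FIELD_RENAMES.get(key, key): value for key, value in raw.items()}
    # Integer arithmetic on timedeltas keeps millisecond precision exactly.
    moment = EPOCH + record["event_time"] * ONE_MS
    record["event_time"] = moment.isoformat(timespec="milliseconds")
    return record


def from_canonical(record: dict) -> dict:
    """Undo to_canonical, recovering the producer's original representation."""
    raw = {INVERSE_RENAMES.get(key, key): value for key, value in record.items()}
    moment = datetime.fromisoformat(raw["ts"])
    raw["ts"] = (moment - EPOCH) // ONE_MS
    return raw
```

The round trip `from_canonical(to_canonical(raw)) == raw` is the property worth testing: if it holds, the cleanup is reversible.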
- TL;DR: the log is the source of truth; build off that. Even consensus algorithms lean on the notion of a log to work.
When to use Kafka, contrasting Kafka with RabbitMQ
- To deduplicate (https://segment.com/blog/exactly-once-delivery/)
- Or never allow duplicate data to enter at all (e.g. put a 'have I seen this ID before?' check in front of Kafka)
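A rough sketch of the 'have I seen this ID before?' gate. Everything here is an assumption for illustration: `SeenIds`, `publish_once`, and the `send` callable are hypothetical names, and the store is an in-memory, single-process one. Several producers would need a shared store instead.

```python
from collections import OrderedDict


class SeenIds:
    """Bounded, in-memory record of message IDs that were already published."""

    def __init__(self, max_size=100_000):
        self._ids = OrderedDict()
        self._max_size = max_size

    def first_time(self, message_id):
        """Return True and remember the ID if unseen; otherwise return False."""
        if message_id in self._ids:
            self._ids.move_to_end(message_id)  # keep recently seen IDs longest
            return False
        self._ids[message_id] = None
        if len(self._ids) > self._max_size:
            self._ids.popitem(last=False)  # evict the least recently seen ID
        return True


def publish_once(seen, message_id, payload, send):
    """Call send(payload) only when message_id has not been seen before."""
    if not seen.first_time(message_id):
        return False
    send(payload)
    return True
```

The bound on `max_size` is the trade-off: a duplicate that arrives after its ID has been evicted will get through, so the window has to cover the longest plausible retry delay.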
Re-ordering messages within topics
You can't, but you can put them somewhere else (extract, load) and order them there, e.g.: https://developer.ibm.com/hadoop/2018/07/01/use-redis-as-a-cache-mechanism-to-handle-out-of-order-kafka-messages/
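To make the 'put them somewhere else and order them' idea concrete, here is a small in-memory sketch. It is not the Redis approach from the linked article. It assumes each message carries a contiguous integer sequence number, and `ReorderBuffer` is a hypothetical name.

```python
import heapq


class ReorderBuffer:
    """Hold out-of-order messages and release them in sequence order."""

    def __init__(self, first_seq=0):
        self._next_seq = first_seq
        self._pending = []  # min-heap of (seq, payload)

    def push(self, seq, payload):
        """Buffer one message and return every payload that is now in order."""
        if seq >= self._next_seq:
            heapq.heappush(self._pending, (seq, payload))
        # Messages older than _next_seq were already released, so they are ignored.
        ready = []
        while self._pending and self._pending[0][0] <= self._next_seq:
            head_seq, head_payload = heapq.heappop(self._pending)
            if head_seq == self._next_seq:
                ready.append(head_payload)
                self._next_seq += 1
            # Otherwise it is a buffered duplicate of a released message: drop it.
        return ready
```

A consumer would call `push` for each message it reads and forward whatever comes back. A gap that never fills blocks everything behind it, so a real version would also need a timeout or skip policy.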
Kafka + Python:
With the right settings, you can enable idempotence at the producer/PID layer, e.g. using aiokafka, when configuring the producer set
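The note above is cut off, so the following is only a guess at the setting it had in mind: `AIOKafkaProducer` accepts an `enable_idempotence` argument, and idempotence requires `acks="all"`. The helper names, topic, and broker address are placeholders.

```python
def idempotent_producer_settings(bootstrap_servers):
    """Keyword arguments for AIOKafkaProducer with idempotence switched on."""
    return {
        "bootstrap_servers": bootstrap_servers,
        "enable_idempotence": True,
        "acks": "all",  # idempotent producers must wait for all in-sync replicas
    }


async def send_one(topic, value, bootstrap_servers="localhost:9092"):
    """Send a single message with an idempotent producer, then shut it down."""
    # Imported here so the settings helper works without aiokafka installed.
    from aiokafka import AIOKafkaProducer

    producer = AIOKafkaProducer(**idempotent_producer_settings(bootstrap_servers))
    await producer.start()
    try:
        await producer.send_and_wait(topic, value)
    finally:
        await producer.stop()
```

With a broker running, `asyncio.run(send_one("example-topic", b"hello"))` would send one message. Idempotence here only covers retries made by the same producer session. It does not deduplicate messages the application itself sends twice.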