Change data capture


MOUKHLISS Amine

06 September, 2020 · 4min 📖

Change data capture, or CDC for short, is a design pattern that lets us track data that has changed. To make this concrete, we are going to explore two implementations of CDC:

Fake CDC with Kafka Connect:

Fake CDC uses periodic queries (also known as polling) to extract data from the database. We chose Kafka to implement it because it already offers a ready-made JDBC connector (to read and load data from our database), and because Kafka Streams and KSQL let us process the loaded data and query it with SQL-like statements.
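The polling loop behind this approach can be sketched in a few lines of Python. This is only an illustration, not the connector's actual code: the `orders` table, its columns and the `poll_once` helper are assumptions made for the example, and an in-memory SQLite database stands in for the real RDBMS.

```python
import sqlite3

# In-memory SQLite stands in for the source database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")

def poll_once(conn, last_seen_id):
    """Run one periodic query: fetch rows whose id is above the stored offset."""
    rows = conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id", (last_seen_id,)
    ).fetchall()
    # The new offset is the highest id returned, or the old one if nothing came back.
    new_offset = rows[-1][0] if rows else last_seen_id
    return rows, new_offset

offset = 0
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "book"), (2, "pen")])
first_batch, offset = poll_once(conn, offset)

conn.execute("INSERT INTO orders VALUES (?, ?)", (3, "lamp"))
second_batch, offset = poll_once(conn, offset)

print(first_batch)
print(second_batch)
```

Each call to `poll_once` is a full query against the database, whether or not anything changed since the previous call.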

Figure 1: Change data capture using polling/periodic queries, Kafka Connect

Real CDC:

Oracle GoldenGate (licensed) and Debezium (free) use the RDBMS logs to capture data changes. Each RDBMS has a transaction log that records every action executed by the database management system, and this log can be read through dedicated tools (Debezium, OGG…).
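To make the idea concrete, here is a minimal Python sketch of a consumer replaying change events read from a transaction log. The event shape (an `op` code with `before`/`after` row images) is loosely modelled on the envelope Debezium emits and is heavily simplified; the `apply_change` helper and the sample rows are hypothetical.

```python
# Simplified change events, as a log reader might hand them to a consumer.
# "c" = create, "u" = update, "d" = delete (codes borrowed from Debezium's
# envelope; the rest of the structure is an assumption made for this sketch).
log = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "NEW"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "NEW"}},
    {"op": "u", "before": {"id": 1, "status": "NEW"}, "after": {"id": 1, "status": "PAID"}},
    {"op": "d", "before": {"id": 2, "status": "NEW"}, "after": None},
]

def apply_change(replica, event):
    """Apply one change event to a dict keyed by primary key."""
    if event["op"] in ("c", "u"):
        row = event["after"]
        replica[row["id"]] = row
    elif event["op"] == "d":
        replica.pop(event["before"]["id"], None)
    return replica

replica = {}
for event in log:
    apply_change(replica, event)

print(replica)
```

Because every action appears in the log, the consumer sees the update and the delete without any extra column on the source table.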

Figure 2: Change data capture using the RDBMS log trail, Oracle GoldenGate

In depth analysis:

Our case study measures both solutions against the following criteria:
• Reliability & idempotency
• Requirements
• Performance
• Complexity
• Pricing
• Community

Reliability & idempotency:

JDBC connector:

Let's say we have two instances that want to persist some data: the first one holds a range of ids from 1 to 10 and the second one a range from 11 to 20. If the second instance persists its data first, Kafka loads those rows and sends them to the designated topics, and the connector now remembers 20 as the highest id it has seen. When the first instance then persists its data, every id is below that mark, so Kafka does not load it. That is a huge drawback for using the connector as a reliable source of events.
Real CDC: with log-based CDC no such data loss occurs, since the mechanism doesn't have to keep track of incrementing ids.
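The scenario above can be reproduced with a small Python simulation. It is a sketch under stated assumptions: SQLite replaces the real database, the `events` table and the `poll` helper are invented for the example, and the query only imitates the behaviour of an incrementing-id poller rather than reusing any connector code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, source TEXT)")

def poll(conn, offset):
    """Incrementing-mode style query: only rows with an id above the offset."""
    rows = conn.execute(
        "SELECT id, source FROM events WHERE id > ? ORDER BY id", (offset,)
    ).fetchall()
    return rows, (rows[-1][0] if rows else offset)

offset = 0
captured = []

# Instance B (ids 11..20) persists first; the poller picks the rows up.
conn.executemany("INSERT INTO events VALUES (?, 'B')", [(i,) for i in range(11, 21)])
rows, offset = poll(conn, offset)
captured += rows

# Instance A (ids 1..10) persists later; every id is below the offset of 20.
conn.executemany("INSERT INTO events VALUES (?, 'A')", [(i,) for i in range(1, 11)])
rows, offset = poll(conn, offset)
captured += rows

total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(f"rows in table: {total}, rows captured: {len(captured)}")
```

The table ends up with 20 rows while only the 10 rows from instance B were ever captured; the rows from instance A are silently skipped.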

Requirements:

Incrementing and/or timestamp-based extraction is quite limited, since it assumes you are dealing with an RDBMS schema designed along modern patterns. Some legacy systems weren't conceived with that in mind, and have neither a version column to detect updated rows nor incrementing technical ids.
Oracle GoldenGate imposes no such requirements on the schema when adding CDC.

Performance issue:

As stated, the JDBC connector uses periodic JDBC queries to load data from the database into Kafka topics, thus putting the database under stress.
Real CDC solutions, on the other hand, go through a back door (the database log), which avoids putting the database under that stress.

Complexity:

The Kafka JDBC connector is pretty straightforward to use (especially with the Confluent Control Center): no code is needed, you just have to point it at the database and give it a template for the topics, and it's done.
OGG or Debezium needs access to read the RDBMS logs, and a specific configuration to work.
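For a sense of how little is involved on the JDBC side, a source connector definition might look like the following, built here as a Python dictionary and serialised to JSON of the kind the Kafka Connect REST API accepts. The connector name, connection URL, table and column names and topic prefix are placeholders invented for this example, credentials are left out, and the property names should be checked against the connector version you run.

```python
import json

connector = {
    "name": "orders-jdbc-source",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://localhost:5432/shop",  # placeholder URL
        "table.whitelist": "orders",          # placeholder table
        "mode": "incrementing",               # poll on a growing id column
        "incrementing.column.name": "id",     # placeholder column
        "poll.interval.ms": "5000",           # how often the query runs
        "topic.prefix": "shop-",              # topics are named prefix + table
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

The `mode` and `poll.interval.ms` entries are where the polling behaviour discussed earlier is decided.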

Pricing:

The Kafka JDBC connector is free. On the other hand, RDBMS vendors did not judge that exposing their logs should be a native feature, so most log-based solutions don't come free.
Debezium is a free, open source solution.

Community:

The Kafka connector is backed by Confluent, has great support from the community and is improving day by day. Oracle GoldenGate is supported by Oracle and is feature complete. Debezium is the open source community's favourite.

Aggregation and data transformation:

Data extracted from the database is just raw events, which makes it difficult to manage data according to business needs. Most of the time the data will not be used as it is: ideally only aggregate events should be sent, but here the end user has to handle raw events, so extended logic and a lot of processing are to be expected on the consumer end. This is where Kafka Streams and KSQL come in handy, adding the possibility to aggregate and transform data.

Kafka Streams:

The data loaded by the previous step (whether by fake or real CDC) is streamed into Kafka.
One can then choose to use the streamed data either as a table or as a stream, each having its particularities:

Table:

A table represents a snapshot of the stream at a given point in time, and therefore it is bounded.

Stream:

A stream is an unbounded sequence of events: any new data that comes in gets appended to the current stream and does not modify any of the existing records, making it completely immutable.
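The difference between the two views can be illustrated in plain Python, without any Kafka dependency. This is a conceptual sketch only: the `append` and `as_table` helpers and the order keys are made up for the example and do not correspond to Kafka Streams API calls.

```python
# A stream: an append-only sequence of (key, value) events; nothing is rewritten.
stream = []

def append(stream, key, value):
    """Add a new event at the end of the stream; earlier events stay untouched."""
    stream.append((key, value))

def as_table(stream):
    """Fold the stream into a table: the latest value seen for each key."""
    table = {}
    for key, value in stream:
        table[key] = value
    return table

append(stream, "order-1", "NEW")
append(stream, "order-2", "NEW")
append(stream, "order-1", "PAID")

table = as_table(stream)
print(len(stream))
print(table)
```

The stream keeps all three events, while the table taken at this point holds one row per key with its latest value.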

KSQL:

One can use KSQL, a language for querying streams of data, but it is limited and not as powerful as SQL can be. It is still in its early days, whereas SQL has been built up over more than 40 years to get this strong.

Conclusion:

Kafka Connect/Streams/KSQL is a good match, but not a product we advise you to use to capture data changes (for the time being), and that has to do with the way the solution is implemented: fake CDC will just put your RDBMS under heavy load. Real CDC, on the other hand, uses an elegant solution to begin with: reading from the RDBMS logs is a precise and clever way to get data without putting your database under stress.

Miscellaneous:

For the sake of this post we only illustrated one way to expose your data (via Kafka topics and subscriptions), but of course you can use any protocol available out there that meets your requirements.


Tags:

Back End

Kafka

CDC
