Mixing analytics and operational use: MS SQL Server, Debezium, MSK, Kafka Connect, MongoDB, real-time syncing

AWS data lake real-time sync with on-premise

Motivation for the architecture choices

  • Amazon MSK: the fully managed aspect, ease of setup and configuration, and deployment inside your VPC
  • Kafka: proven technology, with a broad ecosystem including connectors
  • Debezium: true CDC for all major database technologies
  • MongoDB: deployed on EC2 for performance, and to have all data types available

Context

The business-critical application generates large volumes of data, but like many such applications it is rather protective of its database, in particular of how it does its CRUD operations. Often there is simply no way to determine which records have been inserted, updated or deleted. So if you want to replicate the data, you are more or less forced to take full dumps.

This approach is valid for data systems that only need to reflect yesterday's data, but it reaches its limits when we talk about large(r) volumes. A clean and structural way to identify newly inserted, updated and deleted records is change data capture (CDC).

The solution is deployed in a context of both analytics and operational use. With that in mind, one would immediately think of object storage and AWS S3 as the destination downstream of Kafka. This would be valid if we wanted to keep the full history of updates a single record has undergone.

Our objective is to have a view in the cloud that reflects the current state of the on-premise application with about a second of delay, and without the full history of updates a particular record has gone through. In other words, the output of Kafka needs to be upserted per table key.

To handle upserts on object storage we could also use Apache Hudi or Delta Lake and build a lakehouse. The lakehouse concept is certainly valid, and my bet is that the future will evolve in this direction. But today it would complicate the picture and require additional skill sets, all for our current requirement of "having the current state of the data replicated in real time, and being bulletproof".

Change data capture

In order to guarantee we don't introduce any additional load on the primary database, we rely on the database technology's capability to sync the primary database to a secondary (read) replica in a very non-intrusive way.
The CDC starts from the replica, so we really don't touch the operational database. We activate CDC on the replica for the tables in scope, and for each table we create a new composed field, pk_, that is a concatenation of the table's primary key columns. This pk_ field is used by the MongoDB Kafka Connect sink config and MongoDB to identify which record needs to be inserted or updated.

CDC is provided by the database technology, and each database has its own specifics and flavours in how CDC is realized. Debezium handles that for the major databases: MySQL, MongoDB, PostgreSQL, Oracle, SQL Server, Db2, Cassandra and Vitess.
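
To make this concrete, here is a sketch of what the payload of a Debezium change event could look like for an insert into the demo.dbo.ORDERS table used later in the demo (schema part omitted, source block abridged). The column names, values and the composed pk_ field are illustrative, not taken from the real schema:

    {
      "before": null,
      "after": {
        "ORDER_ID": 1001,
        "CUSTOMER_ID": 42,
        "STATUS": "SHIPPED",
        "pk_": "1001"
      },
      "source": {
        "connector": "sqlserver",
        "db": "demo",
        "schema": "dbo",
        "table": "ORDERS"
      },
      "op": "c",
      "ts_ms": 1633080000000
    }

The before and after images, combined with the op field (c for create, u for update, d for delete, r for snapshot read), are what allow a downstream sink to reproduce the current state of the source table.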

Apache Kafka

Apache Kafka is a community distributed event streaming platform capable of handling large volumes of data.
Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from messaging queue to a full-fledged event streaming platform.

Kafka connect

Kafka Connect is one of the four APIs provided by Kafka; the Connect API is particularly well suited to keeping a source database in sync with …

The Connect portfolio consists of many types of connectors and makes it very easy to push/pull data to/from Apache Kafka. A source connector takes data from a source and pushes it to Kafka; a sink connector takes data from Kafka and pushes it to a sink.

  • A source connector can be: SFTP, JDBC, Debezium CDC, Splunk, Redis, SAP (always check the licence terms before doing so), …
  • A sink connector can be: Salesforce, SFTP, HTTP, …

Connectors are offered under different licensing models: open source, community, commercial, premium and Confluent-verified partner offerings.

Once Kafka Connect is running, standalone or as a group of workers for HA, we pass the configuration via the Connect REST interface.
Each connector config of course has its own specifics, depending on the source or sink we want to talk to.

An example of the MS SQL source connector config:
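
What follows is a minimal sketch rather than the exact config from the project, assuming a database called demo, the table dbo.ORDERS, Debezium 1.x property names, and placeholder hosts and credentials:

    {
      "name": "demo-orders-source",
      "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "tasks.max": "1",
        "database.hostname": "sqlserver-replica.internal",
        "database.port": "1433",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "demo",
        "database.server.name": "demo",
        "table.include.list": "dbo.ORDERS",
        "snapshot.mode": "initial",
        "database.history.kafka.bootstrap.servers": "<msk-bootstrap-servers>",
        "database.history.kafka.topic": "schema-changes.demo"
      }
    }

With database.server.name set to demo, the change events for dbo.ORDERS land on the topic demo.dbo.ORDERS.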


Posting this config to Kafka Connect starts the synchronisation, beginning with the 8 steps of the snapshotting process. A log of the snapshot of demo.dbo.ORDERS is illustrated below.

An example of the MongoDB sink connector config:

The MongoDB connector supports multiple write model strategies, which define how records are written to MongoDB. We opted for the Inserted and Modified Timestamps strategy.

Further, we configured a transform: an unwrap of type ExtractNewRecordState, so that we only keep the after image of the CDC event.
The result is a data structure in MongoDB on a single level that corresponds exactly to what we have in the Microsoft SQL database, with 2 additional timestamps: _insertedTS: ISODate("2021-10-01T09:38:14.659Z") and _modifiedTS: ISODate("2021-10-01T10:38:14.659Z").
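
Again as a sketch rather than the exact config from the project, a MongoDB sink connector combining the UpdateOneTimestampsStrategy write model (which adds the _insertedTS and _modifiedTS fields), the composed pk_ field as document id, and the Debezium ExtractNewRecordState unwrap transform could look as follows; topic, connection string, database and collection names are placeholders:

    {
      "name": "demo-orders-mongodb-sink",
      "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "tasks.max": "1",
        "topics": "demo.dbo.ORDERS",
        "connection.uri": "mongodb://<mongodb-host>:27017",
        "database": "demo",
        "collection": "ORDERS",
        "document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
        "document.id.strategy.partial.value.projection.type": "AllowList",
        "document.id.strategy.partial.value.projection.list": "pk_",
        "writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
      }
    }

Note that ExtractNewRecordState is a Debezium SMT, so the Debezium jars also need to be available on the plugin path of the Connect workers running this sink.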

Kafka Connect is definitely an asset in the Kafka ecosystem, and AWS has meanwhile extended its MSK offering with a managed Connect offering (MSK Connect).
This certainly facilitates all the operational aspects of running Kafka Connect.

MongoDB

The first thought when you have chosen AWS as your cloud provider is probably: why not use DocumentDB, AWS's fully managed, MongoDB-compatible database?
DocumentDB was indeed the first choice, but we very soon ran into two shortcomings:

  • the MongoDB decimal data type is missing, and it is used all over the place in our source DB. Yes, you can add an SMT in Connect and cast to another data type, knowing that afterwards every consumer will need to cast each field back again
  • performance: ingestion from Kafka Connect into DocumentDB is much, much slower (± 50 inserts per second) than into MongoDB on EC2 (± 8000 inserts per second), compared on equal ground

For the time being we run MongoDB on EC2 under Docker, and will either do what's needed for operations ourselves or move to the MongoDB Atlas offering.

Get your playground

A complete demo recording of the case can be found here.
The repo is on GitHub.

Please get in touch if you would like to know more about this topic!

Have fun!

Thierry
