Change data capture :
It’s the process of tracking the data that has changed so that corresponding action can be taken. Its nothing but capturing the changes which are made on the data source and applying it at all the other places, and hence it reduces the efforts required at the ETL step.
Different strategies :
- Push: the source process creates a snapshot of changes within its own process and delivers rows downstream. The downstream process uses the snapshot, creates its own subset and delivers them to the next process.
- Pull: the target that is immediately downstream from the source, prepares a request for data from the source. The downstream target delivers the snapshot to the next target, as in the push model.
Usage of Change Data Capture:
Change data capture is often used in data warehousing since it involves extraction and transformation of data from one or more databases to data warehouse for analysis.
Prior to CDC, the change data was captured using other methodologies i.e. table difference or table comparison. Its like taking a copy of the source and comparing with target, spotting the difference. For limited number of records this approach might work, but for more number of records this methodology will be highly inefficient and result in immense overheads.
Table differencing: It involved copying entire table from source to staging. But since, the entire table was shifted, there were additional overheads and this method was not efficient. Also, in case if there are multiple changes in the data, it cant be captured using this methodology.
Change value selection: This involves capturing source database changed values based on the value of a specific colum. However for implementing this we need to fire queries to find out the changed data which is an additional overhead. Again, in this method also we will not be able to find or capture the intermediate values. Also the changes happening between firing of two queries will not be captured.
A slightly better approach than the ones mentioned above is by using database triggers. This triggers can be used after any Insert/Delete/Modify statement whenever there is any change, the changed information will be passed on to a table. The only disadvantage of this solution is it is database specific and migration of such a solution becomes complex since its db specific.
Another approach would be the usage of log readers. Many Enterprise DBMS have capability of transaction logs that can be used (can be switched on and off), the amount & details of information in log to be stored can be decided as well. By reading this we can come to know of the changes happening in the system. The problem with this is purchasing of new component for log reading, migration might be an issue, also some DBs might not have the logging features etc. Also sometimes some changes which are not committed, their logs are also created which could be as well an issue.
Capturing data with CDC:
CDC can capture data using two modes :: Synchronous and Asynchronous. In synchronous mode, any changed data in source system is informed immediately to the target using DML (data manipulation language) operations, whereas that’s not the case with Asynchronous mode. Synchronous uses triggers and it results in no latency or delay, though it has more overheads as compared to asynchronous.
Benefits of CDC
– CDC can improve performance and reduce redundancy by copying only the changed data.
– Log based CDC also reduces load on operational systems since it only reads the logs rather than db
– No latency, hence end business users can have immediate realization of the same in their BI
– CDC can easily handle high volume transactions
We can implement CDC using Talend Integration Suite. More information about the same will be covered in another blog.
For any query, please reach out to me on [email protected]