flexview-cdc

Flexview CDC Implementation

This CDC utility identifies row changes in the database, and takes few action on them like writing them to a log file table or file. In order to update views incrementally, Flexviews needs to have detailed information of each row where change is made in the database. FlexCDC reads MySql binary logs, which tracks those row changes .

1. FlexCDC triggers MySqlbinlog in an external procedure with the following commandline options ‘–base64-output=decode-rows -v’. This tells the utility to transform RBR base64 entries into readable SBR notation.

2. Then MySqlbinlog connects to database-MySql and queries for binary logs. The output from MySqlbinlog is stored by FlexCDC and proceeds.

3. FlexCDC gathers the changes into log tables rather than applying the actual changes. FlexCDC assigns a unique sequentially increasing transaction id to every set of changes made. FlexCDC inserts every set of changes into one table change logs or more than one table change logs (one per changed table) as a single transaction.

image

Figure 2.5 Flexview CDC Flowchart


Note: FlexCDC do not replace MySql replication. However there is an experimental class called FlexSBR which involves support for replication of SBR binary logs. It is mainly a proof-of-concept. It does not work with RBR.

Types of CDC supported by Talend Enterprise Edition

Types of CDC supported by Talend Enterprise Edition :

 

CDC (Change Data Capture ) :

The CDC architecture is based on the publisher/subscriber model.

The publisher captures the change data and makes it available to the subscribers. The subscribers utilize the change data obtained from the publisher.

The main tasks performed by the publisher are:

-Identifying the source tables from which the change data needs to be captured.

-capturing the change data and storing it in specially created change tables.

Following are the two different types of CDC modes available in Talend Enterprise Edition.

1.Trigger mode
2.Redo/Archive log Mode

Trigger mode :

In this mode we need to setup the CDC feature in Talend Studio so that it will create required triggers on the selected database.

Following are the two major steps required to establish the CDC environment on Talend Studio:

1.Configure CDC in Trigger mode

2.Extract change data modified in Trigger mode

Supported databases for CDC under Trigger mode :

CDC feature available only for DB2, Informix, Ingres, MS SQL Server, MySQL, PostgreSQL ,Oracle and Sybase.

Talend provides different CDC component for each supported database.
Some of them are list below :

tMySQLCDC, tMSSQLCDC,tDB2CDC,tInformixCDC,tOracleCDC,tPostgreSQLCDC and SybaseCDC

Redo/Archive log mode :

The Oracle Redo/Archive log mode used in Talend is equivalent to the archive log mode of Oracle.

The CDC feature in this mode for Oracle is available for the 10g version of the Oracle database and later versions. Before being able to use CDC in Redo/Archive log mode in Talend Studio, the administrator of the database to be supervised should do the following:

1.Activate the archive log mode in the Oracle database.

2.Set up CDC in the Oracle database.

3.Create and give all rights to the source user.

4.Create and give all rights to the publisher.

However,if you want to use CDC in Redo/Archive log mode for an Oracle, you must first of all configure the database so that it generates the redo records that hold all insert, update or delete changes made in datafiles.

Supported databases for CDC under Redo/Archive log mode :

CDC feature available only for Oracle and AS400.

Talend components for this mode are tAS400CDC and tOracleCDC

 

Change Data Capture (CDC) – Methods, Approaches, Alternatives

Change data capture :

It’s the process of tracking the data that has changed so that corresponding action can be taken. Its nothing but capturing the changes which are made on the data source and applying it at all the other places, and hence it reduces the efforts required at the ETL step.

 

Different strategies :

  • Push: the source process creates a snapshot of changes within its own process and delivers rows downstream. The downstream process uses the snapshot, creates its own subset and delivers them to the next process.
  • Pull: the target that is immediately downstream from the source, prepares a request for data from the source. The downstream target delivers the snapshot to the next target, as in the push model.

 

Usage of Change Data Capture:

Change data capture is often used in data warehousing since it involves extraction and transformation of data from one or more databases to data warehouse for analysis.

Prior to CDC, the change data was captured using other methodologies i.e. table difference or table comparison. Its like taking a copy of the source and comparing with target, spotting the difference. For limited number of records this approach might work, but for more number of records this methodology will be highly inefficient and result in immense overheads.

 

Table differencing: It involved copying entire table from source to staging. But since, the entire table was shifted, there were additional overheads and this method was not efficient. Also, in case if there are multiple changes in the data, it cant be captured using this methodology.

 

Change value selection: This involves capturing source database changed values based on the value of a specific colum. However for implementing this we need to fire queries to find out the changed data which is an additional overhead. Again, in this method also we will not be able to find or capture the intermediate values. Also the changes happening between firing of two queries will not be captured.

 

A slightly better approach than the ones mentioned above is by using database triggers. This triggers can be used after any Insert/Delete/Modify statement whenever there is any change, the changed information will be passed on to a table. The only disadvantage of this solution is it is database specific and migration of such a solution becomes complex since its db specific.

 

Another approach would be the usage of log readers. Many Enterprise DBMS have capability of transaction logs that can be used (can be switched on and off), the amount & details of information in log to be stored can be decided as well. By reading this we can come to know of the changes happening in the system. The problem with this is purchasing of new component for log reading, migration might be an issue, also some DBs might not have the logging features etc. Also sometimes some changes which are not committed, their logs are also created which could be as well an issue.

 

Capturing data with CDC:

CDC can capture data using two modes :: Synchronous and Asynchronous. In synchronous mode, any changed data in source system is informed immediately to the target using DML (data manipulation language) operations, whereas that’s not the case with Asynchronous mode. Synchronous uses triggers and it results in no latency or delay, though it has more overheads as compared to asynchronous.  
Benefits of CDC

–          CDC can improve performance and reduce redundancy by copying only the changed data.

–          Log based CDC also reduces load on operational systems since it only reads the logs rather than db

–          No latency, hence end business users can have immediate realization of the same in their BI

–          CDC can easily handle high volume transactions

 

We can implement CDC using Talend Integration Suite. More information about the same will be covered in another blog.

 

For any query, please reach out to me on [email protected]