Cassandra Installation and Cluster Setup on Ubuntu

Cassandra Installation and Cluster Setup

Cassandra Introduction

Apache Cassandra is a free and open-source distributed NoSQL database management system. It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Cassandra prerequisites

Minimum hardware requirements:

CPU: 2 cores


Software requirements:

Java 8

Installation and Cluster Setup Steps

To setup a cassandra cluster, we need to install cassandra on each server which will be the part of the cluster.

Here, we are setting up 4 node cassandra cluster.

Suppose we are having 4 nodes, namely,





Note: In order to set up a cluster, perform the following steps on each involved server.


  1. Install Java 8 on each server using following commands on the terminal. (Optional if already installed)
    1. To make the Open JDK package available, you’ll have to add a Personal Package Archives (PPA) using this command.
      sudo add-apt-repository ppa:openjdk-r/ppa
    2. Update the package database.
      sudo apt-get update
    3. Now install the Java 8.
      sudo apt-get install openjdk-8-jdk
  2. Install cassandra on each servers using the following list of commands on the terminal. Here we are installing Cassandra version 3.10.
    1. echo "deb 310x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list

      Note 310x means we are installing cassandra of version 3.10. You can change this value if you want to install different version of Cassandra.

    2. Add the Apache Cassandra repository keys using following command.
      curl | sudo apt-key add -
    3. Update the package database using following command.
      sudo apt-get update

      Note If you encounter any public key related error, such as “The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY A278B781FE4B2BDA”.

      Then add the public key A278B781FE4B2BDA as follows:

      sudo apt-key adv –keyserver –recv-key A278B781FE4B2BDA

      Now repeat ‘sudo apt-get update‘. The actual key may be different, you get it from the error message itself.

    4. Now install the Cassandra using following command.
      sudo apt-get install cassandra

      By now, you have installed cassandra on each involved server.
      Go through the following steps to setup cluster.

Cluster Setup Steps

  1. Stop cassandra service on each server (XXX.XX.XX.1,XXX.XX.XX.2,XXX.XX.XX.3,XXX.XX.XX.4) using following command.
    sudo service cassandra stop
  2. Delete default cassandra data from each server located in data directory of cassandra using following command.
    sudo rm -rf /var/lib/cassandra/data/system/*
  3. Now open the cassandra.yaml file located in /etc/cassandra directory on each server one after another to configure cluster.
    Note Following are the parameters from cassandra.yaml file that needs to be configured in order to setup cluster.
    cluster_name : You change the cluster name by setting this parameter. It is optional.
    -seeds : This is a comma-delimited list of the IP address of each node in the cluster.
    listen_address : This is IP address that other nodes in the cluster will use to connect to this one. It defaults to localhost and needs changed to the IP address of the node.
    rpc_address : This is the IP address for remote procedure calls. It defaults to localhost. If the server’s hostname is properly configured, leave this as is. Otherwise, change to server’s IP address or the loopback address (
    endpoint_snitch : Name of the snitch, which is what tells Cassandra about what its network looks like. This defaults to SimpleSnitch, which is used for networks in one datacenter. In our case, we’ll change it to GossipingPropertyFileSnitch, which is preferred for production setups.
    auto_bootstrap : This directive is not in the configuration file, so it has to be added and set to false. This makes new nodes automatically use the right data. It is optional if you’re adding nodes to an existing cluster, but required when you’re initializing a fresh cluster, that is, one with no data.

    1. Open cassandra.yaml on XXX.XX.XX.1 node.
      vi /etc/cassandra/cassandra.yaml

      Search and set the parameters as follows,

      cluster_name: ‘Helical’
       - seeds: "XXX.XX.XX.1,XXX.XX.XX.2,XXX.XX.XX.3,XXX.XX.XX.4"
       listen_address: XXX.XX.XX.1
       rpc_address: XXX.XX.XX.1
       endpoint_snitch: GossipingPropertyFileSnitch
       auto_bootstrap: true
    2. Open cassandra.yaml on XXX.XX.XX.2 node.
      vi /etc/cassandra/cassandra.yaml

      Search and set the parameters as follows,

      cluster_name: ‘Helical’
       - seeds: "XXX.XX.XX.2,XXX.XX.XX.1,XXX.XX.XX.3,XXX.XX.XX.4"
       listen_address: XXX.XX.XX.2
       rpc_address: XXX.XX.XX.2
       endpoint_snitch: GossipingPropertyFileSnitch
       auto_bootstrap: true
    3. Open cassandra.yaml on XXX.XX.XX.3 node.
      vi /etc/cassandra/cassandra.yaml

      Search and set the parameters as follows,

      cluster_name: ‘Helical’
       - seeds: "XXX.XX.XX.3,XXX.XX.XX.1,XXX.XX.XX.2,XXX.XX.XX.4"
       listen_address: XXX.XX.XX.3
       rpc_address: XXX.XX.XX.3
       endpoint_snitch: GossipingPropertyFileSnitch
       auto_bootstrap: true
    4. Open cassandra.yaml on XXX.XX.XX.4 node.
      vi /etc/cassandra/cassandra.yaml

      Search and set the parameters as follows,

      cluster_name: ‘Helical’
       - seeds: "XXX.XX.XX.4,XXX.XX.XX.1,XXX.XX.XX.2,XXX.XX.XX.3"
       listen_address: XXX.XX.XX.4
       rpc_address: XXX.XX.XX.4
       endpoint_snitch: GossipingPropertyFileSnitch
       auto_bootstrap: true
  4. Start the Cassandra daemon on each server on each server.
    sudo service cassandra start

    Note You can check the status of cassandra using following command.

    sudo service cassandra status
  5. Check the status of the cassandra cluster using nodetool utility command. Fire command on one node say XXX.XX.XX.1 as follows,
    sudo nodetool status

    You will get output like,

    Datacenter: DC1
    |/ State=Normal/Leaving/Joining/Moving
    -- Address     Load       Tokens Owns (effective) Host ID                              Rack
    UN XXX.XX.XX.1 30.94 GiB  256    100.0%           ae45c3c5-30e3-4d60-9e5c-3f0004bbfea4 rack1

    Note You’ll find that only the local node is listed, because it’s not yet able to communicate with the other nodes.

  6. Allow communication by opening the following network ports on each node using the following commands.
    sudo ufw allow 7000

    7000 is the TCP port for commands and data.

    sudo ufw allow 9042

    9042 is the TCP port for the native transport server. cqlsh, the Cassandra command line utility, will connect to the cluster through this port.
    Note You can change the port number as required by changing it in cassandra.yaml
    storage_port: 7000
    native_transport_port: 9042

  7. Now Check the status of the cassandra cluster again.
    sudo nodetool status

    Output would be like this,

    Datacenter: dc1
    |/ State=Normal/Leaving/Joining/Moving
    -- Address     Load       Tokens Owns (effective) Host ID                              Rack
    UN XXX.XX.XX.3 31.3 GiB   256    100.0%           20412cb7-b9bf-4373-bf6f-d0958a993c95 rack1
    UN XXX.XX.XX.4 30.78 GiB  256    100.0%           a49ee892-f816-4e0d-84b4-3b3bf1421f33 rack1
    UN XXX.XX.XX.1 30.94 GiB  256    100.0%           ae45c3c5-30e3-4d60-9e5c-3f0004bbfea4 rack1
    UN XXX.XX.XX.2 32.5 GiB   256    100.0%           f9084fd5-06f4-4408-ba90-32c9afd60ede rack1

You can see that all the nodes are up and normal.

In this way, you can install and setup cassandra cluster.


In this blog, we have discussed how can we install and setup the cassandra cluster.




Java Naming and Directory Interface (JNDI)

Using JNDI, applications running in distributed environment, helps to look up services independently.

How JNDI is useful?

  1.  During deployment sequence, the same JNDI name can be used in different servers.
  2.  It helps to provide security of credential leakage to the user groups like developer, testers, stagers etc, because the credentials are configured in the server not in the application.
  3.  Connection Pooling can be achieved

How to use JNDI?
In order to use, JNDI classes and service providers should be there
The service provider may be :

  1. Lightweight Directory Access Protocol (LDAP)
  2. Common Object Request Broker Architecture (CORBA)
  3. Common Object Services (COS) name service
  4. Java Remote Method Invocation (RMI) RegistryThe objects needs to be created and registered so that , they can be further used for lookup.

How to configure JNDI in tomcat?
In tomcat environment we can configure JNDI in 3 different places
1. Inside the application using context.xml in META-INF directory.
2. Inside the tomcat server context.xml : Inside the apache-tomcat/conf directory we can configure. This can be helpful when multiple application needs the same database.
3. Both in server.xml and context.xml

NB: The database driver jar should be present in the tomcat lib directory

<Resource name="jdbc/lookupName" 


In order to use it in java program we have to use the followng
DataSource ds = (DataSource) ctx.lookup(“java:/comp/env/jdbc/lookupName”);

Let’s Have Fun with Triggers in SQL

Let’s Have Fun with Triggers in SQL

What are Triggers?

Triggers are a Block of Structure that executes when a DML Statement is performed or Executed. Triggers are fun to use but from a business point of view not advisable but that a different story entirely ie the Pro’s and Con’s. but for gaining an idea about Triggers here is what the Video Covers

  1. Basic Creation of a Trigger
  2. and a simple real-time scenario on how i used triggers

The Video here illustrates or covers the above points.


While working with database using plain old JDBC, it becomes cumbersome to write unnecessary code to handle exceptions, opening and closing database connections etc. But Spring JDBC Framework takes care of all the low-level details starting from opening the connection, prepare and execute the SQL statement, process exceptions, handle transactions and finally close the connection.


Spring JDBC provides several approaches and correspondingly different classes to interface with the database

The Spring Framework’s JDBC abstraction framework consists of four different packages, namely








1.1 org.springframework.jdbc.core.simple

1.2 org.springframework.jdbc.core.namedparam

  1. Datasource



2.1 org.springfamework.jdbc.datasource.embedded





This package contains the JdbcTemplate class and its various callback interfaces, plus a variety of related classes.

It’s subpackage named org.springframework.jdbc.core.simple contains the SimpleJdbcInsert and SimpleJdbcCall classes. Another subpackage named org.springframework.jdbc.core.namedparam contains the NamedParameterJdbcTemplate class and the related support classes.


The org.springframework.jdbc.datasource package contains a utility class for easy DataSource access, and various simple DataSource implementations that can be used for testing and running unmodified JDBC code outside of a Java EE container. A subpackage named org.springfamework.jdbc.datasource.embedded provides support for creating in-memory database instances using Java database engines such as HSQL and H2.


Here, we are using the JdbcTemplate object in the EmployeeDao class, so we are passing it by the setter method but you can use constructor also.

<?xml version=”1.0″ encoding=”UTF-8″?>






<bean id=”ds” class=”org.springframework.jdbc.datasource.DriverManagerDataSource”>

<property name=”driverClassName” value=”oracle.jdbc.driver.OracleDriver” />

<property name=”url” value=”jdbc:oracle:thin:@localhost:1521:xe” />

<property name=”username” value=”system” />

<property name=”password” value=”oracle” />


<bean id=”jdbcTemplate” class=”org.springframework.jdbc.core.JdbcTemplate”>

<property name=”dataSource” ref=”ds”></property>


<bean id=”edao” class=”com.helical.hdidao”>

<property name=”jdbcTemplate” ref=”jdbcTemplate”></property>



DB Partitioning


Partitioning enhances the performance, manageability, and availability of a wide variety of applications and helps reduce the total cost of ownership for storing large amounts of data. Partitioning allows tables, indexes, and index-organized tables to be subdivided into smaller pieces, enabling these database objects to be managed and accessed at a finer level of granularity.

Partitioning is a rather general concept and can be applied in many contexts. When it considers the partitioning of relational data,

it usually refers to decomposing your tables either row-wise (horizontally) or column-wise (vertically).

–       Horizontal partition(Row-wise partitioning) :- It involves putting different rows into different tables. Perhaps customers with ZIP codes less than 50000 are stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest. The two partition tables are then CustomersEast and CustomersWest, while a viewwith a union might be created over both of them to provide a complete view of all customers.

–       Vertical partition(Column-wise partitioning) :- It involves creating tables with fewer columns and using additional tables to store the remaining columns. Normalization also involves this splitting of columns across tables, but vertical partitioning goes beyond that and partitions columns even when already normalized. Different physical storage might be used to realize vertical partitioning as well; storing infrequently used or very wide columns on a different device,

for example, is a method of vertical partitioning. Done explicitly or implicitly, this type of partitioning is called “row splitting” (the row is split by its columns). A common form of vertical partitioning is to split dynamic data (slow to find) from static data (fast to find) in a table where the dynamic data is not used as often as the static. Creating a view across the two newly created tables restores the original table with a performance penalty, however performance will increase when accessing the static data e.g. for statistical analysis.

Other Example:–

assume you have a table like this:
  create table data (    id integer primary key,     status char(1) not null,     data1 varchar2(10) not null,     data2 varchar2(10) not null);

One way to partition data vertically: Split it as follows:
create table data_main (    id integer primary key,    status char(1) not null,    data1 varchar2(10) not null ); create table data_rarely_used (    id integer primary key,    data2 varchar2(10) not null,    foreign key (id) references data_main (id) );

This kind of partitioning can be applied, for example, when you rarely need column data2 in your queries. Partition data_main will take less space, hence full table scans will be faster and it is more likely that it fits into the DBMS’ page cache. The downside: When you have to query all columns of data, you obivously have to join the tables, which will be more expensive that querying the original table.

Notice you are splitting the columns in the same way as you would when you normalize tables. However, in this case data could already be normalized to 3NF (and even BCNF and 4NF), but you decide to further split it for the reason of physical optimization.

One way to partition data horizontally, using Oracle syntax:
create table data (    id integer primary key,     status char(1),     data1 varchar2(10),     data2 varchar2(10) )    partition by list (status) (        partition active_data values ( ‘A’ ),       partition other_data values(default)     );

This would tell the DBMS to internally store the table data in two segments (like two tables), depending on the value of the column status. This way of partitioning data can be applied, for example, when you usually query only rows of one partition, e.g., the status ‘A’ rows (let’s call them active rows). Like before, full scans will be faster (particularly if there are only few active rows), the active rows (and the other rows resp.) are stored contiguously (they won’t be scattered around pages that they share with rows of a different status value, and it is more likely that the active rows will be in the page cache.

Different types of models used by databases

There are a lot of databases out there in the market, lets understand the models types and the corresponding usage.

Different types of models used by databases
• Relational DBMS                          • Wide column stores                                   • Content stores
• Document stores                          • Event stores                                                • Graph DBMS
• Key-value stores                          • Multivalue DBMS                                        • Native XML DBMS
• Navigational DBMS                      • Object Oriented DBMS                             • RDF Stores
• Search Engines


Relational databases: It is a collection of data items organized in set often called as tables. All the tables are related to each other. SQL statements are used to fetch, update, and delete the information from these tables.
Popular relational databases are
Oracle                                                                         • MySQL
Microsoft SQL Server                                               • PostgreSQL


Wide column stores: This type of databases stores data in a form of records with an ability to hold any number of dynamic columns. This type of databases doesn’t follow a specific schema. You can think of storing of the data in two-dimensional key-value.
Popular wide column stores databases are
Cassandra                                               • HBase                                                       • Accumulo


Content stores: They are also called as content repositories, specialized in management of digital content, such as text, pictures or videos, including their metadata. Some of the features included are full text search, versioning, hierarchical structured content, and access control.
Popular content stores databases are
Jackrabbit                                                                • Modeshape


Document stores: They are also called as document oriented databases systems, and characterized by their schema-free organization of data. They store records and each record may have different columns. Document stores often use internal notations, which can be processed directly in applications, mostly JSON.
Popular document stores databases are
MongoDB                                 • CouchDB                                                       • Couchbase


Event stores: They persists all state changing events for an object together with a timestamp, thereby creating time series for individual objects. The current state of an object can be inferred by replaying all events for that object from time 0 till the current time.
Popular event stores are
InfluxDB                                          • Event Store


Graph DBMS: Graph DBMS, also called graph-oriented DBMS or graph database, represent data in graph structures as nodes and edges, which are relationships between nodes. They allow easy processing of data in that form, and simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another node.
Popular graph event database are
Neo4j                                                • Titan


Key-value stores: They can only store pairs of keys and values, as well as retrieve values when a key is known. One of the examples of the key-value stores is mentioned below
Key Value
Student1_maths 100
Student1_english 80
Student2_maths 100

Popular Key-value stores are
DynamoDB                                           • Redis                                                       • Memcached


Multivalue DBMS: It is also called as multidimensional database. It is very similar to Relational databases, however it differ from relational databases in that they have features that supports to use of attributes of values, rather than all attributes being single-valued.
Popular Multivalue databases are
Adabas                       • D3                          • UniData,Universe                                 • jBASE


Native XML DBMS: This type of databases internal data model corresponds to XML document. Native XML DBMS do not necessarily store data as XML documents, they can use other formats for better efficiency.
Generally defines a logical model for an XML document – as opposed to the data in that document – and stores and retrieves documents according to that model. Has an XML document as its fundamental unit of storage.
Popular Native XML DBMS
MarkLogic                    • Sedna


Navigational DBMS: This type of databases allows access to data sets only via linked records. They were the first established systems able to manage large amounts of data.
Popular Navigational DBMS
IMS                              • IDMS


Object Oriented DBMS: This type of databases stores the information in the form of objects as used in the object oriented programming. Object oriented databases are different from relational databases which are table oriented.
An object oriented DBMS follows an object oriented data model with classes (the schema of objects), properties and methods. An object is always managed as a whole. This means for example, that the insertion of an object, which in a relational system would probably be stored in multiple tables, will be performed automatically as one atomic transaction – without any action by the application program. Reading an object can also be done as a single operation and without complex joins.
There are tools and architectures that are now provided for the storage of objects into relational databases (such as Hibernate or JPA).
Popular object oriented databases are
Cache                                         • ObjectDB


RDF Stores: The Resource Description Framework (RDF) is a methodology for the description of information, originally developed for describing metadata of IT resources. Today it is used much more generally, often in connection with the sematic web, but also in other applications.
The RDF model represents information as triples in the form of subject-predicate-object.
Database management systems, which are able to store and process such triples, are called RDF stores or triple stores.
Popular RDF stores are
AllegroGraph                               • Jena


Search Engines: Search engines are NoSQL database management systems dedicated to the search for data content. In addition to general optimization for this type of application, the specialization consists in typically offering the following features:
• Support for complex search expressions
• Full text search
• Stemming (reducing inflected words to their stem)
• Ranking and grouping of search results
• Geospatial search
• Distributed search for high scalability
Popular search engines are:
Elasticsearch                               • Solr                                           • Sphinx

Change Liferay Database

This blog will talk about how to change liferay database.

Liferay comes with a default database called “HSQL” or “Hypersonic”. This is not meant for production use however! You might need to switch to a real database to use Liferay. This page documents how to change the default database system.

Database configuration in 6.0, 6.1 is the same as 5.2.

If you need to change the database of liferay, you can choose any database e.g MySQL, PostgreSQL etc. You can change database of liferay in two ways

1. If you are installing liferay change database at that time.

2. If you have already installed liferay, you will have to make some changes.

Change Database:

If you have installed LFR, take backup of your LFR data (called Data Migration).  If you want to take backup of stored LFR data follow these steps:

1. Create a database lportal (or any name) in your database.

2. Login as administrator in LFR.

3. Select Go To –> Control Panel.

4. Select Server –> Server Administration.

Liferay change database

5. Select Data Migration Tab

Liferay change database 2

6. Fill all the details (JDBC Driver Class Name, JDBC URL, JDBC User Name, JDBC Password) in text boxes.

Liferay change database 3

(For Postgres SQL DB)


7. Select Execute

See the LFR server console, You can see the process of data migration. When the process is completed, restart the LFR Server.

As you can see in lportal database, there is lots of tables which contains LFR data.


Connect LFR to database:

Now, we have to connect LFR to database. The configuration of the database is set in a single file, To store the configuration of database, follow these steps:

1. Open new text file.

2. Write configuration of database, e.g

For PostgreSQL:



[email protected]@@@


For MySQL:



jdbc.default.username= @@@@

jdbc.default.password= $$$$$


3. Save as in {liferay-home}\{tomcat}\webapps\ROOT\WEB-INF\classes

4. Restart Liferay Server.

Now, LFR database has been changed.


NOTE:   If you are going through Data Migration Process, then JDBC URL ( jdbc:mysql://localhost:3306/lportal  in this example) should be same at both place, when you are doing Data Migration & when you are writing Configuration in


Sharad Sinha

[email protected]

Helical IT Solutions

Creating a Federated Tables inside MySQL

What is Federated Table?

A federated database system is a type of meta-database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database.

A Federated Table is a table which points to a table in another MySQL database instance (mostly on another server). It can be seen as a view to this remote database table. Other RDBMS have similar concepts for example database links.

federated table

Federated Tables Concept In MYSQL(linux) :

  1. Federated Storage Engine : Using Federated Storage Engine,we will be able to access data from tables of Remote databases rather than local databases. This Federated Engine concept is available from MYSQL 5.3.0 Onwards

Example : Suppose you want to access some data from the DataBase that is on one server1,

Also another set of data from different server2, but data on these two servers are not accessible through out the world , Rather these are available only on one server3 , so what you can do is :

Simply create “Federated” tables for all those tables from which you want to access data on server3 from server1 & server2. When using a FEDERATED table, queries on the server3 are automatically executed on the remote (federated) tables. No data is stored on the local tables.

NOTE :  By default Federated Storage Engine will not enabled on the mysql

Federated Table 2But if you want to verify whether federated engine is on or off, “SHOW ENGINES” sqlquery can help to identify the status.

federated table 2If value of Support column is “YES” then Federated Engine is Turned On, Else If it is “No” it means it is Turned Off.

To enable federated engine ,

  1. You add the line ‘federated’ to the /etc/my.cnf file after  [mysqld] section
  2. restart the mysql server

Federated Table 1

  1. Federated Table Creation : It has two elements :
    1. Remote server : Data is physically stored in this server
    2. Local Server :      No data is stored here,simply you can fire the query and you will be able to see your required output.

Example :

  1. Suppose you have a table on the Remote Server (say Master Table : “student_test”)

Data Base Name is :  STUDENT_DB that you want to   access by using a Federated table


CREATE TABLE student_test(







2. Next,You have to create a table on the local database server(say Federated Table :”Student_federated)” DataBase Name is STUDENT_FEDERATED to access the data on the Master Table


CREATE TABLE Student_federated(






CONNECTION=’scheme://username:[email protected]:port/DBname/TableName;

For the above scenario , The connection String should be :

CONNECTION=’mysql://root:[email protected]:3306/STUDENT_DB /Student_test;


Sequence of execution:

– When a client issues an SQL statement that refers to a FEDERATED table, the flow of information between the local server (where the SQL statement is executed) and the remote server (where the data is physically stored) is as follows:

– The storage engine looks through each column that the FEDERATED table has and constructs an appropriate SQL statement that refers to the remote table.

– The statement is sent to the remote server using the MySQL client API.

– The remote server processes the statement and the local server retrieves any result that the statement produces (an affected-rows count or a result set).

If the statement produces a result set, each column is converted to internal storage engine format that the FEDERATED engine expects and can use to display the result to the client that issued the original statement.


Now You are able to get your required output by firing sql query



Helical IT Solutions

Change Data Capture (CDC) – Methods, Approaches, Alternatives

Change data capture :

It’s the process of tracking the data that has changed so that corresponding action can be taken. Its nothing but capturing the changes which are made on the data source and applying it at all the other places, and hence it reduces the efforts required at the ETL step.


Different strategies :

  • Push: the source process creates a snapshot of changes within its own process and delivers rows downstream. The downstream process uses the snapshot, creates its own subset and delivers them to the next process.
  • Pull: the target that is immediately downstream from the source, prepares a request for data from the source. The downstream target delivers the snapshot to the next target, as in the push model.


Usage of Change Data Capture:

Change data capture is often used in data warehousing since it involves extraction and transformation of data from one or more databases to data warehouse for analysis.

Prior to CDC, the change data was captured using other methodologies i.e. table difference or table comparison. Its like taking a copy of the source and comparing with target, spotting the difference. For limited number of records this approach might work, but for more number of records this methodology will be highly inefficient and result in immense overheads.


Table differencing: It involved copying entire table from source to staging. But since, the entire table was shifted, there were additional overheads and this method was not efficient. Also, in case if there are multiple changes in the data, it cant be captured using this methodology.


Change value selection: This involves capturing source database changed values based on the value of a specific colum. However for implementing this we need to fire queries to find out the changed data which is an additional overhead. Again, in this method also we will not be able to find or capture the intermediate values. Also the changes happening between firing of two queries will not be captured.


A slightly better approach than the ones mentioned above is by using database triggers. This triggers can be used after any Insert/Delete/Modify statement whenever there is any change, the changed information will be passed on to a table. The only disadvantage of this solution is it is database specific and migration of such a solution becomes complex since its db specific.


Another approach would be the usage of log readers. Many Enterprise DBMS have capability of transaction logs that can be used (can be switched on and off), the amount & details of information in log to be stored can be decided as well. By reading this we can come to know of the changes happening in the system. The problem with this is purchasing of new component for log reading, migration might be an issue, also some DBs might not have the logging features etc. Also sometimes some changes which are not committed, their logs are also created which could be as well an issue.


Capturing data with CDC:

CDC can capture data using two modes :: Synchronous and Asynchronous. In synchronous mode, any changed data in source system is informed immediately to the target using DML (data manipulation language) operations, whereas that’s not the case with Asynchronous mode. Synchronous uses triggers and it results in no latency or delay, though it has more overheads as compared to asynchronous.  
Benefits of CDC

–          CDC can improve performance and reduce redundancy by copying only the changed data.

–          Log based CDC also reduces load on operational systems since it only reads the logs rather than db

–          No latency, hence end business users can have immediate realization of the same in their BI

–          CDC can easily handle high volume transactions


We can implement CDC using Talend Integration Suite. More information about the same will be covered in another blog.


For any query, please reach out to me on [email protected]