Automated data testing for disparate data sources

Disparate-source data testing is a major challenge in the IT industry. In our last article, 10 Key Performance Indicators to check for quality data after ETL, we saw how data leaks can happen and which checks to apply during ETL development. In this article we will see how to perform automated data testing across disparate data sources such as XML to Excel, MySQL to SQL Server, and JSON to PostgreSQL.

Helical has come up with a new and unique product, based on Helical Scrunch, for comparing data from disparate data sources; it can report differences down to the individual cell level. The terminology used below is Expected and Actual. Expected is the data set to be compared against, which may come from one of the sources or may be prepared by the developer. Actual is the actual source data, which may come from XML, CSV, MySQL, JSON, or any other source. After comparing the data, the following differences are available for analysis (a minimal sketch of the idea follows the list):

Comparison Indicators

  • Additional columns in Expected
  • Additional columns in Actual
  • Additional records in Expected
  • Additional records in Actual
  • Mismatched records from Expected
  • Number of records in Expected
  • Number of records in Actual
  • Rejected/mismatched/additional records written to a flat file or database column for analysis
  • Row number of each rejected/mismatched record
  • Column name of each rejected/mismatched record
  • Actual rejected/mismatched record from Expected
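The product itself is not open source, but the idea can be illustrated. Below is a minimal pandas sketch (the function name and the single key column are assumptions for illustration, not the product's API) that produces the indicators above for two data sets, assuming the key is unique on both sides:

```python
import pandas as pd

def compare_frames(expected: pd.DataFrame, actual: pd.DataFrame, key: str) -> dict:
    """Produce the comparison indicators listed above for two data sets."""
    report = {
        "additional_columns_in_expected": sorted(set(expected.columns) - set(actual.columns)),
        "additional_columns_in_actual": sorted(set(actual.columns) - set(expected.columns)),
        "records_in_expected": len(expected),
        "records_in_actual": len(actual),
        # Records present on one side only, matched by the key column.
        "additional_records_in_expected": expected[~expected[key].isin(actual[key])],
        "additional_records_in_actual": actual[~actual[key].isin(expected[key])],
    }
    # Cell-level mismatches over the rows and columns both sides share.
    shared = [c for c in expected.columns if c in actual.columns]
    exp = expected[shared].set_index(key)
    act = actual[shared].set_index(key)
    common = exp.index.intersection(act.index)
    # DataFrame.compare() reports each differing cell with both values,
    # which yields the row, column name, and mismatched value.
    report["mismatched_cells"] = exp.loc[common].compare(act.loc[common])
    return report
```

Here compare_frames(expected_df, actual_df, key="id") would return the record counts, the one-sided surpluses, and a DataFrame.compare() result that pinpoints the row, column, and both values of every mismatched cell.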

Using this tool, comparing diverse data sources becomes an effortless task: it produces all of the comparison details listed above in very little time.

Test scenarios

  • MySQL – SQL Server
  • Excel – MySQL
  • XML – MySQL
  • CSV – SQL Server
  • MySQL – MySQL
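Each of these scenarios reduces to the same pattern: normalise both sides into a common in-memory representation, then run one comparison. A sketch using pandas readers, with placeholder connection strings and an illustrative orders table:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection strings and table names -- replace with real ones.
mysql = create_engine("mysql+pymysql://user:pass@mysql-host/db")
mssql = create_engine(
    "mssql+pyodbc://user:pass@mssql-host/db?driver=ODBC+Driver+17+for+SQL+Server"
)

# The MySQL -> SQL Server scenario: both sides become DataFrames,
# so the compare_frames() sketch above works unchanged.
expected = pd.read_sql("SELECT id, name, amount FROM orders", mysql)
actual = pd.read_sql("SELECT id, name, amount FROM orders", mssql)

# The other scenarios only swap the reader, for example:
# pd.read_excel("orders.xlsx"), pd.read_xml("orders.xml"),
# pd.read_csv("orders.csv"), pd.read_json("orders.json")
```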

The limitation found in this approach is that if your source data is not moved as-is to the target, the developer needs to create the Expected data set to compare against the target, which is time consuming. But if the data is already available in two different sources, the comparison task becomes effortless.

Applications

  • Data Migration
  • Data Testing
  • Data warehousing
  • Data synchronization
  • Data consolidation

Advantages

Usually, when developing larger projects with hundreds of database tables or data from different sources, it is difficult to test every record and verify whether it was properly inserted into the target. This tool ensures that even after changes to the ETL, executing the comparison test provides comparison results for all the tables. It is a one-time setup that you can execute any number of times for data verification.

  • Can perform or simulate data testing for every ETL execution
  • Saves developers time and effort on regression testing (a sketch of such a test follows this list)
  • Ensures correctness of data
  • Improves the quality of the project
  • With slight modification, it can be used for both initial and incremental loads
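As a hedged illustration of the regression point above, a comparison script can sit inside an ordinary test suite and run after every ETL execution; the inline frames below stand in for real source and target extracts:

```python
import pandas as pd

# test_etl_regression.py -- run with pytest after each ETL execution.
def test_orders_match():
    # In a real suite these would come from pd.read_sql() on source/target.
    expected = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})
    actual = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})
    merged = expected.merge(actual, on="id", suffixes=("_exp", "_act"))
    assert len(expected) == len(actual)   # no surplus records on either side
    assert len(merged) == len(expected)   # every key found a match
    assert (merged["amount_exp"] == merged["amount_act"]).all()  # cell values agree
```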

So, if you face the daunting task of comparing records and identifying discrepancies across financial, relational, or any other diverse data sources, please contact us and we will provide a solution that matches your expectations.

————————————————————————

Vaibhav Waghmare (DW Architect), Helical IT Solutions

10 Key Performance Indicators to check for quality data after ETL

Have you ever tested your data considering all the options and scenarios for data leakage? Imagine you have source data that is converted or migrated into target data by an ETL tool or some other process. This conversion or migration continues day after day, for months and years. How many times have you tested the correctness of the source and target data against the business rules and expectations? There is a real chance of a wide gap between source and target. This article shows how data can leak while the ETL still appears to execute correctly, without any error. Many factors also depend on the ETL tool, so this article presents 10 key performance indicators for checking data quality after ETL, focusing on aspects that are independent of any particular ETL tool.

ETL Key Performance Indicators for data testing

1) Source reject – Assume the developer created proper metadata for the source, but the source contains data that differs from that metadata. In this case, the data is rejected outright at the source and never enters the mainstream ETL processing. Unless it is captured, nobody will ever know it was rejected. This count should be zero.

2) Target reject – Similarly, the developer created metadata for the target, but the processed data does not match the target metadata. The data is rejected before being inserted into the target; if you don't capture it, it is lost. This count should be zero.
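A sketch of the capture idea, applicable to both source and target rejects, using only the Python standard library; the expected metadata (int id, str name, float amount) is illustrative. Rows that fail the check go to a reject file with their line number and reason, so the "should be zero" count can actually be measured:

```python
import csv

def load_with_reject_capture(src_path: str, reject_path: str) -> list:
    """Load a CSV, writing rows that violate the expected metadata
    to a reject file instead of losing them silently."""
    accepted = []
    with open(src_path, newline="") as src, \
         open(reject_path, "w", newline="") as rej:
        reject_writer = csv.writer(rej)
        for line_no, row in enumerate(csv.reader(src), start=1):
            try:
                # Hypothetical metadata: (int id, str name, float amount).
                accepted.append((int(row[0]), row[1], float(row[2])))
            except (IndexError, ValueError) as exc:
                # Keep the line number and reason alongside the raw row.
                reject_writer.writerow([line_no, str(exc)] + row)
    return accepted
```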

3) Source has more data – When you compare source and target data after the ETL completes, you may find that the source has more records than the target; the reason can be anything. This surplus should be zero when compared with the target.

4) Target has more data – This scenario occurs when data has been deleted from the source but still exists in the target. The target needs updating as part of the regular ETL run. This surplus should be zero when compared with the source.
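Indicators 3 and 4 reduce to a count comparison after every run. A sketch with placeholder connection strings and an illustrative orders table:

```python
from sqlalchemy import create_engine, text

# Placeholder connection strings -- replace with real credentials.
source = create_engine("mysql+pymysql://user:pass@source-host/db")
target = create_engine(
    "mssql+pyodbc://user:pass@target-host/db?driver=ODBC+Driver+17+for+SQL+Server"
)

with source.connect() as s, target.connect() as t:
    src_count = s.execute(text("SELECT COUNT(*) FROM orders")).scalar()
    tgt_count = t.execute(text("SELECT COUNT(*) FROM orders")).scalar()

# Indicators 3 and 4: both surpluses should be zero.
print(f"source-only surplus: {max(src_count - tgt_count, 0)}")
print(f"target-only surplus: {max(tgt_count - src_count, 0)}")
```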

5) Source data mismatches the target – This happens when a transformation is applied to the source during ETL, or when migration from one data source to another automatically converts the data before inserting it into the target. In short, the source differs from the target. Data mismatches need to be verified, and this count should be zero.

6) Target data mismatches the source – This happens when data is updated in the source but the change never reaches the target. Data mismatches need to be verified, and this count should be zero.
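Indicators 5 and 6 can be detected without comparing column by column: hash each row on both sides and compare the hashes per key. A minimal pandas sketch with illustrative data (it assumes the key is unique and the shared columns match):

```python
import hashlib
import pandas as pd

def row_hashes(df: pd.DataFrame, key: str) -> pd.Series:
    """Hash every row so mismatches can be found without comparing
    each column individually. Columns are sorted so order is irrelevant."""
    joined = df.sort_index(axis=1).astype(str).agg("|".join, axis=1)
    return pd.Series(
        [hashlib.sha256(v.encode()).hexdigest() for v in joined],
        index=df[key],
    )

# Illustrative frames standing in for source and target extracts.
source = pd.DataFrame({"id": [1, 2], "amount": [10.0, 99.0]})
target = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})

mismatched_keys = (
    row_hashes(source, "id")
    .compare(row_hashes(target, "id"))  # keeps only rows whose hashes differ
    .index.tolist()
)
print(mismatched_keys)  # [2] -- this count should be zero
```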

7) Timestamp issues – This is one of the most troublesome issues in the ETL process. Data comes from different time zones, is collected at a central place, and is then brought into staging for warehousing. A timestamp is usually used for the incremental load, and that timestamp can cause records from a specific time zone to be skipped.

  1. A related time-zone issue is the timestamp format itself – not all databases support every timestamp format, so the conversion needs to be handled when moving data from one database to another.
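As a sketch of the normalisation fix (the time zone name and timestamp are illustrative; zoneinfo requires Python 3.9+), every incoming timestamp can be converted to UTC before it is used as an incremental-load watermark, so no time zone's records are skipped:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc(local_ts: str, source_tz: str) -> datetime:
    """Attach the source time zone to a naive timestamp, then convert to UTC."""
    naive = datetime.fromisoformat(local_ts)
    return naive.replace(tzinfo=ZoneInfo(source_tz)).astimezone(timezone.utc)

print(to_utc("2016-05-10 09:30:00", "Asia/Kolkata"))
# 2016-05-10 04:00:00+00:00
```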

8) Time lag during ETL – Often an ETL job runs for a few hours every day, so you need a proper signature for ETL completion. If the completion signature is not reliable and some data is inserted or updated while the ETL is running, the next execution will certainly skip a few records. A dedicated method is needed to track the source and target data.
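A minimal sketch of the completion-signature idea, using an in-memory SQLite audit table (the table and column names are assumptions): the watermark is recorded only after the batch finishes, and the next run re-reads from the last good watermark inclusively so in-flight changes are not skipped:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE etl_audit (job TEXT, watermark TEXT, status TEXT)")

def last_good_watermark(job: str) -> str:
    """Only batches marked 'done' count as a completion signature."""
    row = con.execute(
        "SELECT MAX(watermark) FROM etl_audit WHERE job=? AND status='done'",
        (job,),
    ).fetchone()
    return row[0] or "1970-01-01 00:00:00"

def run_incremental(job: str, batch_end: str) -> None:
    wm = last_good_watermark(job)
    # '>=' instead of '>' deliberately overlaps the previous watermark,
    # trading a few re-processed rows for zero skipped rows.
    print(f"SELECT * FROM orders WHERE updated_at >= '{wm}'")  # illustrative query
    con.execute("INSERT INTO etl_audit VALUES (?, ?, 'done')", (job, batch_end))

run_incremental("orders_daily", "2016-05-10 02:00:00")
```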

9) Null handling and metadata conversion – Every ETL execution must take proper care of null handling and metadata conversion. A schema-compliance check is needed.
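A minimal schema-compliance sketch with pandas (the SCHEMA mapping is an illustrative declaration, not a standard API): each extract is validated for expected dtypes and nullability before loading:

```python
import pandas as pd

# Hypothetical declaration: column -> (expected dtype, nullable?).
SCHEMA = {
    "id": ("int64", False),
    "name": ("object", True),
    "amount": ("float64", False),
}

def check_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of compliance problems; empty means the extract passes."""
    problems = []
    for col, (dtype, nullable) in SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
        if not nullable and df[col].isna().any():
            problems.append(f"{col}: unexpected NULLs")
    return problems

df = pd.DataFrame({"id": [1, 2], "name": ["a", None], "amount": [1.0, 2.0]})
print(check_schema(df))  # [] -- an empty list means the extract complies
```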

10) Monitoring – After the ETL completes, proper auditing and logging must be in place to ensure that everything works as expected. All of the above points, among other items, need to be logged during ETL project development, with report bursting on the audited and logged data.
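As one possible shape for such logging, Python's standard logging module can record the indicator counts after each run; the audit() helper and the counter names below are hypothetical:

```python
import logging

logging.basicConfig(
    filename="etl_audit.log",
    format="%(asctime)s %(levelname)s %(message)s",
    level=logging.INFO,
)
log = logging.getLogger("etl")

def audit(step: str, **counters: int) -> None:
    """Log every indicator count; flag any non-zero value for follow-up."""
    log.info("%s %s", step, counters)
    if any(counters.values()):
        log.warning("%s has non-zero counters -- investigate", step)

audit("orders_load", source_rejects=0, target_rejects=0,
      count_diff=0, mismatches=2)
```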

In addition, there are many activities and practices to follow to ensure that data moves from source to target as expected. If this is not ensured, the result is data loss, which translates directly into revenue loss for a company: if the source data behind a dashboard is wrong, the dashboard or report shows it as-is, presenting false and partial figures rather than correct ones. Often this goes unnoticed unless the data is compared with some other data set, and such a comparison is a remote chance given today's fast-paced, high-speed data requirements. In many cases no error is raised during the ETL process, which gives the false impression that the ETL is working fine.

Are you doing all of this during your ETL project development? If not, make sure you do next time you start an ETL project. Whether you are a client, a developer, or a business owner, if you are moving data from one location to another, try to ensure that all of the above points are checked or implemented during the ETL activity. Or contact us for quality ETL development.

Vaibhav Waghmare, DW Architect, LinkedIn