Helical Insight for ETL Tester (QA)

Helical Insight for ETL Tester (QA)

Let’s take an example that your ETL developer has just wrote an ETL script, which loaded the data into the Data Warehouse, few millions of records got inserted into the Data Warehouse.

For QA, it is super difficult to test all (millions) of the data and also a time consuming process. Moreover, there are many other problems like

  • some might not be well versed with writing complex SQL queries
  • There might be multiple data sources involved in generating the Data Warehouse.
  • Writing multiple queries on multiple data sources and then combining it in one single result set would be a sweat breaking task.

Now think of Helical Insight as your friend which helps you to do most of your tasks, allowing you to focus on more critical thinking rather than just doing stare and test whole day. Wondering how Helical Insight can help you achieve the same?

Here is your answer,

  • Helical Insight is capable of communicating with any of your data source(s) like CSV, Excel, MySql, Oracle, SQL Server, Postgres etc.
  • Helical Insight has a drag and drop feature to make your test cases ready, without even having a prior knowledge of SQL queries. Helical Insight generates the SQL queries including the joins between multiple entities. Helical Insight understands your data structure, hence it generates the query even if you are working with different set of data source(s), DW or normalized database, Helical Insight understands all.
  • Using this (drag and drop) facility you can generate all your test cases in a form of report, which you can call it as your test cases. For example, your report does a count of records for a dimension. You can save this report and check the same with source system by creating a separate report. If its a migration project the count should definitely match otherwise we know something is wrong with the ETL Script.
  • Once you have saved these reports (test cases) you can re-executes them again an again to verify your data. In the above situation you have just identified that Helical Insight is really a very useful tool with some really cool features like generating the SQL query for any database or data source(s), ability to execute the queries on any databases or data source(s), ability to save the report / test cases / SQL queries for future usage.

Passing parameters from parent job to sub job/transformation in Pentaho Data Integration (Kettle) -Part 2

Passing parameters from parent job to sub job/transformation in Pentaho Data Integration (Kettle) -Part 2

In part 1, I have mentioned about passing parameter values down to sub-job/transformation using setter/getter methods.

Please refer my previous post for part 1 Passing parameters from parent job to sub job/transformation in Pentaho Data Integration (Kettle) -Part 1

But in this part we will use executor steps to do the same process.

Part 2 : Using job/transformation executor steps :

In order to pass the parameters from the main job to sub-job/transformation,we will use job/transformation executor steps depends upon the requirement.

Following are the steps :

1.Define variables in job properties section
2.Define variables in tranformation properties section
3.Configure job/transformation executor step

1.Define variables in job properties section :

Right-click any where on your job and select settings and go to parameters section.Define all the variables and assign some default values to each.


2.Define variables in transformation properties section :

Right-click any where on your sub job/transformation and select settings and go to parameters section.
Use the same variables that you have defined in your parent job (i.e.Step1) and assign some default values to each.
3.Configure job/transformation executor step :

In this step,we have to configure sub job/transformation path details and need to pass the same parameters in job entry details section like below.

Double click on job/transformation executor step and provide transformation file path details.


Go to parameters section and make sure you have checked the Pass all parameter values down to the sub-transformation check box.


So now you can utilize same variables in your sub-transformation.




Staging DB in Datawarehousing – The WHYs and WHY NOTs

Staging DB in Datawarehousing – The WHYs and WHY NOTs

Whenever we talk about designing datawarehouses, staging database is one consideration we have to make. And it depends on many factors whether you choose to have one or not. I feel when we start small, like implementing small datamarts, we may not see the need for a staging database but as the DW implementations grow in size and complexity, staging DBs have a lot of benefits to offer. Staging DBs are often the place to dump source data, the workplace to create temperory intermediate tables for processing the transformation steps in ETLs, to keep what is good, to reject what is bad and so on.

Lets quickly look through the PROs and CONs of having Staging DBs in the DW implementations:

The WHYs

Source data accumalation: If data is being pulled from multiple systems Source systems and not all data is available at the same time, we may need to accumalate all raw data pulled at different times and then integrate/merge and load the same.

Data transformation, cleansing, data alignment: Staging DB may be the workplace for the ETL to do the required cleansing, apply transformation rules and perform alignment of data from various systems.

Decoupling of Source from Datawarehouse: Staging DB helps decouple the Source systems from the DW. Source systems are accessed only by ETL jobs which pull data into staging, or they may even choose to push data there. This greatly reduces the amount of time that source systems are accessed by DW loading jobs, reducing contention too.

ETL Performance: Loading staging tables is simple as they are mostly the exact replicas of the source tables, minus the indexes / foreign key relationships. As most of the joins on the data , cleansing, calculations are done on staging tables and may be stored in intermediate temporary tables, loading of DW from staging tables is also simpler. This greatly increases the ETL performance as well as quality of the DW data.

Handle Failure, debugging, restart , recovery: As staging DBs have snapshots of source data, can have extensive logging, intermediate tables used for data transformations etc, they make it easier to debug errors/failures. In case of restarts / failures, the step of pulling the source data may not have to be redone.

Data reconciliation / pre / post load validations : This is very much similar to the point above. Staging DBs are excellent to capture auditing information which can be used for reconciling data, validating data to reject bad data, make sure the data we loaded is correct , consistent with the source, nothing is missed out.

Change Data Capture: Staging DB is also used as one of the mechanisms for CDC. For this the copy of source data tables from previous load are preserved till the next execution. The current and previous source data are compared to identify changes to the data. This is mostly used in scenarios where the source systems do not have means to support change detections.

Simpler, manintenable ETLs: With staging, we divide the process of Extract Transform Load into different processes. Combining the extraction, cleansing, transformation logic, precalculations, summarizations into one single ETL makes it very complex and also a single point for failure. With staging step, ETLs are smaller, simpler and easy to maintain as well.

Prevent Orphan records / race conditions: When we fetch data from source directly and load into DW, there are chances that we sometimes may pull some partial updates happening in the transactional system or the new data might be getting inserted as we pull causing race conditions. With Staging, the data from source is pulled in one single go preventing such problems.

Backup of source: The staging DBs if archived / retained for some specific period of time may act like a backup of the source data which is loaded into the DW.



Increases processing time , as data is first loaded into staging tables, transformed and then loaded to the DW. Due to the additional step, there is slight increase in processing time.
Increases latency, source changes take more time to reflect in DW. In realtime/near realtime DW, staging is not usually included.
Environment requirements : Additional Database servers / storage space required.

The overhead that staging may impose is quite less as compared to the benefits we derive. But then again, it completely depends on what suits a given requirement. The above points may however help us in making a choice.

Shraddha Tambe | Helical IT Solutions


1) http://data-warehouses.net/architecture/staging.html
2) http://dwbi.org/etl/etl/52-why-do-we-need-staging-area-during-etl-load
3) http://danlinstedt.com/allposts/datavaultcat/data-vault-and-staging-area/
4) http://www.dwh-club.com/dwh-bi-articles/change-data-capture-methods.html

About Analytical Databases

Analytics-iconAbout Analytical Databases 

Companies and Corporates that build Business Intelligence solutions for their business, often want to consolidate their data that is spread across various applications, for reporting and analysis purposes. For that, they build datawarehouses or data marts as the foundation of their enterprise-wide BI solutions. These DWs have been implemented using traditional relational databases and successfully so. But as organizations add more and more data and users to these systems, the scabiltity and performance become crucial. One of the useful contributors to solving the problems of scability / performance could be the use of Analytical Databases or Datawarehousing DBs.


What are these Analytical Databases ?

Analytics databases are optimized to be used for retrieval, aggregation and analytics. They deliver benefits in a BI environment because they are designed for high performance in this environment.

These Analytical databases, or analytical platforms, span a range of technology from appliances and columnar databases to shared nothing, massively parallel processing databases. The common thread among them is that most are read-only environments that deliver exceptional price/performance compared with general-purpose relational databases originally designed to run transaction processing applications. Analytical databases are now a central component in any BI ecosystem, providing the necessary performance and scalability to support large numbers of analytical users and growing volumes of both structured and unstructured data.

Below are a few differences between transactional and analytical DBs:

  • On-line transactional procesing (OLTP) databases are designed to host operational systems that run the business. Analytics databases step in when operational use of the data has ended, and the data are to be used for information and analysis.
  • OLTP-oriented databases, are designed for high performance in an on-line transactional environment, where additions, updates, and deletions are being continuously made to individual or small numbers of records in a random manner over time, to data scattered accross tables. In an analytics database, there are mostly bulk uploads of data, deletions and update are rare / infrequent.
  • Data integrity is critical to transactional data, OLTP databases are optimal for these.
  • OLTP databases support environments where many concurrent users are retrieving, adding, updating, and deleting data on realtime basis. Analytics databases, on the other hand, support environments where very many concurrent users are retrieving data but adds, updates, and deletes are done periodically, on scheduled basis.
  • In OLTP environments, retrievals are oriented towards single or few records where many or most columns from these records are required. In analystics databases, retrievals are oriented towards many or most records (statistical operations); few columns from these records are required.


What makes Analytical DBs faster and more scalable ?

Let’s look at some of the technologies I mentioned above to see how these make analytical DBs different. These are some but not all of the technologies that are used.

Columnar Databases :

A columnar database, also known as a column-oriented database, is a database management system (DBMS) that stores data in columns rather than in rows as relational DBMSs. The main difference between a columnar database and a traditional row-oriented database are centered around performance, storage necessities and schema modifying techniques. The goal of a columnar database is to efficiently write and read data to and from hard disk storage and speed up the time it takes to return a query. One of the main benefits of a columnar database is that data can be highly compressed allowing columnar operations — like MIN, MAX, SUM, COUNT and AVG— to be performed very rapidly. Another benefit is that because a column-based DBMS is self-indexing, it uses less disk space than a relational database management system (RDBMS) containing the same data.

Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.

Massively Parallel Processing:

MPP (massively parallel processing) is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory. An MPP system is also known as a “loosely coupled” or “shared nothing” system. MPP allows the warehouse to break up large data analytics jobs into smaller, more manageable chunks, which are then distributed to multiple processors. Analytics is run simultaneously – or in parallel – on each processor, then the results returned and synthesized.

In Memory analytics:

I don’t think I really need to elaborate on this one. They are databaes which load the source data into system memory in a compressed, non-relational format in an attempt to streamline the work involved in processing queries.

Online analytical processing (OLAP):

These are the databases, which store multidimensional “cubes” of aggregated data for analyzing information based on multiple data attributes.


This is a little bit about analytical databases and how they might add value to a DW/BI Solution.

Shraddha TambeHelical IT Solutions

Slide share – determine right analytic db

Map Reduce in Mongo db :

Map Reduce in Mongo db :

This Blog will teach you, how to write Map reduce in Mongo DB .

Map Reduce is a concept that process large volume of data into aggregated results.

To use Map Reduce Concept in Mongo DB , create one command called “mapReduce”.

This mapReduce() function fetch data from collection (table) and then produce the result set into (key, value) format.

Then reduce () function takes the (key, value) pair and reduce all the data (documents) on the same key.

Eg : – Let say I have two Collection (tables) named :

  1. Emp_test
  2. Dept_Test

Now , to create collection in mongo db , use below query



To insert data in Emp_test Collection :

db.Emp_test.insert({“name” : {       “first” : “ABC”,       “last” : “DEF”   },   “city” : “Hyd”,   “department” : 1})

db.Emp_test.insert({“name” : {       “first” : “GHI”,       “last” : “JKL”   },   “city” : “Pune”,   “department” : 2})

To insert data in Dept_Test Collection :

db.Dept_Test.insert({“_id” : 1,   “department” : “SALESMAN”})

db.Dept_Test.insert({“_id” : 2,   “department” : “CLERK”})

Now the requirement is to display FirstName , LastName , DepartmentName.

For this , we need to use Map Reduce :

# 1 : Create two map functions for both the collections.

var mapEmp_test = function () {

var output= {departmentid : this.department,firstname:this.name.first, lastname:this.name.last , department:null}

emit(this.department, output);               };

var mapDept_Test = function () {

var output= {departmentid : this._id,firstname:null, lastname:null , department:this.department}

emit(this._id, output);               };

Write Reduce Logic to display the required fields :

var reduceF = function(key, values) {

var outs = {firstname:null, lastname:null , department:null};


if(outs.firstname ==null){                       outs.firstname = v.firstname                   }                   if(outs.lastname ==null){                       outs.lastname = v.lastname                   }                   if(outs.department ==null){                       outs.department = v.department                   }                          });   return outs;};

# 3 : Store the result into a different collection called emp_dept_test

result = db.employee_test.mapReduce(mapEmployee, reduceF, {out: {reduce: ’emp_dept_test’}}) result = db.department_test.mapReduce(mapDepartment,reduceF, {out: {reduce: ’emp_dept_test’}})

# 4: write the following commanddb.emp_dept_test.find()

Using Scriptlet in jasper :

This blog will teach you how to use Scriptlet in jasper report.

Scriptlets :

Scriptlets act as an interface between Jasper reports and java

Using Scriptlet we can use user defined functions in jasper reports .

Scriptlet is a java class that extend one of the two following classes :

  • JRDefaultScriptlet
  • JRAbstractScriptlet

Using JRAbstractScriptlet , we should implement all the abstract metods.

Using JRDefaultScriptlet , we only need to implement the methods which we need for our project.

Since Scriptlet is a java class , so to add a sciptlet(user defined class) in jasper reports ,

  1. first step is to create a java class implementing user requirement.
  2. Then create its jar file
  3. In the iReport , go to Tools à Options àiReport à classpath ,

Add the jar files created by you in this location

  1. Then in the Scriptlet node of iReport , add the Scriptlet Name and fully qualified name of the jar that earlier you have created .


Here , in the Scriptlet Class , Reporting_Time is my class name


When we are done with creating a new scriptlet under Scriptlet node in iReport , We can see a new parameter created under parameter node ,

Named as {Name_Of_Scriptlet}_SCRIPTLET

In my case it is “Report_SCRIPTLET”

Using the above parameter we can call any methods declared inside the class.



Rupam Bhardwaj

Exception in thread “main” java.lang.OutOfMemoryError: GC overhead limit exceeded

Different causes for OOM

Every one in java development face java.lang.OutOfMemoryError now and then, OutOfMemoryError in Java is one problem which is may be due to

  • Improper memory management by the Java code
  • Insufficient hardware memory available to process the code
  • Programming mistakes
  • Memory leak

Types of OOM

Any of the above reason can result into OutOfMemoryError. Following three different types of OOM errors are common

  1. Java.lang.OutOfMemoryError: Java heap space
  2. Java.lang.OutOfMemoryError: PermGen space
  3. Java.lang.OutOfMemoryError: GC overhead limit exceeded

OOM in Talend

From above the last error I observed during flat file data migration to database using Talend Open Studio. File sizes were in the range of 10 MB-500MB in size. Initially job worked well, but when started with larger files of around 100MB, this error popped up. Hardware memory (RAM) available was 16 GB. But default memory allocated by Talend was Xmx1024m (1GB). There was a possibility to increase Xmx to 10240m which could have solved the issue, but this GC overhead limit exceeded was related to garbage collection. After searching for the issue, I came across a very descriptive article related to garbage collection at http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#available_collectors


Following workaround solved the problem in Talend without increasing the memory limit to higher figures.

Add new option to Windows–>Preferences–>Talend–>Run/Debug  – XX:-UseGCOverheadLimit

Steps to migrate oracle with pentaho

Step 1:-

Run script as DB admin.

Script is available in location:- biserver-ce\data\oracle10g.

Modify configuration file:-

  1. applicationContext-spring-security-hibernate.properties.



original code:-






Modified code:-






  1. hibernate-settings.xml

Location:- pentaho-solutions\system\hibernate\hibernate-settings.xml.

Original code


Modified code:-




pentaho-solutions\system\hibernate\ oracle10g.hibernate.cfg.xml

Do not need to change any code in this file.. Just need to check everything is perfect or not


   <!– Oracle 10g Configuration –>

<property name=”connection.driver_class”>oracle.jdbc.driver.OracleDriver</property>

<property name=”connection.url”>jdbc:oracle:thin:@localhost:1521/sysdba


<property name=”dialect”>org.hibernate.dialect.Oracle10gDialect</property>

<property name=”connection.username”>hibuser</property>

<property name=”connection.password”>password</property>

<property name=”connection.pool_size”>10</property>

<property name=”show_sql”>false</property>

<property name=”hibernate.jdbc.use_streams_for_binary”>true</property>

<!– replaces DefinitionVersionManager –>

<property name=”hibernate.hbm2ddl.auto”>update</property>

<!– load resource from classpath –>

<mapping resource=”hibernate/oracle10g.hbm.xml” />


  1. quartz.properties:-



Original Code

org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.PostgreSQLDelegate

Modified Code:-

org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.oracle.OracleDelegate




Original Code

<Resource name=”jdbc/Hibernate” auth=”Container” type=”javax.sql.DataSource”

factory=”org.apache.commons.dbcp.BasicDataSourceFactory” maxActive=”20″ maxIdle=”5″

maxWait=”10000″ username=”hibuser” password=”password”

driverClassName=”org.hsqldb.jdbcDriver” url=”jdbc:hsqldb:hsql://localhost/hibernate

validationQuery=”select count(*) from INFORMATION_SCHEMA.SYSTEM_SEQUENCES” />


<Resource name=”jdbc/Quartz” auth=”Container” type=”javax.sql.DataSource”

factory=”org.apache.commons.dbcp.BasicDataSourceFactory” maxActive=”20″ maxIdle=”5″

maxWait=”10000″ username=”pentaho_user” password=”password”

driverClassName=”org.hsqldb.jdbcDriver” url=”jdbc:hsqldb:hsql://localhost/quartz

validationQuery=”select count(*) from INFORMATION_SCHEMA.SYSTEM_SEQUENCES”/>


Modified Code:-

<Resource validationQuery=”select 1 from dual”

url=” jdbc:oracle:thin:@localhost:1521/sysdba

driverClassName=”oracle.jdbc.OracleDriver” password=”password”

username=”hibuser” maxWait=”10000″ maxIdle=”5″ maxActive=”20″


type=”javax.sql.DataSource” auth=”Container” name=”jdbc/Hibernate”/>


<Resource validationQuery=”select 1 from dual”

url=” jdbc:oracle:thin:@localhost:1521/sysdba

driverClassName=”oracle.jdbc.OracleDriver” password=”password”

username=”quartz” maxWait=”10000″ maxIdle=”5″ maxActive=”20″


type=”javax.sql.DataSource” auth=”Container” name=”jdbc/Quartz”/>

6. repository.xml

Location of the file: pentaho-solutions\system\jackrabbit\repository.xml.

Comment this code means (<! – – every thing here – -> )

Active means: Remove comment

i)                    FileSystem part

Comment this code

<FileSystem class=”org.apache.jackrabbit.core.fs.local.LocalFileSystem”>

     <param name=”path” value=”${rep.home}/repository”/>


Active this code:-

<FileSystem class=”org.apache.jackrabbit.core.fs.db.OracleFileSystem”>

   <param name=”url” value=”jdbc:oracle:thin:@localhost:1521“/>

   <param name=”user” value=”jcr_user”/>

   <param name=”password” value=”password”/>

   <param name=”schemaObjectPrefix” value=”fs_repos_”/>

   <param name=”tablespace” value=”pentaho_tablespace”/>


ii)                  DataStore part

Comment this code

<DataStore class=”org.apache.jackrabbit.core.data.FileDataStore”/>

Active this code:-

<DataStore class=”org.apache.jackrabbit.core.data.db.DbDataStore”>

<param name=”url” value=”jdbc:oracle:thin:@localhost:1521/sysdba”/>

<param name=”driver” value=”oracle.jdbc.OracleDriver”/>

<param name=”user” value=”jcr_user”/>

<param name=”password” value=”password”/>

<param name=”databaseType” value=”oracle”/>

<param name=”minRecordLength” value=”1024″/>

<param name=”maxConnections” value=”3″/>

<param name=”copyWhenReading” value=”true”/>

<param name=”tablePrefix” value=””/>

<param name=”schemaObjectPrefix” value=”ds_repos_”/>


iii)                Security part in the FileSystem Workspace part

Comment this code:-

<FileSystem class=”org.apache.jackrabbit.core.fs.local.LocalFileSystem”>

<param name=”path” value=”${wsp.home}”/>


Active this code:-

<FileSystem class=”org.apache.jackrabbit.core.fs.db.OracleFileSystem”>

<param name=”url” value=”jdbc:oracle:[email protected]:1521/sysdba”/>

<param name=”user” value=”jcr_user”/>

<param name=”password” value=”password”/>

<param name=”schemaObjectPrefix” value=”fs_ws_”/>

<param name=”tablespace” value=”pentaho_tablespace”/>


iv)       PersistenceManager part

Comment this code:-

<PersistenceManager class=”org.apache.jackrabbit.core.persistence.pool.H2PersistenceManager”>

<param name=”url” value=”jdbc:h2:${wsp.home}/db”/>

<param name=”schemaObjectPrefix” value=”${wsp.name}_”/>


Active This Code:-

<PersistenceManager class=”org.apache.jackrabbit.core.persistence.bundle.OraclePersistenceManager”>

<param name=”url” value=”jdbc:oracle:thin:@localhost:1521/sysdba”/>

<param name=”driver” value=”oracle.jdbc.OracleDriver”/>

<param name=”user” value=”jcr_user”/>

<param name=”password” value=”password”/>

<param name=”schema” value=”oracle”/>

<param name=”schemaObjectPrefix” value=”${wsp.name}_pm_ws_”/>

<param name=”tablespace” value=”pentaho_tablespace”/>


v)       FileSystem Versioning part

Comment This Code:-

<FileSystem class=”org.apache.jackrabbit.core.fs.local.LocalFileSystem”>

<param name=”path” value=”${rep.home}/version” />


Active This Code:-

<PersistenceManager class=”org.apache.jackrabbit.core.persistence.bundle.OraclePersistenceManager”>

<param name=”url” value=”jdbc:oracle:thin:@localhost:1521/sysdba”/>

<param name=”driver” value=”oracle.jdbc.OracleDriver”/>

<param name=”user” value=”jcr_user”/>

<param name=”password” value=”password”/>

<param name=”schema” value=”oracle”/>

<param name=”schemaObjectPrefix” value=”pm_ver_”/>

<param name=”tablespace” value=”pentaho_tablespace”/>




Stopping HSQL db start up

In web.xml file

Comment or delete this code (Commenting is preferable)




<param-value>[email protected]/../data/hsqldb/sampledata,[email protected]/../data/hsqldb/hibernate,[email protected]/../data/hsqldb/quartz</param-value>




Also comment this code






You have done with integrating pentaho 5.0.1 CE with Oracle

Now login to the Pentaho server .

URL:  http://localhost:8080/pentaho

Username/Password : Admin/password

Using MDX Query in Pentaho CDE

This blog will talk about Using MDX Query in Pentaho CDE

In Layout Section:

Design the layout of your Dashboard, from scratch or from a template – while defining the layout you can apply styles and add HTML elements as text or images.

Pentaho CDE MDX Query

In Component Section:

In side Component layout we are using different types of chart and table or other things. According to component we will provide Data source name and Html Object and some other properties you can also use if you needed.

Pentaho CDE MDX Query

In Data source Panel:

In side data source panel we are we are using driver like jdbc or jndl according to requirement. In My example I am using Jndl data source and Mondrian Schema like Sample data and query like:

Select NON EMPTY {[Measures].[Actual]} ON COLUMNS,

NON EMPTY Crossjoin(Hierarchize(Union(Union(Crossjoin({[Region].[All Regions]},
{[Department].[All Departments]}), Crossjoin({[Region].[All Regions]},
[Department].[All Departments].Children)), Crossjoin([Region].[All Regions].Children,
{[Department].[All Departments]}))), {[Positions].[All Positions]}) ON ROWS

from [Quadrant Analysis]

Note: Below Image you can see the Process.

Pentaho CDE MDX Query

Preview Panel:

After Applying all those thing what I mentioned in above information the output will show like these. This output is same as OLAP creation output in MDX query. You can compare both output.

Pentaho CDE MDX Query

Pentaho CDE MDX Query

Note: When you are start working in MDX Query in Pentaho CDE. You have to first know how to use OLAP Kettle -MDX OLAP. After that you can use that MDX query inside Pentaho CDE (Pentaho Community Dashboard).


Helical IT Solutions