Looping and batch processing in PDI

Implementing loops and batch processing in Kettle

What is batch processing?
Batch processing means executing a series of jobs that move data from source to destination, where the data is divided into smaller batches and transferred batch by batch. This speeds up execution and helps with memory management, which makes it a popular technique in ETL processes, where the data is generally very large.

Creating loops in PDI:
Let's say you want to implement a for loop in PDI that sends a million (10 lakh) records in batches of 100. Looping is tricky in PDI because it can only be implemented in jobs, not in transformations; Kettle does not allow loops inside transformations.
So, in order to implement loops in PDI, you need a main job that calls a subjob where the loop is designed, and a transformation that contains the logic to send data from source to destination.

Below is the overall design of the subjob that implements the looping and batch processing.

Pic1

 

NrBatches: This transformation prepares the batching: it counts the total number of records in the source table and, based on that count, works out how the data will be divided into batches. Kindly check the screenshots below for more details.

NrBatches Transformation design:

Pic2

 

The Table Input step generates the record count and the minimum value of the primary key (ID) from the source table. The Modified Java Script Value step assigns the batch size and computes the number of batches from the batch count (the table's record count) and the batch size. Kindly look into the screenshot below (and the sketch that follows it) for better understanding.

Pic3
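
As a minimal sketch, the script inside the Modified Java Script Value step could look like the following. The field names batch_count and min_id are assumptions based on the description above: they arrive from the Table Input step, while batch_size and nr_batches are added as new output fields of this step.

    // Minimal sketch of the Modified Java Script Value step (the field
    // names batch_count and min_id are assumed, not taken from the job).
    var batch_size = 100;                                 // records per batch
    var nr_batches = Math.ceil(batch_count / batch_size); // loop iterations
    // Declare batch_size and nr_batches as output fields of this step;
    // the Set Variables step that follows turns them into job variables.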

 

The Set Variables step in the NrBatches transformation sets the batch size, batch count and number of batches as variables so that they can be used in the other transformations and in the job entries that follow.

Table Exists: This entry checks whether the destination table is present. If it is not, an SQL statement is executed to create the table before the job moves to the next step; if it already exists, the job proceeds directly to the next step.

Set Variables: The Set Variables entry in the job initializes the loop counter (just like i in a for loop), for example batch_no = 1. Kindly refer to the screenshot below for clarification.

Pic4

 

Simple Evaluation:

This entry is the loop's conditional check: the counter is evaluated against the success condition, much like if (i < 50) or while (i < 50) in a conventional language. Here, the batch_no variable set in the previous step is compared against the total number of batches (computed in the NrBatches transformation), and the loop continues while the counter has not yet exceeded it. Kindly check the screenshot below for clarification.

Pic5
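
Taken together, the Set Variables, Simple Evaluation, transformation and increment entries behave like the loop sketched below. This is purely illustrative: runBatchTransformation is a hypothetical stand-in for the transformation entry described next, not a real PDI function.

    // Illustrative pseudo-loop, not an actual PDI step.
    for (var batch_no = 1; batch_no <= nr_batches; batch_no++) {
        runBatchTransformation(batch_no); // copies one batch of rows
    }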

Transformation Step:

Inside the transformation, set the flags, i.e. the start and end boundaries of the current batch, using the JavaScript step. Please see the screenshot below for clarification on how to set the flags; a sketch of the same logic follows it. Alternatively, you can use the Formula step or the Calculator step, which is the better option when you have a large amount of data.

JavaScript step:

Pic6
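
As a sketch of the same flag logic, assuming the variables batch_no, batch_size and min_id were set by the earlier steps (your variable names may differ):

    // JavaScript step inside the batch transformation (a sketch).
    // getVariable(name, default) is the helper provided by the
    // Modified Java Script Value step.
    var batch_no   = parseInt(getVariable("batch_no", "1"), 10);
    var batch_size = parseInt(getVariable("batch_size", "100"), 10);
    var min_id     = parseInt(getVariable("min_id", "0"), 10);

    // First and last primary-key values of the current batch; the table
    // input query can then select rows WHERE ID BETWEEN start_flag AND end_flag.
    var start_flag = min_id + (batch_no - 1) * batch_size;
    var end_flag   = start_flag + batch_size - 1;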


Increment Step:

Now, since we are using a loop, there has to be an increment. Here we used a JavaScript job entry to increment batch_no (see the Set Variables step), which completes the looping mechanism. You can also use another transformation instead of the JavaScript entry, which processes faster and consumes less memory. Please see the screenshots below.

-Javascript Step:

Pic7
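
As a sketch, the JavaScript job entry that increments the counter could look like this; parent_job is available inside job-level JavaScript, and the script's final expression must evaluate to true for the success hop to be followed:

    // JavaScript job entry (a sketch): increment the loop counter.
    var batch_no = parseInt(parent_job.getVariable("batch_no"), 10);
    parent_job.setVariable("batch_no", String(batch_no + 1));
    true; // follow the success hop back into the loop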

-Using Another Transformation:

BatchProcessIncrement

The Formula step increments the batch number, e.g. with an expression like [batch_no] + 1, and the result is written back as a variable. See the screenshot below for clarification.

Formula

So this is how looping works in Kettle. There are some limitations, though: this technique consumes a lot of memory when the number of iterations is very high, so use it only for problems where the number of loop executions is low. That will keep the performance healthy.

Thanks,

Nitish

 

Change Data Capture (CDC) – Methods, Approaches, Alternatives

Change data capture:

Change data capture is the process of tracking data that has changed so that corresponding action can be taken: the changes made on the data source are captured and applied at all the other places, which reduces the effort required at the ETL step.

 

Different strategies:

  • Push: the source process creates a snapshot of its changes within its own process and delivers the rows downstream. The downstream process uses the snapshot, creates its own subset, and delivers it to the next process.
  • Pull: the target immediately downstream from the source prepares a request for data from the source. That target then delivers the snapshot to the next target, as in the push model.

 

Usage of Change Data Capture:

Change data capture is often used in data warehousing, since it involves extracting and transforming data from one or more databases into a data warehouse for analysis.

Prior to CDC, changed data was captured using other methodologies such as table differencing or table comparison: take a copy of the source, compare it with the target, and spot the differences. For a limited number of records this approach might work, but for larger volumes it is highly inefficient and results in immense overhead.

 

Table differencing: This involves copying the entire table from the source to a staging area. Since the whole table is shifted, there is additional overhead and the method is not efficient. Also, if the data changes multiple times between comparisons, the intermediate changes cannot be captured.
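
For illustration only, here is a naive table-differencing pass in JavaScript (a hypothetical sketch, not from any CDC product; tables are represented as objects keyed by primary key):

    // Naive table differencing (illustrative sketch): compare a full
    // source snapshot against the target, row by row. The cost grows
    // with table size, which is why this breaks down for large tables.
    function diffTables(source, target) {
        var changes = [];
        var id;
        for (id in source) {
            if (!(id in target)) {
                changes.push({ op: "insert", id: id });
            } else if (JSON.stringify(source[id]) !== JSON.stringify(target[id])) {
                changes.push({ op: "update", id: id });
            }
        }
        for (id in target) {
            if (!(id in source)) changes.push({ op: "delete", id: id });
        }
        return changes; // a single pass still misses intermediate changes
    }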

 

Change value selection: This involves capturing changed values from the source database based on the value of a specific column (for example, a last-updated timestamp). Implementing it requires firing queries to find the changed data, which is an additional overhead. Here too, intermediate values cannot be found or captured, and any changes happening between two successive query runs are lost.

 

A slightly better approach than the ones mentioned above is to use database triggers. A trigger can fire after any Insert/Update/Delete statement, so whenever there is a change, the changed information is written to a tracking table. The disadvantage of this solution is that it is database specific, so migrating it to another database becomes complex.

 

Another approach is to use log readers. Many enterprise DBMSs maintain transaction logs that can be switched on and off, and the amount and detail of information stored in the log can be configured. By reading the log, we can learn of the changes happening in the system. The problems here are the cost of purchasing a log-reading component, possible migration issues, and the fact that some databases lack logging features altogether. Also, logs are sometimes written even for changes that are never committed, which can be an issue as well.

 

Capturing data with CDC:

CDC can capture data in two modes: synchronous and asynchronous. In synchronous mode, any change in the source system is propagated to the target immediately as part of the DML (data manipulation language) operation, whereas in asynchronous mode it is not. Synchronous capture uses triggers and has no latency or delay, though it carries more overhead than asynchronous capture.
Benefits of CDC

  • CDC can improve performance and reduce redundancy by copying only the changed data.
  • Log-based CDC also reduces the load on operational systems, since it reads only the logs rather than querying the database.
  • There is no latency, so business users see the changes reflected immediately in their BI.
  • CDC can easily handle high-volume transactions.

 

We can implement CDC using the Talend Integration Suite; more information on that will be covered in another blog.

 

For any query, please reach out to me on [email protected]

Date Input Controls/ Components in Pentaho CDE with a working example

Hello guys…!!!

In this post you will learn how to create date input controls in CDE.
The dashboards shown in the images below involve a lot of CSS; I'm not covering all of that here, but concentrating only on the date input controls in CDE.

My Environment:
BI Server: Pentaho 4.5
C-Tools: version numbers not found (unable to check them); perhaps earlier than 13
Database: HSQLDB (comes along with the Pentaho installation; a small Java-based database for servers like Pentaho)
NOTE: The images in this post were taken after the dashboard was created; follow the yellow rectangular boxes to understand each step.

You need to work in three places in the "Components" section:
1. Generic
2. Selects
3. Charts
These are explained in steps 1 to 3 below.
You also need to work in one place in the "Data Sources" section, i.e. in its "Properties".
I'm not concentrating on the Layout part in this example, so wherever you find an HtmlObject below, replace it with your own working HtmlObject.

Step 1: Creating parameters
* Design your dashboard as per your requirement (the images show sample designs).
In the "Components" section:
1. Click on Components
2. Click on Generic
3. Click on Date parameter
4. Go to Properties and give it a name,
e.g. param_start_date
5. Click the + symbol to add another date parameter
6. Go to Properties and give it a name,
e.g. param_end_date

Find the image below.

Step 2: Creating date input components
In the "Components" section:
1. Click on Components
2. Click on Selects
3. Click on "Date input Component"
4. Go to Properties and give it a name, e.g. start_date in my example
5. In the Properties, give the HtmlObject, e.g. start_date_select
* The HtmlObject is the name of the layout column where you are placing
your date input control.
6. Give the Parameter (it will appear once you start typing),
e.g. the parameter created in step 1: param_start_date
7. Give the Listeners (click on it to see the list of listeners, then click OK)
8. Repeat steps 3-7 for the "end_date" date input
component.

Step 3:
Find the steps in the image below.

The HtmlObject here is the place where you are putting your chart.

Step 4: Applying parameters to the chart and query
In the "Data Sources" section:
Find the steps in the image below, and the subsequent image for applying the parameters to the chart and the query.

Do the same as shown in the images.

Step 5:
Save the dashboard and then click the "Preview" button.
See the images below for sample output with the date input controls.
Preview of the dashboard without any input controls selected:

Preview of the dashboard after selecting the date input controls; find the images below one by one.

Observe in the images how the chart becomes dynamic with different date input selections.

That's it! You are done with date input controls on your dashboard.
Thank you for reading this small post.

Sadakar
[email protected]
Helical IT Solutions Pvt. Ltd,
Hyderabad.

Hire Pentaho Consultants / Hire Pentaho Developers

Helical IT Solutions, with its deep expertise in Pentaho and in open source DW/BI, can help in designing and constructing your BI solution. Helical has experience across the end-to-end Pentaho BI suite: ETL, data warehousing, C-Tools (CDE, CDF), Pentaho Report Designer, Pentaho Schema Workbench, data mining, and more. We use our in-depth knowledge of both Business Intelligence applications and the Pentaho platform to ensure successful development and deployment of your BI initiatives.

Pentaho Consultant

Helical can help at a number of aspects like

– BI Tool Selection: Based on your requirements, and with our deep-rooted expertise across a number of BI solutions, we can help you determine which BI tools are best suited for you: whether you actually need to purchase an enterprise edition, or whether your requirement can be fulfilled by a community edition with some tweaks or custom coding, or by a best-of-breed combination. The tool selection is done after a thorough analysis of your requirements, present hardware and software, budget, required speed of solution delivery, and so on.

– Pentaho BI POC: If your company is still undecided about going ahead, we can help you develop a POC (proof of concept). We evaluate the open source BI and ETL solution for your environment, demonstrate how it will work, and recommend how you can leverage the technology, even if you have other BI products.

– Pentaho BI Solution Development: With our technical expertise and domain knowledge of end-to-end BI solution development across a number of verticals (we have executed BI solutions in energy, healthcare, insurance, supply chain, e-commerce, and human resources), we can help you design and develop the right BI solution. We will help with selecting the right KPIs, developing reports and dashboards, OLAP cubes, ETL scripts, plugin design, security implementation, application integration, data access, and more. For designing your solution we can use any of the tools in the BI suite listed below.

  • Pentaho Business Intelligence Platform
  • Pentaho Data Integration (Kettle)
  • Pentaho Analysis Services (Mondrian)
  • Pentaho Analytics – Agile BI (commercial), Saiku (open source)
  • Pentaho Reporting
  • Pentaho Data Mining (Weka)
  • Pentaho Dashboards

– Pentaho Data Integrator (PDI) – Kettle: Pentaho BI comes with a powerful ETL (Extract, Transform, Load) suite that allows your existing data to be transformed, summarized, and aggregated into a form that puts the business information at your fingertips. At Helical we have extensive ETL knowledge, employing it both for Business Intelligence purposes and to transform and migrate data between systems, software, and databases. With broad experience in data warehousing, ETL, data modelling, data mart design, query optimization, and more, our dedicated team can help with all of your ETL requirements, including creating transformations and jobs.

– Training and Documentation: Software is only as good as its end users. To make sure the end users can use the developed software well, we provide all the necessary training, supported by documentation such as installation and troubleshooting guides.

– Support: The Helical team provides 24×7 support, ensuring that the solution is always up and running.

Helical provides support and services on the entire Pentaho BI stack, as mentioned below

  • Pentaho Business Intelligence Platform
  • Pentaho Data Integration (Kettle)
  • Pentaho Analysis Services (Mondrian)
  • Pentaho Analytics – Agile BI (commercial), Saiku (open source)
  • Pentaho Reporting
  • Pentaho Data Mining (Weka)
  • Pentaho Dashboards
  • Metadata design and development
  • Performance tuning
  • Application integration

Contact us at [email protected] (+91-7893947676).