Loops in Pentaho Data Integration


If only there were a Loop component in PDI *sigh*.

But we can achieve looping easily with the help of a few PDI components. Let's take the requirement of having to send mails. If the mail IDs and files to be sent were fixed, we could just use a property file and pass the mail IDs in as parameters.

But here the mail IDs come in from a table source, so they can never be assumed to be the same.
So how did I approach this?

Let's say the mail IDs come from the table mail_address. My aim here is to first add a row number to each record and a Maxcount of all the records on each line (*keep the Maxcount in mind*),

and then pass the row number as a filter clause:

where row_number = 1

Now my assumption is that the query executes and returns one row; the column values, like sender mail and receiver mail, are then set into variables and passed to a mail transformation.

But how can we increase the value of the WHERE clause, i.e. how can we make

where row_number = 1 to

where row_number = 2

where row_number = 3, and so on?

 

So here is where looping is required: we take a variable and increment its value.

1. Take a Set Variables entry at the job level [Initialize Loop], set a variable named loop, and assign it your initial value, as shown below. In my case, loop = 1.

[Screenshot: Set Variables entry initializing loop = 1]

 

Next, take a transformation to get the variables and set them, as shown below:

[Screenshot: Get Variables → Modified Java Script → Set Variables]

In my scenario I have a Modified Java Script step because a few conversions were required; after that, the variable is set.
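The Modified Java Script step just normalizes the incoming value before it is re-set; a minimal sketch (the field name loop is an assumption, and str2num/num2str are Kettle's built-in JS helpers):

// Modified Java Script Value: the variable arrives as a string,
// so cast it to a number for any arithmetic and back to a string
// for the Set Variables step that follows.
var loopNum = str2num(loop);     // 'loop' is the field from Get Variables
var loopOut = num2str(loopNum);  // 'loopOut' feeds the Set Variables step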

Next we need the transformation where we pass our loop variable into our query,
as shown below.

[Screenshot: Table Input using the ${loop} variable]

 

SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY createfolderpath) AS row_num,
           COUNT(dummy) OVER (PARTITION BY dummy) AS maxcount
    FROM (
        SELECT * FROM (
            SELECT DISTINCT REPLACE(createfolderpath, '/', '\') AS createfolderpath,
                            1 AS dummy
            FROM reportpaths
        ) base
    ) base2
    CROSS JOIN [TEST_SLSDMBKP].[dbo].[job_mailing_address]
    WHERE scheduled_flag = 'Y'
) base3
WHERE row_num = ${loop}
ORDER BY row_num, receivers_email, createfolderpath

Now here my Table Input outputs exactly one row, so I pass the required columns into the Modified Java Script step and then set them as variables.

[Screenshot: Table Input → Modified Java Script → Set Variables]
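The conversion here can be as simple as renaming and casting the incoming columns; a hedged sketch (column names follow the query above, output names are my own):

// Modified Java Script Value: prepare the single row of mail details
// as new output fields for the Set Variables step.
var RECEIVERSEMAIL = receivers_email;    // column from the mailing-address table
var FOLDERPATH = createfolderpath;       // column from the reportpaths subquery
var MAXCNT = num2str(maxcount);          // PDI variables are strings, so cast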

Next, at our job level, we use a JavaScript entry to increment the loop value (from 1 to 2, then 2 to 3, and so on); a minimal sketch:
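// Job-level JavaScript entry: read the current loop value,
// increment it, and write it back for the next iteration.
// (parent_job is available inside job-level JavaScript; the variable
// name matches the loop we initialized earlier.)
var next = parseInt(parent_job.getVariable("loop"), 10) + 1;
parent_job.setVariable("loop", "" + next);
true; // the entry must evaluate to true to follow the success hop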

Next, at our job level, we add a new transformation that performs our actual requirement of sending mails, as shown below:

[Screenshot: mail-sending transformation]

And we pass all the required variables into the Mail component, as shown below:

[Screenshot: Mail component configuration]

RECEIVERSEMAIL and the rest are values from our DB that we set as variables.

After that we go back to our job level and use a Simple Evaluation entry to stop the loop once a particular condition is met.

[Screenshot: Simple Evaluation comparing ${loop} against ${MAXCNT}]

Here ${MAXCNT} is the Maxcount of all the records from our query, which we set as a variable.

Once the loop number reaches 6 while the Maxcount is 5, the evaluation stops the loop and the job ends; otherwise it continues to the next entry, a Wait entry, whose importance was covered in a previous blog.

So that's it: we went through looping in PDI.

If you have any other approach, please feel free to drop it in the comments below.

Thanks
Sohail Izebhijie

Analytic Query in Pentaho Data Integration [LAG & LEAD]

There could be a time we get a requirement as follows:

Fetch every purchase sum for each customer in a year, with the following columns:

customerid, customername, productcost, purchasemonth

Now this looks easy, but the tricky part is fetching the PreviousPurchaseMonth, the previous month in which he/she made a purchase.

If you're using a Postgres DB this will be easy, because there are various window functions that will help you acquire your result with ease.

But if your DB doesn't support window functions, it could be a problem.

So Pentaho will help you reach your result with far less hassle, with the help of the Analytic Query step.

Now let's build our query for one customer:

SELECT
sf97.customer_id as CustomerID,
co.fullname as CustomerName,
SUM(sf97.store_sales) as ProductCost,
tbd.the_month as PurchaseMonth
FROM sales_fact_1997 sf97
JOIN time_by_day tbd ON tbd.time_id = sf97.time_id
JOIN customer co ON co.customer_id = sf97.customer_id
WHERE sf97.customer_id = 4873
GROUP BY
sf97.customer_id,
co.fullname,
tbd.the_month,
tbd.month_of_year
ORDER BY sf97.customer_id, tbd.month_of_year
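For comparison, on a DB that does support window functions (Postgres, for example), the previous purchase month could be added directly in SQL with LAG; a sketch against the same schema:

SELECT
sf97.customer_id as CustomerID,
co.fullname as CustomerName,
SUM(sf97.store_sales) as ProductCost,
tbd.the_month as PurchaseMonth,
LAG(tbd.the_month) OVER (PARTITION BY sf97.customer_id
                         ORDER BY tbd.month_of_year) as PreviousPurchaseMonth
FROM sales_fact_1997 sf97
JOIN time_by_day tbd ON tbd.time_id = sf97.time_id
JOIN customer co ON co.customer_id = sf97.customer_id
WHERE sf97.customer_id = 4873
GROUP BY sf97.customer_id, co.fullname, tbd.the_month, tbd.month_of_year
ORDER BY sf97.customer_id, tbd.month_of_year

With the Analytic Query step we get the same PreviousPurchaseMonth without needing any window-function support in the source DB.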

Now we open up PDI and create a new transformation:

  • Select Table Input as your Source Table
  • Next Select the Analytic Query

[Screenshot: Table Input connected to Analytic Query]

  • Open up the Analytic Query step.
  • From our requirement, we want to fetch the PreviousPurchaseMonth for each customer (the month in which he/she previously made a purchase), so the unique item here is the customerid.
    [Screenshot: Analytic Query configuration]
  • In the Group fields section, select customerid.
  • In the Analytic Functions section, enter the name of your new field.
    The Subject will be the purchasemonth column, because our new column is derived from it.

Type: this can be LAG (behind) or LEAD (forward).
Click OK.

Now right-click the Analytic Query step and select Preview -> Quick Launch.

[Screenshot: preview output showing the new PreviousPurchaseMonth column]

There you go.

Note: make sure your query has an ORDER BY that matches your requirement, since the Analytic Query step works on the incoming row order.

Ehizogie Sohail Izebhijie

Beginner’s Guide to E.T.L (Extract, Transform and Load) – Introduction

Introduction to E.T.L (Extract, Transform and Load)
This is a process related to data warehousing which involves extracting data out of the source system(s) and placing it into a repository or target.

Extraction
Extracting the data from source systems (flat files or other operational systems) and converting it into one consolidated data warehouse format ready for transformation.

Transformation
Transforming the data may involve the following tasks:

  • Cleaning: One of the most important tasks in the transformation stage, because the source data will often contain values the target system doesn't support or understand, hence cleaning is required.
    In some cases the source can come from many inputs, so lookups are important to avoid duplication.
  • Filtering: The source data may have many rows, but it's important to send only relevant data to your target and filter out the unnecessary rows.
  • Business Rules: Calculations or derivations can be performed here so we have correct and readable data at the target.

and many more.

Loading

After proper transformation, when the data matches the business rules, loading it into a target or repository is the final step in E.T.L (Extract, Transform and Load).

In my next blog we will look into the basics of loading data from source to target.

Thanks
Sohail Izebhijie

Passing parameters from parent job to sub-job/transformation in Pentaho Data Integration (Kettle) – Part 1

Part 1: Using setter and getter methods

1. Set variables in the parent job/transformation.
2. Get variables in the sub-job/transformation.

1. Set variables:

Identify the field names that you are going to set using the Set Variables step and assign each a proper variable name.

Also define the scope of the variable, with the following possible options.

Variable scope types:

1. Valid in the virtual machine: the complete virtual machine will know about this variable.
Warning: this makes your transformation only fit to run in a stand-alone fashion.
Running on an application server like the Pentaho framework can become a problem,
because other transformations running on the server will also see the changes this step makes.
2. Valid in the parent job: the variable is only valid in the parent job.
3. Valid in the grand-parent job: the variable is valid in the grand-parent job and all the child jobs and transformations.
4. Valid in the root job: the variable is valid in the root job and all the child jobs and transformations.

Configure the Set Variables step like below.

[Screenshot: Set Variables step configuration]

But please note that IT IS NOT POSSIBLE TO SET AND USE A VARIABLE IN THE SAME TRANSFORMATION. This is because all steps run in parallel.

2. Get variables:

Use the Get Variables step in the sub-job/transformation to get the same information from the parent job/transformation.

But you need to make sure that you have specified the variable name in the correct format, like ${variable}
or %%variable%% (as described in Variables). That means you can also enter complete strings in the variable column, not just a variable.

Configure the Get Variables step like below.

[Screenshot: Get Variables step configuration]

So now you can utilize the same variables in your sub-job/transformation wherever required.
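For example, a variable set in the parent job can be referenced inside a Table Input step of the sub-transformation; a hypothetical sketch (REGION is an assumed variable name, and the 'Replace variables in script?' option must be checked on the step):

SELECT *
FROM customers               -- hypothetical source table
WHERE region = '${REGION}'   -- replaced with the parent job's value at runtime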

Apart from this, we can also pass all parameters down to a sub-job/transformation using the Job/Transformation Executor steps.

Please follow my next blog for part 2:
Passing parameters from parent job to sub job/transformation in Pentaho Data Integration (Kettle) – Part 2.

Thanks,

Sayagoud

Passing parameters from parent job to sub-job/transformation in Pentaho Data Integration (Kettle) – Part 2

In part 1, I described passing parameter values down to a sub-job/transformation using setter and getter methods.

Please refer to my previous post for part 1: Passing parameters from parent job to sub job/transformation in Pentaho Data Integration (Kettle) – Part 1.

In this part we will use executor steps to do the same.

Part 2: Using job/transformation executor steps

In order to pass parameters from the main job to a sub-job/transformation, we use the Job or Transformation Executor step, depending on the requirement.

Following are the steps:

1. Define variables in the job properties section
2. Define variables in the transformation properties section
3. Configure the job/transformation executor step

1. Define variables in the job properties section:

Right-click anywhere on your job, select Settings, and go to the Parameters section. Define all the variables and assign a default value to each.

[Screenshot: job properties, Parameters section]

2. Define variables in the transformation properties section:

Right-click anywhere on your sub-job/transformation, select Settings, and go to the Parameters section.
Use the same variables that you have defined in your parent job (i.e. step 1) and assign a default value to each.

[Screenshot: transformation properties, Parameters section]
3. Configure the job/transformation executor step:

In this step, we configure the sub-job/transformation path details and pass the same parameters in the job entry details section, like below.

Double-click the Job/Transformation Executor step and provide the transformation file path details.

[Screenshot: Transformation Executor, file path configuration]

Go to the Parameters section and make sure you have checked the 'Pass all parameter values down to the sub-transformation' checkbox.

[Screenshot: Transformation Executor, Parameters section]

So now you can utilize the same variables in your sub-transformation.

 

Thanks,

Sayagoud