Loops in Pentaho Data Integration

Loops in PDI

 

If only there were a Loop component in PDI *sigh*.

But we can achieve looping easily with the help of a few PDI components. Let's take the requirement of having to send mails. If the mail IDs and the files to be sent were fixed, we could just use a property file and pass the mail IDs in as parameters.

But here the mail IDs come from a table source, so they will never be the same from run to run.
So how did I approach this?

Let's say the mail IDs come from the table mail_address. My aim here is to first add a row number to each record and a maxcount of all the records on every line (*keep the maxcount in mind*),

and then pass the row number into a filter clause:

where row_number = 1

The idea is that the query executes and returns one row, the column values (sender mail, receiver mail and so on) are set into variables, and those variables are passed to a Mail transformation.

But how can we increment the value in the WHERE clause, i.e. how can we make

where row_number = 1 to

where row_number = 2

where row_number = 3 and so on.
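Conceptually, the Table Input query is just a numbered-rows query whose filter comes from a PDI variable. Here is a minimal sketch of the idea, using simplified placeholder table and column names (the actual query used in this job appears further down):

SELECT *
FROM (
    SELECT m.*,
           ROW_NUMBER() OVER (ORDER BY receiver_email) AS row_number,
           COUNT(*) OVER () AS maxcount
    FROM mail_address m
) numbered
WHERE row_number = ${loop}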

 

So here is where looping is required: we take a variable and increment its value.

[Image: resultop0]
1. Take a Set Variables entry at the job level [Initialize Loop], then set a variable named loop and assign it your initial value, as shown below. In my case, loop = 1.

[Image: resultop]

 

Next, take a transformation to get the variables and set the variables as shown below:

[Image: resultop1]

Now in my scenario I have a Modified JavaScript step, because a few conversions were required; after that, the variables are set.

Next we need a transformation where we pass our loop variable into our query,
as shown below:

[Image: resultop3]

 

SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY createfolderpath) AS row_num,
           COUNT(dummy) OVER (PARTITION BY dummy) AS maxcount
    FROM (
        SELECT * FROM (
            SELECT DISTINCT replace(createfolderpath, '/', '\') AS createfolderpath, 1 AS dummy
            FROM reportpaths
        ) base
    ) base2
    CROSS JOIN [TEST_SLSDMBKP].[dbo].[job_mailing_address]
    WHERE scheduled_flag = 'Y'
) base3
WHERE row_num = ${loop}
ORDER BY row_num, receivers_email, createfolderpath

Here, the Table Input will output one row, so I pass the required columns into the Modified JavaScript step and then set them as variables.

[Image: resultop5]

Next, at the job level, we use a JavaScript entry to increment the loop value to 2, as shown above.

Next, at the job level, we add a new transformation which fulfils our requirement of sending mails, as shown below:

[Image: resultop6]

And we pass all the required variables into the Mail component as shown below:

[Image: resultop8]

RECEIVERSEMAIL and the rest are values from our DB that we set as variables.

After that we go back to the job level and use a Simple Evaluation entry

to stop the loop once a particular condition is met.

[Image: resultop10]

Here ${MAXCNT} is the maxcount of all the records from our query, which we set as a variable earlier.

Once the loop number is 6 and the maxcount is 5, the evaluation stops the loop and the job ends; otherwise

it continues to the next entry, a Wait entry (the importance of the Wait entry was covered in a previous blog).

So that's it, we went through looping in PDI.

If you have any other approach, please feel free to drop it in the comments below.

Thanks
Sohail Izebhijie

Analytic Query in Pentaho Data Integration

Analytic Query in Pentaho Data Integration [LAG & LEAD]

There could be a time we get a requirement as follows:

Fetch every purchase sum for each customer in a year, with the following columns:

customerid, customername, productcost, purchasemonth

Now this looks easy, but the tricky part would be fetching the

PreviousPurchaseMonth

in which he/she made a purchase.

If you're using a Postgres DB then this will be easy, because there are various window functions that will help you acquire your result with ease.

But if your DB doesn't support window functions, then it could be a problem.

So Pentaho will help you reach your result with very little hassle, with the help of the Analytic Query step.

Now let's build our query for one customer:

SELECT
    sf97.customer_id AS CustomerID,
    co.fullname AS CustomerName,
    SUM(sf97.store_sales) AS ProductCost,
    tbd.the_month AS PurchaseMonth
FROM sales_fact_1997 sf97
JOIN time_by_day tbd ON tbd.time_id = sf97.time_id
JOIN customer co ON co.customer_id = sf97.customer_id
WHERE sf97.customer_id = 4873
GROUP BY
    sf97.customer_id,
    co.fullname,
    tbd.the_month,
    tbd.month_of_year
ORDER BY sf97.customer_id, tbd.month_of_year
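For comparison, on a database that does support window functions (PostgreSQL, for example), the PreviousPurchaseMonth could be derived directly in SQL with LAG. This is only a sketch of the equivalent, wrapped around the same aggregation as the query above:

SELECT CustomerID,
       CustomerName,
       ProductCost,
       PurchaseMonth,
       LAG(PurchaseMonth) OVER (
           PARTITION BY CustomerID
           ORDER BY month_of_year
       ) AS PreviousPurchaseMonth
FROM (
    -- the same aggregation as above, kept as a subquery
    SELECT sf97.customer_id AS CustomerID,
           co.fullname AS CustomerName,
           SUM(sf97.store_sales) AS ProductCost,
           tbd.the_month AS PurchaseMonth,
           tbd.month_of_year AS month_of_year
    FROM sales_fact_1997 sf97
    JOIN time_by_day tbd ON tbd.time_id = sf97.time_id
    JOIN customer co ON co.customer_id = sf97.customer_id
    GROUP BY sf97.customer_id, co.fullname, tbd.the_month, tbd.month_of_year
) purchases
ORDER BY CustomerID, month_of_year;

The Analytic Query step gives you the same result without needing window-function support in the database.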

Now we open up PDI and create a new transformation:

  • Select a Table Input as your source table
  • Next, select the Analytic Query step

[Image: OpenPDI]

  • Open up the Analytic Query step
  • Now, from our requirement, we want to fetch the PreviousPurchaseMonth for each customer (the month in which he/she previously made a purchase),
    so the unique item here is the customerid
    [Image: Analytic]
  • Then, in the Group field, select customerid
  • In the Analytic Functions section,
    enter the name of your new field;
    the Subject will be the purchasemonth column, because our new column is derived from it

Type: this can be Lag (behind) or Lead (forward).
Click OK.

Now right-click on the Analytic Query step and select Preview -> Quick Launch.

[Image: resultop]

There you go.

Note: make sure your query has an ORDER BY that matches your requirement.

Ehizogie Sohail Izebhijie

Alternative approach of using Insert/Update Step for Upsert in PDI

What is an UPSERT?

UPSERT is a combination of two activities on a table, i.e. UPDATE and INSERT, based upon a unique key (unique ID). A relational database uses a MERGE to perform the UPSERT operation: it updates when there is a change to the existing data and inserts when there is new data, based upon certain conditions (identified by the unique ID).
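As a point of reference, this is roughly what a native MERGE-based upsert looks like in SQL. It is only an illustrative sketch; the table and column names below are assumptions, not taken from the ETL discussed here:

MERGE INTO target_customer AS tgt
USING staging_customer AS src
    ON tgt.customer_id = src.customer_id          -- the unique key
WHEN MATCHED THEN
    UPDATE SET customer_name = src.customer_name,
               region        = src.region
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, region)
    VALUES (src.customer_id, src.customer_name, src.region);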

How does an UPSERT work in an ETL?

While loading data from source to target, there are two prominent approaches, namely:

  • Truncate and load: Flush out the old data and reload the new data.
  • Insert/Update (Upsert): Update the existing record and insert if there is new data. It matches on a particular key or a set of keys to perform the update/insert. If the keys don't match, it is an INSERT; otherwise it is an UPDATE. SCDs work the same way (SCD-I is an UPSERT whereas SCD-II is an INSERT).

The truncate-and-load approach is not recommended in some scenarios, for example where the source only holds current data. When historical information needs to be maintained as well, we use the UPSERT mechanism.

UPSERT in Pentaho Data Integration (PDI):

There are various steps that can be used to implement UPSERT in PDI. These steps use the unique key as the lookup key against the target table, and the specific operation is performed based upon the condition. The following are the steps used for upsert:

  1. Insert/Update (inserts when there is no match and updates when there is a match)
  2. Dimension Lookup/Update (this is used for SCD operations)
  3. Synchronize after merge (this step needs a flag to execute; based upon the flag it will insert or update)

The basic algorithm/flow of how UPSERT functions in PDI is as follows:

[Image: Capture1]

Disadvantages of the above steps:

The steps that ship with PDI, and the algorithms they use to implement UPSERT, are very slow when handling large amounts of data.

  1. The Insert/Update step does a lookup on all the values in the table and does a one-to-one comparison to either insert or update. If the amount of data is huge, the step works very slowly and hence the whole process slows down.
  2. The same happens with the Dimension Lookup/Update and the Synchronize after merge steps: since they do a lookup on the whole table, i.e. compare against each and every row in the table, they slow down the process.

Alternative approach to the UPSERT technique:

There are certain scenarios where the ETL runs on a regular basis, let's say every 12 hours. In that case, the steps used to implement the UPSERT technique won't help, since they are very slow, and with huge amounts of data the performance will be very poor. So there is a need for optimization. Optimizing an ETL process is one of the most important things we need to undertake, since it helps in faster processing. The following diagram will help you understand the alternative approach to the UPSERT mechanism:

[Image: Capture2]

The flow in the above diagram works exactly like the Insert/Update step, but with some minor tweaks. The following is a description of each step:

  1. Table Input: takes input from the source data that needs to be loaded into the target.
  2. Add Sequence: adds the sequence number that is needed as a business key on the table (this can change according to your requirement).
  3. Update: this step works only when there is a need to update. If a row is new, this step will fail for that row, since it only works when there is existing data to update.
  4. Table Output: PDI supports error handling on each step. The error-handling hop of the Update step brings the failed rows to the Table Output step, which INSERTs the new data into the target table (see the SQL sketch below).
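In plain SQL terms, the Update-plus-error-handling pattern amounts to something like the following (again, the table and column names are only assumptions for illustration):

-- Try the update first; this only touches rows that already exist.
UPDATE target_customer
SET    customer_name = 'John Doe',
       region        = 'NJ'
WHERE  customer_id = 101;

-- The row is inserted only when no existing row matched the key,
-- which is the case the Update step hands over to Table Output.
INSERT INTO target_customer (customer_id, customer_name, region)
SELECT 101, 'John Doe', 'NJ'
WHERE NOT EXISTS (SELECT 1 FROM target_customer WHERE customer_id = 101);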

Please see the diagram below for a better understanding:

[Image: Capture3]

Conclusion:

This approach is faster than the usual Insert/Update step since it doesn't need a lookup to detect new data; it directly inserts the new data when the Update step fails.

Try once and share your feedback!! 🙂

 

Regards,

Nitish Mishra

Different ways of implementing loop in PDI

Different ways of implementing loop in PDI


We come across certain scenarios where running a job/transformation in a loop becomes necessary in Pentaho. Below are the different ways of implementing a loop in a job in PDI.

 

1. Using Execute for every input row

Build a job, and let the first transformation retrieve all the rows via a Table Input. Use a "Copy rows to result" step to save the result. Create a second transformation configured for single-row execution – refer to the advanced transformation settings in the image below. Use "Get rows from result" to start single-row processing.

[Image: blog_3]

 

 

2. Using the Simple evaluation step

[Image: blog_1]

  • Get the record count for the loop.

  • Assign the count to a variable.

  • Call the variable in the Simple evaluation step (as shown in the figure below).

    [Image: blog_2]

  • Give the appropriate condition according to the requirement.

  • Connect the hop to the job which has to run in a loop.

Guide to Slowly Changing Dimensions [Intro + Type 1]

Guide to Slowly Changing Dimensions [Intro + Type 1]

Firstly what is a dimension?
A dimension is a structure that categorizes facts and measures which can be used to understand business requirements.

What is a Slowly Changing Dimension?
Slowly Changing Dimensions are dimensions that change slowly over time.

Why is it needed?
In data warehousing, which deals with historical data, tracking changes is important as it helps to give a better understanding of the business.

An Example
Imagine having a customer dimension in a taxation department which holds records of addresses and so on.
When a customer changes his/her address, it is important to track the address change, i.e. from old to new.

Types of Slowly Changing Dimension.
1. Type 1
2. Type 2
3. Type 3
4. Type 4
5. Type 5

1. Type 1:
This follows the technique of replacing the previous value; no historic data is kept. It is commonly used for corrections or updates of records.
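In SQL terms, a Type 1 change is just an in-place overwrite. A minimal sketch, assuming a dimension table called customer_dim keyed on client_id (the same key and region change used in the walkthrough below; the table name and the sample client_id value are assumptions):

-- Type 1 SCD: overwrite the old value, keep no history.
UPDATE customer_dim
SET    region = 'NJ'      -- the new value simply replaces the old one ('NY')
WHERE  client_id = 101;   -- client_id is the lookup/primary key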

In this blog I will show you how to create a Type 1 Slowly Changing Dimension using Pentaho Data Integration.

Steps
1. Identify your source, or create one if you have to.
[Image: CSV input]

2. Load the data into the database table.
[Image: Load Data]

Now open a new transformation.
3. Select an input step for your source; in my case it's a CSV file input.
4. Select a Get System Info step [optional].
Double-click on it and:
For Type: select System date (fixed)
For Name: load_date (any name)

5. If you are using a Get System Info step, then this step is necessary.
Select a Join Rows (Cartesian product) step.
What this does is cross join the new load_date column onto the source table.
Select OK.

6. Select a Select Values step;
this is used to add, remove or alter fields from the source before they go to the target, if required.

Note: make sure all the steps are connected as shown below.
[Image: Before insert]
Double-click on the Select Values step.

Click Get fields to select.
Click on the Meta-data tab and again select Get Fields to change.
Click OK!

7. Select an Insert/Update step.
Select your target schema.
Select your target table.

Now select Get fields
and remove the fields you don't require, leaving the main key or primary key to look up values on;
in my case, client_id.

Then

select Get update fields,
and in the Update column select N for the columns you don't want to update;
in my case again, client_id.

Now, if you selected the Get System Info step to add a load date column, then the step below is important.

Remember we added load_date; we want to add that column to the target table,

so select SQL
and

look through the pre-generated statement available and you will see an ALTER TABLE adding the column load_date.
So all we need to do is select Execute, then run a query to see that the column has been added, as shown below:

[Image: load_date added]

Click OK!

8. Now go to your source, edit one record and add another, as shown below:

[Image: added_data]
and then save it.

I have added a new row and also edited Sohail's region from NY to NJ.

Now your whole transformation should look something like this, with all the connections pointing to the right steps:

[Image: after all]

Now run the transformation.

After it runs successfully,

check your data via a query:

[Image: final data]

As you can see, the data has been replaced/overwritten for Sohail [NY to NJ],
and one new row of data has also been added.

In my next blog we will discuss Type 2 Slowly Changing Dimensions.

Thanks
Izebhijie Sohail Ehizogie

Handling Rollback and Database transactions in Pentaho Data Integration.

Handling Rollback and Database transactions in Pentaho Data Integration.


 

 

While inserting data into a database table, it often happens that a job/transformation fails partway through the data load, and we would like the whole transaction to be rolled back.

In general, the Table Output step in PDI will insert and commit any number of records until there is an error. Consider an example where the ETL processes 5000 records and the 3999th record has a string instead of an integer (e.g. the DDL defines the field as integer): the job/transformation will come to a sudden halt.

The problem now, though, is that we have 3,998 records in the DB table from this current ETL run. In some situations this might not be an issue, e.g. if you have a CDC process set up which can deal with this (e.g. it can just start off from the highest id or timestamp), but in other scenarios this situation might be a problem.

Pentaho Data Integration provides a transactional feature which is really quite easy to use. Make sure your DB engine supports transactions! E.g. MySQL's MyISAM engine does not support transactions, so the transformation would run and insert records even though there was an error.

Transactions in transformations: simply enable the "Make the transformation database transactional" option in the "Miscellaneous" tab of the transformation settings dialog.

[Image: transformation settings dialog]
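Under the hood, this option amounts to running the whole load inside one database transaction, along the lines of this sketch (target_table and its columns are made up purely for illustration):

BEGIN;

INSERT INTO target_table (id, amount) VALUES (1, 100);        -- rows up to 3998 succeed
INSERT INTO target_table (id, amount) VALUES (3999, 'oops');  -- type error on row 3999

-- Because everything ran in a single transaction, the error triggers a rollback
-- and none of the rows from this run remain in target_table.
ROLLBACK;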

Note: a couple of caveats with database transactions:

  • Enabling this option disables the "Use batch update for inserts" option in the Table Output step and also ignores the Commit size setting, so effectively it will slow down the insert operation.
  • Database transactions will not work if you use database partitioning, where multiple connections per step copy need to be created; performance-wise it doesn't make sense in that case anyway.

Mondrian Cache Refresh Automation Mechanism in Pentaho BA Server

REFRESH MONDRIAN CACHE ON THE BA SERVER

One of the strengths of mondrian’s design is that you don’t need to do any processing to populate special data structures before you start running OLAP queries. More than a few people have observed that this makes mondrian an excellent choice for ‘real-time OLAP’ — running multi-dimensional queries on a database which is constantly changing.

The cache stores data in memory, which brings great performance enhancements. However, when you have a cube with (near) real-time data, the caching can get in your way. If, for instance, your data is updated every ten minutes, but the data from an hour ago is still cached, your dashboard won't show your new data.

There are two ways to clear the cache on the server.

Option 1: After you log in to the BA Server, go to Tools –> Refresh. There you can manually clear the cache.

Option 2: Another way of clearing the Mondrian cache is to use a Kettle job to clear it after you complete the insertion/updating of the new data. First we create a transformation. Inside the transformation, using the HTTP Client step, we can clear the cache from the ETL (before it, add a Generate Rows step, since the HTTP Client step only works if there is a step before it) by passing the following URL inside the step, "localhost:8080/pentaho/api/system/refresh/mondrianSchemaCache", along with the user credentials. This clears the Mondrian cache every time the ETL runs on the server. It clears the cache for all the cubes present on the server, so that it is ready for the fresh data and we won't face any cache-related issues, which have always been a headache for real-time data.

Similarly, we can also use this mechanism to clear the reporting metadata cache and the repository cache; we just have to follow the same procedure.

 

Regards,

Nitish Kumar Mishra

Beginner’s Guide to E.T.L (Extract, Transform and Load) – Connection Set-Up

Connection Setup for connection type: PostgreSQL

[We are now setting up a connection to a database, i.e. if your source is a database.]
There are 3 access types provided for connections using PostgreSQL:
a. Native (JDBC)
b. JNDI
c. ODBC

a. Native (JDBC)
I. Enter the connection name: any name
II. Select the connection type: PostgreSQL
III. Host name: localhost [this can be an IP address]
Database name: LearnETL [the name of the database you are using]
Port number: 5432, or your required port number
User name: your database user name
Password: your database password
IV. Test the connection and click OK.

b. JNDI
Here we need to go to the data-integration folder, open up the sub-folder "simple-jndi" and edit jdbc.properties.

Here we need to add the following entries:

ETL_Connection/type=javax.sql.DataSource
ETL_Connection/driver=org.postgresql.Driver
ETL_Connection/url=jdbc:postgresql://localhost:5432/LearnETL
ETL_Connection/user=postgres
ETL_Connection/password=postgres

ETL_Connection: name of the connection
localhost:5432/LearnETL: localhost is the host name, 5432 is the port number and LearnETL is the Database name.
user: username
password: Password

Save, go back to the database connection dialog
and restart your PDI.

In the setup, select JNDI, and for
JNDI Name: enter the name of your connection [ETL_Connection]

c. ODBC

This is not commonly used, but what will be needed

is as follows:

1. Install the PostgreSQL ODBC driver, which can be downloaded.
2. Select PostgreSQL ODBC Unicode, and then
3. in the setup, enter the data source name, server, username, password and port. Test and save if the connection is OK.
4. A bundle of JAR files needs to be copied into your Java folder, as the JDBC-ODBC bridge was discontinued in Java 8.

Thank You
Sohail Izebhijie

Beginner’s Guide to E.T.L (Extract, Transform and Load) – A Basic Process

Loading Data from Source to Target

Before we proceed, it's important to identify the tool you need to accomplish the ETL process; in my case I will be using the Pentaho Data Integration application (keep in mind that irrespective of the tool you use, the steps or procedures are similar, but the approach might differ).

The following Steps can be followed:

1. Identify your Source, it could be the following:
a. C.S.V file
b. Text file
c. A database
d. and so on

In my scenario, it's a C.S.V (comma-separated values) file.

2. Open up your Spoon.bat.
Create a new transformation and select an "Input", then select what type of input you require:
we have Text file input, CSV file input, SAS Input, Table input and so on. In my case, since I'll be using a C.S.V file as the source, I'll select the CSV file input step.

3. Set up your connection based on your preferred connection type; in my case I'll be using PostgreSQL.

[Read my next Blog on setting Up a Connection using Connection type: PostgreSQL]

4. Once the connection has been established, you can right-click on the connection and select Share if it's a connection all your transformations will be using; this shares the connection with your other transformations.

5. Since we will be sending data from a source to a target, we need to have an "Input" as the source
and an "Output" as the target.

6. Input

Download a C.S.V file from the internet,

or even create a TXT/C.S.V input file
as shown below.

Create a source if required:
Text_Source (comma delimited)

Employee_Number,First_Name,Last_Name,Middle_Initial,Age,Gender,Title
101,Ehizogie,Izebhijie,Sohail,24,Male,Developer
102,Fahad,Anjum,WithClause,23,Male,Developer
103,Gayatri,Sharma,A,24,Female,Accountant

Save it as .txt or .csv and this can be your input.

Since our input here is from a C.S.V file,
we open up our CSV file input step:
Step name: any name
Filename: browse to the file's path
Delimiter: , (comma)
Enclosure: "

Then select Get Fields and select OK.
Preview your data.

7. Output
Open up the Table output step.
Select the target schema.
Select the target table.

[Keeping in mind that the table must already exist in the DB; see the sketch below.]

Select OK!
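For example, a target table matching the sample file above could be created with something like this (PostgreSQL syntax; the table name employee and the column sizes are just assumptions):

CREATE TABLE employee (
    employee_number INTEGER PRIMARY KEY,
    first_name      VARCHAR(50),
    last_name       VARCHAR(50),
    middle_initial  VARCHAR(50),
    age             INTEGER,
    gender          VARCHAR(10),
    title           VARCHAR(50)
);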

Right-click on the Table output step to map the columns from source to target.

This is important in order to get the right data from the source into the right column in the target.
Then run the transformation.

As a beginner, keep in mind that errors are bound to occur, such as the type of data from the source not matching your target table format, and so on.

Here we can add a small transformation step to convert the data and take care of such format errors.
[In my next blog we can look into handling that.]

Now go to your target database and run SELECT * FROM table_name.

There you go!

Thanks
Sohail Izebhijie

Beginner’s Guide to E.T.L (Extract, Transform and Load) – Introduction

Introduction to E.T.L (Extract, Transform and Load)
This is a process related to data warehousing which involves extracting data out of the source system(s)
and placing it into a repository or target.

Extraction
Extracting the data from source systems (flat files or other operational systems) and converting it into one consolidated data warehouse format which is ready for transformation.

Transformation
Transforming the data may involve the following tasks:

  • Cleaning: one of the most important tasks in the transformation stage, because the source data will often contain data that the target system doesn't support or understand, hence cleaning is required.
    In some cases the source comes from many inputs, so lookups are important to avoid duplication.
  • Filtering: the source data may have a great many rows, but it's important to send only relevant data to your target and filter out the unnecessary data.
  • Business Rules: calculations or derivations can be performed here so we have correct and readable data at the target.

and many more.

Loading

After proper transformation, and once the data matches the business rules, loading the data into a target or repository is the final step in E.T.L (Extract, Transform and Load).

In my next blog we will look at the basics of loading data from source to target.

Thanks
Sohail Izebhijie