Loops in Pentaho Data Integration


 

If only there were a Loop component in PDI *sigh*.

But we can achieve looping easily with the help of a few PDI components. Let's take the requirement of having to send mails. If the mail IDs and files to be sent were fixed, we could just use a property file and pass the mail IDs as parameters.

But here the mail IDs come in from a table source, so they will never be the same.
So how did I approach this?

Let's say the mail IDs come from the table mail_address. My aim here is to first add a row number to each record and a maxcount of all the records on each line. *Keep the maxcount in mind.*

and then pass the row number as a filter clause:

where row_number = 1

Now the query should execute and return one row, and the column values such as sender mail and receiver mail will be set into variables and then passed to a Mail transformation.

But how can we increase the value in the WHERE clause, i.e. how can we make

where row_number = 1 become

where row_number = 2

where row_number = 3 and so on.
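
The core of the approach is a numbering query. Here is a minimal sketch of the pattern, assuming the mail_address table mentioned above has a receivers_email column (the full production query appears further below):

SELECT * FROM (
    SELECT receivers_email,
           ROW_NUMBER() OVER (ORDER BY receivers_email) AS row_num,
           COUNT(*) OVER () AS maxcount
    FROM mail_address
) numbered
WHERE row_num = 1

Each pass of the loop swaps the hard-coded 1 for the current value of the loop variable.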

 

So here is where looping is required: we take a variable and increment its value.

1. Take a Set Variables entry at the job level [Initialize Loop], create a variable named loop and assign it your initial value, as shown below. In my case the loop value = 1.

[Screenshot: Set Variables entry, Initialize Loop]

 

Next, take a transformation to get the variables and set the variables, as shown below:

[Screenshot: get/set variables transformation]

In my scenario I also have a Modified Java Script step, because a few conversions were required; after that the variable is set.

Next we need a transformation where we pass our loop variable into our query,
as shown below:

[Screenshot: Table Input using the ${loop} variable]

 

SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY createfolderpath) AS row_num,
           COUNT(dummy) OVER (PARTITION BY dummy) AS maxcount
    FROM (
        SELECT * FROM (
            SELECT DISTINCT REPLACE(createfolderpath, '/', '\') AS createfolderpath, 1 AS dummy
            FROM reportpaths
        ) base
    ) base2
    CROSS JOIN [TEST_SLSDMBKP].[dbo].[job_mailing_address]
    WHERE scheduled_flag = 'Y'
) base3
WHERE row_num = ${loop}
ORDER BY row_num, receivers_email, createfolderpath

Here the Table Input outputs exactly one row, so I pass the required columns into the Modified Java Script step and then set them as variables.

[Screenshot: Modified Java Script and Set Variables steps]

Next, at our job level, we use a JavaScript entry to increment the loop value (1 to 2, 2 to 3, and so on).

Next, at our job level, we add a transformation which does our actual requirement of sending mails, as shown below:

[Screenshot: mail-sending transformation]

And we pass all the required variables into the Mail step, as shown below:

[Screenshot: Mail step configuration]

RECEVIERSEMAIL and the other fields are values from our DB that we set as variables.

After that we go to our job level and use a Simple Evaluation entry

to stop the loop after a particular condition:

[Screenshot: Simple Evaluation condition]

Now ${MAXCNT} is the maxcount of all the records from our query, which we set as a variable.

Once the loop number is 6 and the maxcount is 5, the evaluation fails and the loop ends; otherwise

it continues to the next entry, a Wait entry, whose importance was covered in a previous blog.

So that's it, we went through looping in PDI.

If you have any other approach, please feel free to drop it in the comments below.

Thanks
Sohail Izebhijie

Different ways of implementing loop in PDI



We come across certain scenarios where running a job/transformation in a loop becomes necessary in Pentaho. Below are the different ways of implementing a loop in a PDI job.

 

1. Using Execute for every input row

Build a job, and let the first transformation retrieve all the rows via a Table Input step. Use a "Copy rows to result" step to save the result. Create a second transformation configured for single-row execution (refer to the image of the advanced transformation settings). Use "Get rows from result" to start single-row processing.

[Screenshot: job using Copy rows to result / Get rows from result]

 

 

2. Using the Simple Evaluation step

[Screenshot: job looping with a Simple Evaluation entry]

  • Get the record count for the loop.

  • Assign the count to a variable.

  • Call the variable in the Simple Evaluation step (as shown in the figure below).

    [Screenshot: Simple Evaluation step settings]

  • Give the appropriate condition according to the requirement.

  • Connect the hop to the job which has to run in a loop.

Guide to Slowly Changing Dimensions [Intro + Type 1]


Firstly, what is a dimension?
A dimension is a structure that categorizes facts and measures which can be used to understand business requirements.

What is a Slowly Changing Dimension?
A Slowly Changing Dimension is a dimension that changes slowly over time.

Why is it needed?
In data warehousing, which deals with historical data, tracking changes is important as it gives a better understanding of the business.

An Example
Imagine a customer dimension in a taxation department which holds records of addresses and so on.
When a customer changes his/her address, it is important to track the address change, i.e. from old to new.

Types of Slowly Changing Dimension.
1. Type 1
2. Type 2
3. Type 3
4. Type 4
5. Type 5

1. Type 1:
This follows the technique of replacing the previous value; no historic data is kept. It is commonly used for correction or updating of records.
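
Conceptually, a Type 1 change is just an overwrite. Here is a minimal SQL sketch, assuming a hypothetical customer_dim table with a client_id business key and a region column (matching the example later in this post):

-- Type 1: the new value simply overwrites the old one, no history is kept
UPDATE customer_dim
SET region = 'NJ'
WHERE client_id = 101;

-- records whose business key is not found yet are inserted as new rows
INSERT INTO customer_dim (client_id, region)
VALUES (105, 'NY');

In PDI, the Insert/Update step performs this lookup-then-update-or-insert logic for every incoming row.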

In this blog I will show you how to create a Type 1 Slowly Changing Dimension using Pentaho Data Integration.

Steps
1. Identify your source or create one if you have to.
[Screenshot: CSV source data]

2. Load Data into the Database table
[Screenshot: initial data loaded into the table]

Now open a new transformation.
3. Select an input step for your source; in my case it's a CSV input.
4. Select a Get System Info step [optional].
Double-click on it and set:
For Type: System Date (fixed)
For Name: load_date (any name)

5. If you are using a Get System Info step, then this step is necessary.
Select a Join Rows (Cartesian product) step.
What this does is cross join the new load_date column onto the source table.
Select OK.

6. Select a Select Values step;
this is used to add, remove or alter fields in the source before they go to the target, if required.

Note: make sure all the steps are connected as shown below.
[Screenshot: transformation before the Insert/Update step]
Double-click on the Select Values step.

Click on Get fields to select.
Click on Meta-data and again select Get Fields to change.
Click OK!

7. Select an Insert/Update step.
Select your target schema.
Select your target table.

Now select Get Fields
and remove the fields you don't require, leaving the main key or primary key to look up values on,
in my case client_id.

Then

select Get Update Fields,
and in the Update column select N for the columns you don't want to update,
in my case, again, client_id.

Now, if you used the Get System Info step to add a load_date column, then the step below is important.

Remember we added load_date; we want to add that column to the target table.

So select SQL
and

look through the pre-defined query shown and you will see an ALTER TABLE statement to add the load_date column.
All we need to do is select Execute, then run a query to check that the column has been added, as shown below.

[Screenshot: load_date column added]

Click OK!
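
For reference, the generated statement is typically along these lines (a sketch only; the exact syntax, column type and table name depend on your target table and database):

ALTER TABLE customer_dim ADD COLUMN load_date TIMESTAMP;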

8. Now go to your source, edit one record and also add a new one, as shown below:

[Screenshot: edited source data]
and then Save it

I have added a new row and also edited Sohail's region from NY to NJ.

Now your whole transformation should look something like this, with all the connections pointing to the right steps:

[Screenshot: complete transformation]

Now Run the Transformation

After it runs successfully,

check your data via a query:

[Screenshot: query result after the run]

As you can see, the data has been replaced/overwritten for Sohail [NY to NJ],
and one new row has also been added.

In my next blog we will discuss Type 2 Slowly Changing Dimensions.

Thanks
Izebhijie Sohail Ehizogie

Handling Rollback and Database transactions in Pentaho Data Integration.



 

 

While inserting data into a database table, it often happens that a job/transformation fails partway through the load, and we would like the whole transaction to be rolled back.

In general, the Table Output step in PDI will insert and commit any number of records until there is an error. Consider an example where the ETL process handles 5000 records and the 3999th record has a string instead of an integer (e.g. the DDL defines the field as Integer): the job/transformation will come to a sudden halt.

The problem is, though, that we now have 3,999 records in the DB table from this ETL run. In some situations this might not be an issue, e.g. if you have a CDC process set up which can deal with it (it can just start off from the highest id or timestamp), but in other scenarios this situation might be a problem.

Pentaho Data Integration provides a transactional feature which is really quite easy to use. Make sure your DB engine supports transactions! E.g. MySQL's MyISAM engine does not support transactions, so the transformation would run and insert records even though there was an error.

Transactions in transformations: simply enable the "Make the transformation database transactional" option in the "Miscellaneous" tab of the transformation settings dialog.

[Screenshot: transformation settings, Miscellaneous tab]

Note: there are some caveats with making a transformation database transactional:

  • It disables the "Use batch update for insert" option in the Table Output step and also ignores the Commit size setting, so effectively it will slow down the insert operation.
  • It will not work if you use database partitioning, where multiple connections per step copy need to be created. Performance-wise it doesn't make sense in that case anyway.

Beginner’s Guide to E.T.L (Extract, Transform and Load) – A Basic Process

Loading Data from Source to Target

Before we proceed, it's important to identify the tool you need to accomplish the ETL process; in my case I will be using Pentaho Data Integration (keep in mind that irrespective of the tool you use, the steps or procedures are similar, but the approach might differ).

The following Steps can be followed:

1. Identify your Source, it could be the following:
a. C.S.V file
b. Text file
c. A database
d. and so on

In my scenario it is a C.S.V (comma-separated values) file.

2. Open up Spoon.bat.
Create a new transformation and select an "Input" step, then select what type of input you require:
we have Text File Input, C.S.V File Input, SAS Input, Table Input and so on. In my case, since I'll be using a C.S.V file as the source, I'll select the C.S.V File Input step.

3. Set up your connection based on your preferred connection type; in my case I'll be using PostgreSQL.

[Read my next Blog on setting Up a Connection using Connection type: PostgreSQL]

4. Once the connection has been established, you can right-click on the connection and select Share if it is a connection that all your transformations will be using; this will share the connection with the other transformations.

5. Since we will be sending data from a source to a target, we need to have an "Input" step as the source
and an "Output" step as the target.

6. Input

Download a C.S.V file from the internet,

or even create a TXT/C.S.V input file,
as shown below.

Create a source if required
Text_Source (comma delimited)

Employee_Number,First_Name,Last_Name,Middle_Initial,Age,Gender,Title
101,Ehizogie,Izebhijie,Sohail,24,Male,Developer
102,Fahad,Anjum,WithClause,23,Male,Developer
103,Gayatri,Sharma,A,24,Female,Accountant

Save as txt or csv and this can be your input.

Since our input is from a CSV file,
we open up the C.S.V File Input step:
Step Name: any name
File Name: browse to the selected path
Delimiter: , (comma)
Enclosure: "

Then select Get Fields, select OK
and preview your data.

7. Output
Open up the Table Output step.
Select the target schema.
Select the target table.

[Keeping in mind the table must already exist in the DB]
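
For reference, here is a minimal sketch of such a target table in PostgreSQL, matching the sample source above (the table name employee is just an assumption):

CREATE TABLE employee (
    employee_number INTEGER PRIMARY KEY,
    first_name      VARCHAR(50),
    last_name       VARCHAR(50),
    middle_initial  VARCHAR(10),
    age             INTEGER,
    gender          VARCHAR(10),
    title           VARCHAR(50)
);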

Select OK!

Right-click on the Table Output step to map the columns from source to target.

Now this is important in order to get the right data from the source into the right column in the target.
Then run the transformation.

As a beginner, keep in mind that
errors are bound to occur,
such as the type of data from the
source not matching your target table format,
and so on.

Here we can add a small transformation step to convert the data and take care of such format errors.
[In my next blog we can look into handling that]

Now go to your target database and run SELECT * FROM table_name

there you go!

Thanks
Sohail Izebhijie

Beginner’s Guide to E.T.L (Extract, Transform and Load) – Introduction

Introduction to E.T.L (Extract, Transform and Load)
This is a process related to data warehousing which involves extracting data out of the source system/systems
and placing it into a repository or target.

Extraction
Extracting the data from source systems (flat files or other operational systems) and converting it into one consolidated data warehouse format which is ready for transformation.

Transformation
Transforming the data may involve the following tasks:

  • Cleaning: one of the most important tasks in the transformation stage, because the source data will always contain data that the target system doesn't support or understand, hence cleaning is required.
    In some cases the source can come from many inputs, so lookups are important to avoid duplication.
  • Filtering: the source data may have many rows, but it's important to send only relevant data to your target and filter out the unnecessary data.
  • Business Rules: calculations or derivations can be performed here so we have correct and readable data at the target.

and many more.

Loading

After proper transformation, and once the data matches the business rules, loading the data into a target or repository is the final step in E.T.L (Extract, Transform and Load).

In my next blog we will look into basic loading of data from source to target.

Thanks
Sohail Izebhijie

Merge Join Vs Stream lookup in Pentaho DI


Merge Join:

It joins two data sets which are coming from two Table Input steps. In this step the types of joins listed below are available (a SQL sketch of the equivalent joins is shown after the list):

  • FULL OUTER: all rows from both sources will be included in the result, with empty values for non-matching keys in both data streams
  • LEFT OUTER: all rows from the first source will be in the result, with empty values for non-matching keys in the second data stream
  • RIGHT OUTER: all rows from the second source will be in the result, with empty values for non-matching keys in the first data stream
  • INNER JOIN: only rows having the same key in both sources will be included in the result
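
For reference, the Merge Join step produces the same result sets as the equivalent SQL joins; here is a minimal sketch with two hypothetical tables, orders and customers:

-- INNER JOIN: only rows with matching keys in both inputs
SELECT o.order_id, c.customer_name
FROM orders o
INNER JOIN customers c ON c.customer_id = o.customer_id;

-- LEFT OUTER: every row from the first input, NULLs where the second has no match
SELECT o.order_id, c.customer_name
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id;

RIGHT OUTER and FULL OUTER behave the same way, keeping all rows from the second input or from both.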

Note: in this step the rows are expected to be sorted on the specified key fields. When using the Sort rows step beforehand, this works fine.

 Stream Lookup:

The Stream Lookup step allows you to look up data using information coming from other steps in the transformation. The data coming from the lookup (source) step is first read into memory and is then used to look up data from the main stream.

Preserve memory:  Encodes rows of data to preserve memory while sorting. (Technical background: Kettle will store the lookup data as raw bytes in a custom storage object that uses a hashcode of the bytes as the key. More CPU cost related to calculating the hashcode, less memory needed.)

Key and value are exactly one integer field: preserves memory during the lookup. Note: works only when "Preserve memory" is checked. Cannot be combined with the "Use sorted list" option.
(Technical background: the lookup data is stored in a custom storage object that is similar to the byte array hashmap, but it doesn't have to convert to raw bytes. It just takes a hashcode of the long.)

Use sorted list: enable to store values using a sorted list; this provides better memory usage when working with data sets containing wide rows. Note: works only when "Preserve memory" is checked. Cannot be combined with the "Key and value are exactly one integer field" option. (Technical background: the lookup data is put into a tuple and stored in a sorted list. Lookups are done via a binary tree search.)

Thank you

Lalitha

Memory Group By Vs Group By in Pentaho DI


Group By and Memory Group By: these steps are useful for calculating aggregate values over groups of fields.

Even though both serve the same purpose, there are some differences and similarities, as follows:

1) Group By works correctly only when the input rows are in sorted order; otherwise it will only give accurate results for consecutive records. So before using the Group By step, the input should be sorted. Memory Group By, on the other hand, also handles non-sorted records: it handles the grouping in memory and gives the output based on the selected aggregate option.

2) In Group By there is an option to include all rows. If you enable it, the step returns all the records, not just the aggregated rows. To differentiate between the two types of rows in the output, we have to provide a boolean flag column. Memory Group By does not have this option.

3) Group By has a Temporary files directory option, which becomes available when "Include all rows" is enabled. It selects the directory in which temporary files are stored (needed when "Include all rows" is enabled and the number of grouped rows exceeds 5000); the default is the system's standard temporary directory. Because this option lets the step spill data to disk, Group By is the preferred choice for large amounts of data. Memory Group By cannot handle huge amounts of data, as it keeps all the grouped data in memory, which causes performance issues; it works fine for small data sets only. So when the number of rows is too large to fit into memory, use the combination of the Sort rows and Group By steps.

4) The Group By step has one more additional option, Add line number, restart in each group: enable it to add a line number that restarts at 1 in each group.

5) Always give back a result row: this option is the same in both the Group By and Memory Group By steps. If you enable it, the step will always return a result row, even if there is no input row. This can be useful if you want to count the number of rows; without this option you would never get a count of zero (0).

6) The fields that make up the group: this option also works the same in both steps. After retrieving fields using the Get Fields button, designate the fields to include in the group.

7) Aggregates: both steps offer the same aggregate functions. Specify the fields that must be aggregated, the method, and the name of the resulting new field.
Here are the available aggregation methods (a SQL sketch of the equivalent aggregation follows the list):

  • Sum
  • Average (Mean)
  • Median
  • Percentile
  • Minimum
  • Maximum
  • Number of values (N)
  • Concatenate strings separated by , (comma)
  • First non-null value
  • Last non-null value
  • First value (including null)
  • Last value (including null)
  • Cumulative sum (all rows option only!)
  • Cumulative average (all rows option only!)
  • Standard deviation
  • Concatenate strings separated by <Value>: specify the separator in the Value column
  • Number of distinct values
  • Number of rows (without field argument)
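
For orientation, the aggregation these steps perform corresponds to a plain SQL GROUP BY; here is a minimal sketch over a hypothetical sales table:

SELECT customer_id,
       SUM(amount) AS total_amount,                     -- Sum
       AVG(amount) AS average_amount,                   -- Average (Mean)
       COUNT(*) AS number_of_rows,                      -- Number of rows
       COUNT(DISTINCT product_id) AS distinct_products  -- Number of distinct values
FROM sales
GROUP BY customer_id;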

Thank You

Lalitha

ETL Checklist


The ETL checklist covers the things to know when you are starting an ETL project.

Database connections: the connection metadata and the JDBC jar files required for the source and target databases.

Source data: we may have multiple sources, and sometimes a single source. We need to check whether there is any redundant data, and confirm the data types and data format, and on which basis the data is formatted.

How does the customer want to maintain the historical data? Are there any flags for deleted/invalid records? When should SCD Type 2 and SCD Type 3 changes be used?

Logging and auditing: we need to log all the details so that if any error occurs we can track exactly where it happened, because a project will have a number of jobs and sub-jobs.

So we have to track how many records were successfully inserted/updated,

how many records got rejected, and the job name, sub-job name and component name (so that if any error occurs we can track it easily).
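
A minimal sketch of what such an audit table could look like (the table and column names are just assumptions, based on the details listed above):

CREATE TABLE etl_job_audit (
    job_name          VARCHAR(100),
    sub_job_name      VARCHAR(100),
    component_name    VARCHAR(100),
    records_inserted  INTEGER,
    records_updated   INTEGER,
    records_rejected  INTEGER,
    error_description VARCHAR(4000),
    run_timestamp     TIMESTAMP
);

Each job or sub-job can then write one row per run, including failed runs.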

Rollback and restarting jobs: in an ETL process, the component most affected by failures is the target database. So we need to keep checkpoints in our job, so that if an error occurs and the job stops, restarting the job resumes processing from the last saved checkpoint. That way there is no need to start the entire process again, which saves time and also improves the performance of the project.

Batch loading: when we have huge data volumes, it is always better to load the data in multiple chunks rather than loading everything at once, so batch loading should be implemented.

Dependencies: in the target database, we need to check for dependencies such as lookup tables and parent tables connected by foreign key relationships. Do they have compatible data types? The parent tables need to be loaded first so that the foreign key relationships are not disturbed and can be handled easily.

Alerting/notifications: we have to implement an alerting/notification mechanism. If any error occurs, it should trigger a mail including the error description, job name, component name, etc.

Documentation: this is also one of the important tasks. Document the mappings in some format; this will be useful as a reference guide.

Thank you

Lalitha

 

Types of Facts in Data warehouse


What is a Fact Table?

A fact table consists of the measures/metrics/facts of a business process. Typically it contains two types of data: foreign keys holding the primary keys of the dimension tables, and the measures themselves, such as sales, revenue, units of product, etc.

[Diagram: fact table surrounded by dimension tables]
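
A minimal sketch of such a fact table, assuming hypothetical date_dim, product_dim and customer_dim dimension tables already exist:

CREATE TABLE sales_fact (
    date_key     INTEGER NOT NULL REFERENCES date_dim (date_key),         -- foreign keys to the dimensions
    product_key  INTEGER NOT NULL REFERENCES product_dim (product_key),
    customer_key INTEGER NOT NULL REFERENCES customer_dim (customer_key),
    sales_amount NUMERIC(12,2),                                           -- measures
    units_sold   INTEGER
);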

Measure Types:

Additive: measures which can be added across all of the dimensions are known as additive measures.

[Table: additive measure, sales amount by product, customer and date]

In the above table, you can add the sales amount with respect to product, customer and also by date.
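
Against the sales_fact sketch above, the same additive measure rolls up along any of its dimensions:

SELECT product_key,  SUM(sales_amount) AS total_sales FROM sales_fact GROUP BY product_key;  -- by product
SELECT customer_key, SUM(sales_amount) AS total_sales FROM sales_fact GROUP BY customer_key; -- by customer
SELECT date_key,     SUM(sales_amount) AS total_sales FROM sales_fact GROUP BY date_key;     -- by date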

Semi-Additive: measures which can be added across only some of the dimensions are known as semi-additive measures.

[Table: semi-additive measure, employee head count by department and date]

On a given date we can add the head counts of the departments to get the total number of employees in the organization, e.g. on 20150901 it is 35+150=185. But across dates you cannot say the total number of employees in the Sales department is 29+35=64; that would be wrong data. So the measure can only be added across some of the dimensions.

Non-Additive: measures which cannot be added across any dimension are known as non-additive measures.

[Table: non-additive measure, profit margin]

We cannot meaningfully total the profit margin along any of the dimensions shown in the above table.

Factless fact: this is not really another measure type. It can happen in real life that a fact table does not store any measure-like data at all and is just a set of event records. This is known as a factless fact.

[Table: factless fact, student, course, date]

In the above table there is no measure; all three columns are derived from dimensions only. There are no facts in this scenario, but we can still retrieve information such as how many students attended the Java course at a particular time.
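
Even without measures, the events can be counted; a minimal sketch, assuming a hypothetical attendance_fact table with student_key, course_key and date_key columns:

SELECT course_key, date_key, COUNT(*) AS attendances
FROM attendance_fact
GROUP BY course_key, date_key;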

Fact Types:

1) Transactional: the most common type of fact table; each record is an event of a transaction which will involve most of the dimensions. So it contains robust data, which enables maximum slicing and dicing of the data.

2) Periodic snapshots: the measurements occurring over a standard period, such as a day, a week, or a month. These fact tables are uniformly dense in their foreign keys because even if no activity takes place during the period, a row is typically inserted in the fact table containing a zero or null for each fact.

3) Accumulating snapshots: the measurement events occurring at predictable steps between the beginning and the end of a process, e.g. the processing of an order. An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process.

Thank You

Lalitha