Beginner’s Guide to ETL (Extract, Transform and Load) – A Basic Process

Loading Data from Source to Target

Before we proceed, it is important to identify the tool you need to accomplish the ETL process. In my case I will be using the Pentaho Data Integration application (keep in mind that, irrespective of the tool you use, the steps or procedures are similar, but the approach might differ).

The following Steps can be followed:

1. Identify your source; it could be one of the following:
a. CSV file
b. Text file
c. A database
d. and so on

In my scenario, it is a CSV (comma-separated values) file.

2. Open up Spoon.bat.
Create a new transformation and add an “Input” step, then select what type of input you require:
we have Text File Input, CSV File Input, SAS Input, Table Input and so on. In my case, since I’ll be using a CSV file as the source, I’ll select the CSV File Input step.

3. Set up your connection based on your preferred connection type. In my case I’ll be using PostgreSQL.

[Read my next Blog on setting Up a Connection using Connection type: PostgreSQL]

4. Once the connection has been established, you can right-click on the connection and select Share if it is a connection that all your transformations will be using; this makes the connection available to your other transformations.

5. Since we will be sending data from a source to a target, we need to have an “Input” step as the source
and an “Output” step as the target.

6. Input

Download a CSV file from the internet, or create a TXT/CSV input file yourself, as shown below.

Create a source if required.
Text_Source (comma-delimited):

Employee_Number,First_Name,Last_Name,Middle_Initial,Age,Gender,Title
101,Ehizogie,Izebhijie,Sohail,24,Male,Developer
102,Fahad,Anjum,WithClause,23,Male,Developer
103,Gayatri,Sharma,A,24,Female,Accountant

Save as txt or csv and this can be your input.

Since our input is from a CSV file, we open up the CSV File Input step and configure it:
Step Name: any name
File Name: browse to the selected path
Delimiter: , (comma)
Enclosure: " (double quote)

Then select Get Fields and select OK.
Preview your data.

7. Output
Open up the Table Output step:
Select the target schema
Select the target table

[Keep in mind that the table must already exist in the database – see the example below]

Select OK!
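
The post assumes the target table already exists. As a minimal sketch (assuming PostgreSQL and a hypothetical table named employee that matches the sample CSV above), it could be created like this:

CREATE TABLE employee (
    employee_number INT PRIMARY KEY,
    first_name      VARCHAR(50),
    last_name       VARCHAR(50),
    middle_initial  VARCHAR(20),
    age             INT,
    gender          VARCHAR(10),
    title           VARCHAR(50)
);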

Right-click on the Table Output step to map the columns from source to target.

This is important in order to get the right data from the source into the right columns in the target.
Then run the transformation.

As a beginner, keep in mind that errors are bound to occur,
such as the type of data from the source not matching your target table format,
and so on.

Here we can add a small transformation step to convert the data and take care of such format errors.
[In my next blog we can look into handling that.]

Now go to your target database and run SELECT * FROM table_name to verify the load.
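
For example, assuming the hypothetical employee table from the sketch above:

SELECT * FROM employee;

This should return the three rows loaded from the sample CSV.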

There you go!

Thanks
Sohail Izebhijie

Talend Open Studio (TOS) versus Pentaho Data Integration (Kettle)

This blog compares Talend Open Studio (TOS) with Pentaho Data Integration (Kettle).

Talend Open Studio (TOS) and Pentaho Data Integration (Kettle) are two comprehensive and widely used Open Source ETL tools.

Pentaho Kettle

  1. The company started around 2001 (2002 was when Kettle was integrated into it).
  2. It has a stand-alone Java engine that processes the jobs and tasks for moving data between many different databases and files.
  3. It can schedule tasks (but you need an external scheduler for that, such as cron).
  4. It can run remote jobs on “slave servers” on other machines.
  5. It has data quality features, available from its own GUI: writing more customised SQL queries, JavaScript and regular expressions.
  6. With Spoon it is easy to build Data Integration procedures. The procedures can be run by the Kettle runtime in different ways: using the command-line utilities (Pan for transformations, Kitchen for jobs, which can also be stored in a database repository), a small server (Carte) or directly from the IDE (Spoon). Procedures are saved in XML files and interpreted by a Java library which is required to run the ETL tasks.
  7. Kettle is an interpreter of ETL procedures saved in XML format.
  8. Kettle comes with a graphical tool which is very intuitive and helps with the entire ETL process, from design to test and deployment.
  9. The Kettle IDE is slightly easier to start with but also less comprehensive.
  10. Kettle is more flexible about metadata and schema definition, and ETL procedures can be built quickly.
  11. Pentaho Data Integration (Kettle) is a Java (Swing) application and library. Kettle is an interpreter of procedures written in XML format. The features and components are a little less comprehensive than Talend’s, however this doesn’t restrict the complexity of the ETL procedures that can be implemented. Kettle provides a JavaScript engine (as well as a Java one) to fine-tune the data manipulation process.
  12. Kettle is also a good tool, with everything necessary to build even complex ETL procedures.
  13. Kettle is an interpreter of ETL procedures written in XML format. Kettle provides a Java or JavaScript engine to take control of data processing.
  14. Kettle (PDI) is the default tool in the Pentaho Business Intelligence Suite. The procedures can also be executed outside the Pentaho platform, provided that all the Kettle libraries and a Java interpreter are installed.
  15. Kettle makes it easy to deploy procedures in clustered environments and to save them in a database table.

Talend TOS

  1. It started around October 2006.
  2. It has a much smaller community than Pentaho, but has two finance companies supporting it.
  3. It generates Java or Perl code which you can later run on your server.
  4. It can schedule tasks (also using schedulers like cron).
  5. It has data quality features: from its own GUI, writing more customised SQL queries and Java.
  6. Talend uses a user-friendly and comprehensive IDE (similar to Pentaho Kettle’s) to design the procedures. Procedures can be tested in the IDE and compiled into Java code. The generated Java code can be modified to achieve greater control and flexibility.
  7. Talend Open Studio is a Java code generator tool.
  8. Talend comes with a graphical tool which is very intuitive and helps with the entire ETL process, from design to test and deployment.
  9. Talend Open Studio has a steeper learning curve; however its flexibility and power greatly compensate for the first impact.
  10. Talend Open Studio requires you to define the correct schema of the data to be processed, and the IDE helps a lot with this task. Metadata definition in Talend is an important feature and helps the maintainability and reliability of the procedures when they are deployed in production.
  11. Talend Open Studio is an Eclipse-based Java tool. The procedures are compiled into Java bytecode during deployment, which means that the entire Java ecosystem can potentially be used.
  12. Components and features are numerous, mixing both general-purpose tools and very specific components. Talend provides vendor-specific sets of RDBMS, NoSQL and Big Data components among the generic ones; this approach enables support for both vendor-specific features and generic database features.
  13. With Talend the full Java ecosystem can be used and it’s easy to use vendor-specific database features.
  14. As a code generator tool, Talend Open Studio translates procedures into compact and fast Java code.
  15. Talend Open Studio (TOS) is a generic ETL and Data Management tool, also integrated in the SpagoBI and Jasper Server BI platforms. Procedures are compiled into small Java packages, easily deployable and runnable in any Java-enabled environment.

Both ETL Tools

  1. Talend and Pentaho offer some of the most widely deployed Open Source ETL tools, used in several mission-critical implementations.
  2. Talend and Pentaho have strong community support and are healthy, well-known companies. Open Source Business Intelligence is growing fast and real-world applications are widespread.
  3. Talend Open Studio and Pentaho Kettle are both user friendly, well documented and have strong community support. Talend Open Studio requires more initial effort to get started; however its great potential is appreciated from the beginning.
  4. The interpreted nature of Kettle makes it sometimes slower than Talend in some tasks.
  5. Talend is a single-threaded code generator (Java or Perl), while Kettle uses a metadata-driven multi-threaded engine. So it’s your choice: either debug generated Java code (Talend) or debug a graphical data flow (Kettle).
  6. Pentaho Kettle is very easy to use and a good solution in Pentaho environments. Talend is a more general-purpose Data Management platform that can be used in conjunction with its Talend ESB, Talend Data Quality and Talend MDM companions.
  7. Pentaho is faster (maybe twice as fast) than Talend.
  8. Talend is more of a tool for people who are already writing a Java program and want to save a lot of time with a tool that generates code for them.

Talend uses Eclipse as its GUI; PDI uses SWT (which Eclipse also uses). Depending on how you look at it, both have advantages and disadvantages. PDI has a different look and feel for Eclipse users, but PDI will have fewer problems with Eclipse versions and upgrades.

Vishwanth Surapraju

Helical IT Solutions

Make Batch count for Incremental loading in Talend (TOS_5.3)

This blog will talk about how to make Batch count for Incremental loading in Talend (TOS_5.3).

First of all, we have the t_source and t_target tables.
Both tables (t_source, t_target) have data like this:

[Screenshot: t_source and t_target data before executing the job]
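
The table definitions themselves are not shown in this post; based on the columns used in the statements below, a rough sketch of the two tables (the types are assumptions) would be:

CREATE TABLE t_source (
    id     INT PRIMARY KEY,
    name   VARCHAR(50),
    city   VARCHAR(50),
    s_date TIMESTAMP
);

CREATE TABLE t_target (
    id     INT PRIMARY KEY,
    name   VARCHAR(50),
    city   VARCHAR(50),
    s_date TIMESTAMP
);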

Objective

INSERT into t_source

We inserted one record into t_source:
INSERT INTO t_source (id, name, city, s_date) VALUES (111, 'vic', 'del', '2014-03-01 01:02:00');

UPDATE t_source

We updated one record in t_source:
UPDATE t_source SET name = 'don', s_date = '2014-02-01 01:02:00' WHERE id = 109;

DELETE from t_source

We deleted one record from t_source:
DELETE FROM t_source WHERE id = 108;

We now have the following records in the t_source and t_target tables:

[Screenshot: t_source data after the insert, update and delete]

We want to make a batch count in Talend (TOS).

We created one job…

[Screenshot: the test job layout]

Details of the Job

Subjob (1)

We fetch max(id) from the t_target table and store it in a context (property) variable:

context.last_id = input_row.id;
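
The query feeding this assignment is not shown in the post; a sketch of what it might look like (assuming the result column is exposed to the next component as id) is:

SELECT MAX(id) AS id FROM t_target;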

Subjob (2)

We fetch min(id) and max(id) from t_source and store them in context variables:

context.sr_max_id = input_row.max_id;

context.sr_min_id = input_row.min_id;
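
Again, the underlying query is not shown; assuming the column aliases used in the assignments above, it could be:

SELECT MIN(id) AS min_id, MAX(id) AS max_id FROM t_source;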

Subjob (3)

We select the new rows from t_source:

"select * from t_source where id > " + context.last_id + " order by id"

and insert them into the t_target table, with id as the primary key.

Subjob (4)

We count the rows in t_source whose primary key lies between the minimum and maximum ids:

"select count(*) as batch_count from t_source where id between " + context.sr_min_id + " and " + context.sr_max_id

and store the result in a context variable, from which we calculate the batch count.

We define the batch count by dividing this count by the batch size (context.MT_COUNT = 5); context.max_count and context.min_count are 0 before the job is executed.

context.count = input_row.count;

System.out.println("Count of primary key from source " + context.UPLOAD_FILE_NAME + " Table : " + context.count);

int x = (context.count / context.MT_COUNT) + 3;

context.batch_count = x;

System.out.println("Batch Count : " + context.batch_count);

context.max_count = context.MT_COUNT;

context.min_count = context.sr_min_id;

context.max_count = context.sr_min_id + context.max_count;
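
As a worked example (the numbers are my own, not from the post): if the count between sr_min_id and sr_max_id is 12 and MT_COUNT is 5, then x = (12 / 5) + 3 = 2 + 3 = 5 with integer division, so the iteration in the next subjob will run 5 batches of up to 5 ids each; the extra batches presumably leave headroom for gaps in the id sequence.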

SubJob (5)

We iterate context.batch_count times, calling another job (test123) on each iteration.

1. Test123 Job

[Screenshot: the test123 job layout]

a. SubJob (5.1)

We print the batch number and the range from min_count to max_count:

System.out.println("Batch " + Numeric.sequence("s1", 1, 1) + ": Count of " + context.min_count + " to " + context.max_count);

b. SubJob (5.2)

We select the rows from t_source whose primary key falls within the current batch range:

"select * from t_source where id >= " + context.min_count + " and id <= " + context.max_count + " order by id"

and collect the data into a Buffer output.

c. SubJob (5.3)

We compare the Buffer input (t_source) and the t_target table with an inner join in a tMap. If any rows come out of the reject output, they are updated in t_target.

t_target SQL query: "select * from t_target where id >= " + context.min_count + " and id <= " + context.max_count + " order by id"

d. SubJob (5.4)

We compare t_target and the Buffer input (t_source) with a left outer join in a tMap. We filter on t_source.id == 0 (i.e. target rows with no matching source row) and, if any rows come out, they are deleted from t_target.

t_target SQL query: "select * from t_target where id >= " + context.min_count + " and id <= " + context.max_count + " order by id"

We also have a tJavaRow (min and max of the primary key) in which:

context.min_count = input_row.id;

context.max_count = context.min_count + context.MT_COUNT;

Results

We executed the job with the context variable MT_COUNT defined as 5.

[Screenshot: job execution output]

Finally, we have the following records in t_source and t_target:

[Screenshot: t_source and t_target data after executing the job]

Thanks & regards

Vishwanth suraparaju

Senior ETL Developer