Loading CSV File BatchWise -Talend

In an ETL job the source is often a flat file, for example a large CSV file. To load such a file, the job would normally have to read the whole file at once, which puts a heavy load on the server running the job. Instead of reading the whole file, it is better to read it in smaller batches and process those. The following job shows how this is done: a Java routine reads the CSV file with a BufferedReader and writes it out as smaller chunks, which are then processed from the folder where the chunk files are placed.

The following Java routine reads a file line by line and splits it into chunks of a configurable batch size (a number of lines).

package routines;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class splitfile {

    private File file;
    private int numberOfLinesThreshold;
    private String encoding;
    private int readerBufferSize;
    private String destinationFileName;

    public splitfile(File file, int numberOfLinesThreshold, String encoding,
            int readerBufferSize, String destinationFileName) {
        this.file = file;
        this.numberOfLinesThreshold = numberOfLinesThreshold;
        this.encoding = encoding;
        this.readerBufferSize = readerBufferSize;
        this.destinationFileName = destinationFileName;
    }

    public void splitFile() {
        String absolutePath = this.file.getAbsolutePath();
        int index = absolutePath.lastIndexOf(".");

        String path = absolutePath.substring(0, index);
        String extension = absolutePath.substring(index + 1);

        // Store the header for each file
        String header = "";
        int fileNameCounter = 1;
        int counter = 0;
        StringBuilder stringBuilder = new StringBuilder();
        String currentLine;
        String lineSeparator = System.lineSeparator();

        String fileName = path + "." + extension;

        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName),
                        this.encoding), this.readerBufferSize)) {
            while ((currentLine = bufferedReader.readLine()) != null) {
                if ("".equals(header)) {
                    // The first line of the source file is the header
                    header = currentLine;
                    continue;
                }

                if (counter == this.numberOfLinesThreshold) {
                    // Threshold reached: flush the current chunk to disk
                    writeFile(fileNameCounter, stringBuilder,
                            this.destinationFileName, extension);

                    // For next file name
                    fileNameCounter++;

                    // For next file content: repeat the header, then add
                    // the current line
                    stringBuilder = new StringBuilder();
                    stringBuilder.append(header).append(lineSeparator);
                    stringBuilder.append(currentLine).append(lineSeparator);

                    // As you have already added one line to the new file
                    counter = 1;
                } else {
                    if (counter == 0) {
                        stringBuilder.append(header).append(lineSeparator);
                    }
                    stringBuilder.append(currentLine).append(lineSeparator);
                    counter++;
                }
            }

            // Write the last few remaining lines as another file
            if (stringBuilder.length() > 0) {
                writeFile(fileNameCounter, stringBuilder,
                        this.destinationFileName, extension);
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    private void writeFile(int fileNameCounter, StringBuilder stringBuilder,
            String fileName, String extension) {
        // Create a file and write the string builder
        String name = getFileName(fileNameCounter, fileName, extension);
        try (BufferedWriter bufferedWriter = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(name),
                        this.encoding))) {
            bufferedWriter.write(stringBuilder.toString());
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    private String getFileName(int fileNameCounter, String fileName,
            String extension) {
        return fileName + "_" + fileNameCounter + "." + extension;
    }
}

The following is the main job, which calls the child job and passes it the path containing the smaller chunk files.


Step 1: Create Split Files folder

Creates a folder to hold all the split files and writes the folder path to the context variable “Src_Split”. The folder is created inside “Src_Dir” with the name Split_(File Name).
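In plain Java, the folder creation in this step amounts to the following (the directory and file name are stand-ins for the context variables):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class CreateSplitFolder {
    public static void main(String[] args) throws IOException {
        // Stand-ins for context.Src_Dir and the source file name
        String srcDir = Files.createTempDirectory("src").toString();
        String srcCsv = "orders";

        // Equivalent of what Step 1 does before assigning context.Src_Split
        File splitFolder = new File(srcDir, "Split_" + srcCsv);
        System.out.println(splitFolder.mkdirs());
    }
}
```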


Step 2: Split File

Calls the routine “splitfile”, which reads the file in parts with the help of a BufferedReader. It takes 5 parameters:

1. A File instance, with the path to the source file properly escaped.

2. The batch size, taken from the context variable “BatchSize”.

3. The character encoding of the file, taken from its context variable.

4. The buffer size, taken from the context variable “BufferSize”.

5. The destination file name: the destination folder created in Step 1, followed by the file name, with “\” handled by an escape sequence, i.e. “context.Src_Split+context.Src_Csv”. (Src_Csv is just the name of the file, with no extension.)
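To see the whole Step 2 flow outside Talend, here is a self-contained sketch that splits a sample file the same way the routine does (paths and batch size are illustrative; for brevity it reads the sample into memory, whereas the routine streams with a BufferedReader):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("split_demo");
        Path src = dir.resolve("orders.csv");
        // Sample source file: header + 5 data rows
        Files.write(src, List.of("id,name", "1,a", "2,b", "3,c", "4,d", "5,e"));

        int batchSize = 2;                                // context.BatchSize
        String dest = dir.resolve("orders").toString();   // context.Src_Split + context.Src_Csv

        List<String> lines = Files.readAllLines(src);
        String header = lines.get(0);
        int fileCounter = 1;
        for (int i = 1; i < lines.size(); i += batchSize) {
            // Each chunk starts with the header, then up to batchSize rows
            List<String> chunk = new ArrayList<>();
            chunk.add(header);
            chunk.addAll(lines.subList(i, Math.min(i + batchSize, lines.size())));
            Files.write(Path.of(dest + "_" + fileCounter + ".csv"), chunk);
            fileCounter++;
        }
        // 5 rows with a batch size of 2 -> 3 chunk files
        System.out.println(fileCounter - 1);
    }
}
```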


Step 3: Split Folder

Set the Directory to “context.Src_Split” and the File Mask to “*.csv”.
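What tFileList does with this configuration is essentially the following (directory and file names are illustrative):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileMaskDemo {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("src_split");
        Files.createFile(dir.resolve("orders_1.csv"));
        Files.createFile(dir.resolve("orders_2.csv"));
        Files.createFile(dir.resolve("readme.txt")); // does not match the mask

        // Equivalent of Directory = context.Src_Split, File Mask = "*.csv"
        File[] matches = dir.toFile().listFiles((d, name) -> name.endsWith(".csv"));
        System.out.println(matches.length);
    }
}
```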

Step 4: CSV2Stg_Stg_Dim_Organization

Calls the job which reads the files and processes them to load Stg_Dim_Organization. Just set 2 context variables:

1. Src_Folder to ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))

2. Src_Csv to ((String)globalMap.get("tFileList_1_CURRENT_FILE"))


Step 5: CSV2Stg_Stg_Dim_Department

Calls the job which reads the files and processes them to load Stg_Dim_Department. Just set 2 context variables:

1. Src_Folder to ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))

2. Src_Csv to ((String)globalMap.get("tFileList_1_CURRENT_FILE"))

Using MDM components in Talend


There are basically 2 ways a component can be used in Talend:

1. Built-in

Built-in means entering all the properties of a component manually yourself: details like the server connection details, entities, elements, XPath, and so on.


Once all the connection details have been entered as shown in the image above, go to the Advanced settings and manually give the Loop XPath query, and in the mapping give the columns and the XPath query as defined for the entity in the MDM UI.


2. Repository

Like any other DB connection, one has to create an MDM connection in the Repository in order to drag and drop the schema of the MDM components.

Step1: Right click on MDM Connection listed under Metadata in the left-side pane and select Create New Connection.


Step2: Give in the following details and click next


Step3: Give in the server details like the username, password and URL, and click on Check to test the connection.


Step4: If the given details are correct and the connection is established, a window like the following pops up.


Step5: Select the data model and data container which you created in the MDM UI; the ones that are deployed and available are shown in the dropdown menu.


Step6: Retrieve the entities by right clicking on the connection created in the repository and selecting Retrieve Entity.


Step7: Select the mode in which you want to use the entity: as an Input, Output or Receive.


Step8: Select the entity from the dropdown menu, give the name and click Next.


Step9: Give the target schema XPath and click Finish.


The entity is now available in the metadata, on the left side of the UI.


This entity can now be dragged and dropped into a job. Its schema is also available, so the manual inputs required in built-in mode are not needed.


Work Flow of Talend MDM


Talend MDM stores and manages the master data, and also provides a platform for data modelling of the master data. Domains are called entities here. The data is stored either in XML format or in a relational database as SQL, depending on the preference set during installation. There are 2 important parts of MDM one must understand before starting with MDM:

  1. Data Model: This is where one defines the schema and assigns properties to an entity (domain). The structure of an entity is defined here: the key, datatypes, columns, foreign key constraints, dependencies, etc.
  2. Data Container: This is where the data physically exists. It can be in either XML or SQL format, as per the settings made while installing MDM. Once the schema/structure is defined, the data container is provided with the XPath of the structure defined in the data model.

Following is an example of defining a product data model, with the entities Product, Sales and Customers.

Step1: Go to the Talend UI with MDM and create a sample project, selecting Java or Perl as the source language. Then in the UI, find Data Model in the MDM Repository pane on the left hand side, right click and click on +NEW.


Give the name as product and click “OK”.
Step2: Give the description, then add an entity by right clicking on the pane below Data Model Entities, and name it “Product”.

Step3: Define the key, change its simple type to “Custom” and select “Auto Increment” from the drop down.

Step4: Add the elements after the key: “Description” and “Code”.


Step5: Repeat all the steps and create 2 more entities with the names “Customer” and “Sales”.

Step6: Add an extra element “product” in Customer and put a foreign key constraint on it by right clicking and pointing it to the “ProductId” element of the Product entity. Also add a “product” element in Sales and do the same.


Step7: Then create a view by right clicking and selecting Generate Default Browse Items View.


You can view the items under View, under Web Filters in the left panel.


Step8: Deploy the data model, data container and views to the local server by adding the server details in the Properties pane below.


Note: This automatically creates a data container by assigning an XPath to all the elements in all the entities.
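For illustration only (the element names follow the entity defined above; the exact XML shape is an assumption), a single record of the Product entity stored in the container would look something like:

```xml
<Product>
  <ProductId>1</ProductId>
  <Description>Sample product</Description>
  <Code>P-001</Code>
</Product>
```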

Finally, you can see it on the server at http://localhost:8180/talendmdm/ui . Product is available in both the data model and the data container. Click on “Save”, then click on “Browse Master Data Items” to view the entities.


There are 2 ways to put data in:
1. Directly, a single record at a time, from the browser (one must have admin rights for write permission).
2. By import (you can export the empty entity to get the schema and format in which it accepts the data).

This data can be used in the Data Integration tool through the MDM components: tMDMInput to read from MDM and tMDMOutput to write data into MDM. MDM has components similar to any other database.


One can use the “Data Profiling” perspective of Talend to clean the data and optimize the usage of MDM.

Asha Bidiyasar

Install and setup Talend MDM

Master Data Management -MDM

Master data is the mostly consistent data spread across the various systems of an enterprise and possibly its partners. It typically contains an Id, Code, Description and other properties that define a particular domain. It is not quantitative data related to a single transaction; in simple words, it is not operational data.

MDM basically generates a key and manages the master data; it involves data cleaning, governance, tracking and control.

Installation and setup of Talend MDM:

Install the MDM server from the Talend website for the compatible version. This document covers installation of v6.0.0 of the MDM server. There are 2 options: 1. an executable jar, 2. a zip file to extract. Select either one and proceed.
There are 2 folders in it. One is the UI and the other is server.

  1. The UI is Talend Data Integration with the MDM perspective included. (TOS_MDM-Studio-20150702_1326-V6.0.0)
  2. The MDM server runs on Tomcat; it uses Apache Tomcat version 8.0.20. (TOS_MDM-Server-20150702_1326-V6.0.0)


Step1: If your system pops up a firewall permission request, grant it so the application can run.


Step2: Select the Language and click ok.


Step3: Just Click next to proceed


Step4: Accept the license and click next


Step5: Select both the server and the application if you do not have the Talend UI for MDM.


Step6: Give in the path to install the server.


Step7: Select the database type. The Community edition gives the option of the H2 embedded database; select that and click Next.


Step8: Select from the check box whether you want to use a single user for all the data containers or create an individual database for each. Then give the username and password to proceed.


Step9: Give in the directory for database.


Step10: Cross verify the paths and installation packs.


Step11: Click next and finally on Finish to complete the installation

Run the server by executing the “startup.bat” file from ……\TOS_MDM-Server-20150702_1326-V6.0.0\apache-tomcat-8.0.20\bin


Open it in a browser at localhost:8180/talendmdm/

There are 2 default users:

  • Administrator: username/password -> has all rights (read/write)
  • User: username/password -> read only

    Processing Multi schema files in Talend


    Sometimes we get input files which contain data in different schema formats within a single file. We can process these files to get the data schema-wise by using tFileInputMSDelimited.

    Raw Data:

    Here is a sample raw data file, in which 3 different schema types exist.


    If you observe the first column, it represents the type of that data:

    ORD means it is order details,

    CUS means it is customer details,

    and TRS means it is transactions.

    All these 3 types exist in the same file, but our requirement is to have each of these 3 types of data in a separate file.
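Outside Talend, the routing that tFileInputMSDelimited performs can be sketched in plain Java (the sample rows and the “;” separator are illustrative):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiSchemaSplit {
    public static void main(String[] args) {
        List<String> lines = List.of(
                "ORD;101;2015-01-03;450.00",
                "CUS;7;John;Hyderabad",
                "TRS;9001;101;CARD",
                "ORD;102;2015-01-04;120.50");

        // Route each line to a bucket keyed by its record-type column,
        // which is what tFileInputMSDelimited does with its fetched codes
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String line : lines) {
            String type = line.substring(0, line.indexOf(';'));
            buckets.computeIfAbsent(type, k -> new ArrayList<>()).add(line);
        }
        System.out.println(buckets.keySet());
    }
}
```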


    Create the below job to get separate files/outputs for each schema type

    In the palette you will find the tFileInputMSDelimited component; drag it onto the work area.

    Now open component tab to configure it as shown below:


    Click on the Multi Schema Editor and configure it as shown in the below screenshot:


    If the column separators are different for each schema you can enable Use Multiple Separators option.

    After getting the preview, click on Fetch Codes.


    Now take the 3 output components you need. I selected tLogRow components to display the result on the console.

    Execute the job:


    Now check the output you will get three different tables based on the schema type.



    Thanks and Regards,


    Handling Null values in Talend and Pentaho


    Handling Null values in Talend:

    Make all the columns nullable in tMap.

    Now, for the columns having null values, you can assign default values by writing a ternary operator as shown in the below screenshot.
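The tMap expression is plain Java, so the same ternary pattern can be sketched as follows (the column name and default value are illustrative):

```java
public class NullDefaultDemo {
    // Same pattern as the tMap expression:
    // row1.name == null ? "N/A" : row1.name
    static String withDefault(String value) {
        return value == null ? "N/A" : value;
    }

    public static void main(String[] args) {
        System.out.println(withDefault(null));
        System.out.println(withDefault("Acme"));
    }
}
```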


    Handling Null values in Pentaho:

    In Pentaho, to handle null values we have a component that provides a default value for them:

    ‘If field value is null’ is the component name.

    You can assign a default value for null fields, either per field or per datatype.

    You can configure the component as shown below:


    Thank you.


    Full Outer Join by using tmap in Talend


    In tMap, the default join model is Left Outer Join, and Inner Join is the other available join model. If we need the result of a full outer join, we can follow the process below.

    I have Customers.csv and City.csv as my two delimited files, where City is the common field in both files.

    First, please find the data in the two csv files as below:





    Following is the Talend Job :


    First create a FileDelimited metadata in the repository as shown in the above picture.

    Consider Customers as the main link and City as the lookup, and join both in the tMap component. In tMap, select Inner Join as the join model and catch the inner join output reject records, configuring as shown in the below screenshot:


    Now, as shown in the job image, take the same City and Customer inputs again, but make City the main link and Customers the lookup.

    In tMap2, select Left Outer Join as the join model and configure the component as shown below:


    Now take a tUnite component to combine the inner join reject values from tMap1 and the left outer join values from tMap2. Then take a tSortRow to sort the result in ascending/descending order (I used the custid column), and store the final result in fulljoin.csv.
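The same union idea can be sketched in plain Java: keep every row from one left outer join, then add the lookup rows that had no match; the union is a full outer join (the city names and values here are illustrative):

```java
import java.util.Map;
import java.util.TreeMap;

public class FullOuterJoinDemo {
    public static void main(String[] args) {
        Map<String, String> customers = Map.of("Hyd", "Ravi", "Pune", "Asha");
        Map<String, String> cities = Map.of("Hyd", "Telangana", "Delhi", "Delhi NCR");

        // Left outer join: every customer, with its city if one matches
        Map<String, String[]> full = new TreeMap<>();
        for (var e : customers.entrySet())
            full.put(e.getKey(), new String[] { e.getValue(), cities.get(e.getKey()) });

        // Add the unmatched lookup rows (what the second tMap contributes)
        for (var e : cities.entrySet())
            full.putIfAbsent(e.getKey(), new String[] { null, e.getValue() });

        System.out.println(full.keySet());
    }
}
```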

    And the output is as follows:


    Thanks and Regards,


    Creating Dynamic Drop Down List in Talend


    I created a sample job to retrieve the records of a particular table, selected by choosing the table name from a drop-down list.

    First we have to create tables and load the data into the database. We then fetch the table names of a particular database as a drop-down list, so that we can select a table name at run time.

    Note that whenever we use tJava components, they execute first, even within a sub-job. So I am using a tPrejob component to display the list of table names. Please find the screenshot of the entire job.



    In my database I have 3 tables, so I have to retrieve these table names as a list. To store these names we need one context variable, so create a context variable as shown below:


    I am using a MySQL database. To retrieve the table names I have taken a tMysqlInput component and configured the database name, port, username and password. In the query part, write the below query:

    "select Table_name from information_schema.`TABLES` where Table_schema='<databasename>'"


    Take a tJavaFlex component and connect it to tMysqlInput using a ‘row’ link. tJavaFlex executes its main code in a loop, once for each record of its input flow.

    First we create an array-list variable in the Start code section. Then, in the Main code section, we add each incoming row (table name) to this list. Finally, in the End code section, we convert the list to an array and create the drop-down list.
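The Start and Main code sections described above might look like this, shown as a self-contained sketch since `row1` and the real input flow exist only inside the job (the table names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class StartMainSketch {
    public static void main(String[] args) {
        // Start code section: runs once, before the first row
        List<String> tables_list = new ArrayList<>();

        // Main code section: runs once per row of the input flow;
        // inside the job this would be tables_list.add(row1.Table_name)
        for (String tableName : new String[] { "categories", "customers", "orders" }) {
            tables_list.add(tableName);
        }

        System.out.println(tables_list.size());
    }
}
```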


    End Code:

    // Convert the collected list to an array for the drop-down
    String[] TableNames = tables_list.toArray(new String[0]);
    JFrame frame = new JFrame("Input Dialog Example");
    String returnValue = (String) JOptionPane.showInputDialog(frame,
        "Select the table name you want to query?", "Tables Information",
        JOptionPane.QUESTION_MESSAGE, null, TableNames, TableNames[0]);
    context.tables = returnValue;
    if (context.tables != null)
        System.out.println("selected table: " + context.tables);
    // On clicking the Cancel button in the drop-down box it shows this message
    else
        System.out.println("please select table name");

    In the Advanced settings of tJavaFlex, import the required Java libraries as shown below:


    The pre-job is complete.

    Main Job:

    After selecting a table name from the drop-down list, the sub-job whose condition matches that table becomes true, and it retrieves the records from the selected table.

    In case we do not select a table from the drop-down list (by clicking the Cancel button), the job will throw a null pointer exception. To avoid that, guard against the null in every tJava component of the sub-jobs, for example:

    if (context.tables == null)
        context.tables = "";

    Connect the tMysqlInput component to tJava using a RunIf condition, and write the below code in the RunIf condition:

    context.tables.contains("categories")

    Here, categories is one of my table names. If we select this table in the drop-down, the condition becomes true and the sub-job executes; otherwise it is false.

    Now configure the tMysqlInput component with the port, username, database name, password and table name. The selected table name is stored in the context variable, so you can use the context variable in place of the table name in the select query. Please find the below screenshot:


    And connect tLogRow to display the result on the console.

    Configure the remaining sub-jobs also in the same way.

    Now execute the job; it will display the table names in the drop-down list.


    If you click on Cancel, the console will display the message “please select table name”.


    If we select one of the table names from the list, the matching condition becomes true, the corresponding sub-job executes, and the records of that table are displayed on the console. Here I selected the Customer table from the drop-down, so it displays the records of the customer table:




    Types of CDC supported by Talend Enterprise Edition



    CDC (Change Data Capture ) :

    The CDC architecture is based on the publisher/subscriber model.

    The publisher captures the change data and makes it available to the subscribers. The subscribers utilize the change data obtained from the publisher.

    The main tasks performed by the publisher are:

    -Identifying the source tables from which the change data needs to be captured.

    -Capturing the change data and storing it in specially created change tables.

    Following are the two different types of CDC modes available in Talend Enterprise Edition.

    1. Trigger mode
    2. Redo/Archive log mode

    Trigger mode :

    In this mode we need to set up the CDC feature in Talend Studio so that it creates the required triggers on the selected database.

    Following are the two major steps required to establish the CDC environment on Talend Studio:

    1. Configure CDC in Trigger mode

    2. Extract the change data modified in Trigger mode

    Supported databases for CDC under Trigger mode :

    The CDC feature is available only for DB2, Informix, Ingres, MS SQL Server, MySQL, PostgreSQL, Oracle and Sybase.

    Talend provides a different CDC component for each supported database.
    Some of them are listed below:

    tMySQLCDC, tMSSQLCDC, tDB2CDC, tInformixCDC, tOracleCDC, tPostgreSQLCDC and tSybaseCDC

    Redo/Archive log mode :

    The Oracle Redo/Archive log mode used in Talend is equivalent to the archive log mode of Oracle.

    The CDC feature in this mode for Oracle is available for the 10g version of the Oracle database and later versions. Before being able to use CDC in Redo/Archive log mode in Talend Studio, the administrator of the database to be supervised should do the following:

    1. Activate the archive log mode in the Oracle database.

    2. Set up CDC in the Oracle database.

    3. Create and give all rights to the source user.

    4. Create and give all rights to the publisher.

    However, if you want to use CDC in Redo/Archive log mode for an Oracle database, you must first configure the database so that it generates the redo records that hold all insert, update or delete changes made in the datafiles.

    Supported databases for CDC under Redo/Archive log mode :

    The CDC feature in this mode is available only for Oracle and AS400.

    The Talend components for this mode are tAS400CDC and tOracleCDC.



    Scheduling Talend Job on both Windows and Linux OS


    In order to schedule a Talend job, first we have to export it with Build Job, which generates the runnable files to execute on the required OS. All the child jobs in the project are also converted to jar files.

    Right click on the parent/main job of your project and select Build Job option.

    It will prompt a window; the build will create a zip file.

    Choose the path, and the build type should be Autonomous. Also enable the option Extract the zip file; it will unzip the zip file in the same path that you have given.

    Enable Shell Launcher and Context scripts. For Shell Launcher choose All, to generate runnable files for both Windows and Linux: a batch file for Windows OS and a shell file for Linux.


    Schedule Talend Job on Windows OS:

    On Windows OS, we can schedule the Talend job by using the Windows Task Scheduler.

    On the right side, under Actions, click on Create Basic Task. Then select when the task should run, as shown in the below image:


    You can also select exact time to run:



    Now select ‘Start a Program’ option under Action



    Now browse to the batch file; you can also add any context parameters if needed.




    Now click on Finish. Task will be scheduled for the selected time.


    Schedule Talend Job on Linux OS:

    On Linux OS we use crontab to schedule tasks.

    Type the below command to open crontab file:

    crontab -e

    And add the job which needs to be scheduled, as shown in the below image:
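For example, an entry like the following would run the exported shell launcher every day at 02:00 (the script path is illustrative; use the path where your build was extracted):

```shell
0 2 * * * /opt/talend/jobs/MainJob/MainJob_run.sh >> /tmp/MainJob.log 2>&1
```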


    Hope this post is useful as a reference when scheduling Talend jobs.