Python – Get data from CSV and create chart

Python – Get data from CSV and create chart

Hello everyone, this blog shows how we can import data from csv and plot as mesg grid chart in python libraries, through the help of Pycharm IDE (Community Version)

Prerequisite :

The following are the steps for creating mesh-grid in pycharm :
1. Installing NumPy,Pandas and Matplotlib libraries in Pycharm IDE :

  • 1. Create a scratch file in Pycharm.
  • 2. Click on “File” and goto settings.
  • 3. Goto “Project Interpreter” and click on ‘+’ symbol to add new libraries.
  • NumPy

2. Writing python code for importing csv data and creating meshgrid :

  • 1. Create a csv file containing these data “-5,5,0.01”.
  • 2. Use this python code in your scratch file .
  • import csv
    import numpy as np
    import matplotlib.pyplot as plt

    point = ""
    with open('C:/Users/Helical/Desktop/Blog/Python/MeshGrid.csv','r') as csvfile:
    meshreader = csv.reader(csvfile, delimiter=',')
    for row in meshreader:
    point = point + ', '.join(row)
    meshpoints = point.split(',')
    points = np.arange(float(meshpoints[0]),float(meshpoints[1]),float(meshpoints[2]))
    dx, dy = np.meshgrid(points,points)
    z = (np.sin(dx)+np.sin(dy))
    plt.title('plot for sin(x)+sin(y)')

  • If you are able to sucessfully run the code, then you will generate mesh grid in an new Pycharm window.
  • GraphMatplotLib

Loops in Pentaho Data Integration

Loops in PDI


If only there was a Loop Component in PDI *sigh*.

But we can achieve Looping Easily with the Help of few PDI Components. Let’s take a requirement of having to send mails. Now if the mails id and files to be sent where fixed we could just use a property file and send the mail id’s as a parameter.

but here mail ids will be coming in from a Table source and then mail id’s can never be the same.
So how did i approach this?

let’s say mail id’s are from Table: mail_address so now my aim here is to first add a row number to each record line and a Maxcount of all the records at each line, *Keep In Mind of the Maxcount*

and then pass it as a filter clause

where row_number = 1

now my assumption is the query should execute and return one row and then the column values like sender mail, receiver mail will be set into a variable and then passed to a Mail Transformation Component;

But how can we increase the value of the Where clause i.e how can we make

where row_number = 1 to

where row_number = 2

where row_number = 3 and so on.


So here is were Looping is required tp we take a variable and increment the value.

1. Take a Set Variable at the Job Level [Initialize Loop] and then set a variable loop and assign the value to your initial value as shown below: In My case loop value = 1



Now next take a Transformation to get the variables and set the variables as shown below:

resultop1Now in my scenario I have a Modified JS because few conversions were required hence after that set the variable.

Next We need Transformation where we pass our variable of loop into our Query
as shown below



SELECT *,ROW_NUMBER() OVER (Order by createfolderpath) as row_num,COUNT(dummy)           OVER (partition by dummy) as maxcount FROM (SELECT * FROM
SELECT DISTINCT replace(createfolderpath ,’/’,’\’) as createfolderpath,1 as dummy
FROM reportpaths
Cross JOIN [TEST_SLSDMBKP].[dbo].[job_mailing_address]
scheduled_flag = ‘Y’
row_num = ${loop}
ORDER BY row_num,receivers_email,createfolderpath

Now here in my table input, 1 row will be the output so hence I pass the required columns into the Modified JS and then Set them as Variables.


Next at our Job Level we use a JavaScript Component to increment the loop value to 2 as shown above.

Next at our Job Level we add a new transformation which does our requirement of Sending Mails as shown below:


And we pass all the required variables into the Mail Component as shown below:resultop8

RECEVIERSEMAIL and all are values from our DB we Set as Variables.

After that we go to our JOB Level and we use a Simple Evaluation Component

To stop the Loop after a particular condition


Now ${MAXCNT} is the  Maxcount of all the records at each line from our Query we sent as variable

Now once loop number is 6 and Maxcount is  5 then it stops the loop and ends else

It continues and goes to the next Component which is Wait Component which in a previous blog it was specified the importance of a WAIT Component.

So that’s it we went through looping in PDI.

Hence if we have any other approach, please feel free to drop it in the comment below

Sohail Izebhijie

Analytic Query in Pentaho Data Integration

Analytic Query in Pentaho Data Integration [LAG & LEAD]

The could be a time we get a requirement as follows:

Fetch Every Purchases Sum for each Customer in a year, having the following columns

customerid, customername, productcost, purchasemonth

Now this looks easy, but the tricky part would be fetching the


he/she made a purchase

If your using a PostgresDB then this will be easy because there are various windows functions that will help you in acquiring your result with ease

But if your DB doesn’t support Windows Functions then it could be a problem.

So now Pentaho will help you reach your result in very less hustle with the help of the Analytic Query Component

Now lets build our query for one customer

sf97.customer_id as CustomerID,
co.fullname as CustomerName,
SUM(sf97.store_sales) as ProductCost,
tbd.the_month as PurchaseMonth
from sales_fact_1997 sf97
JOIN time_by_day tbd on tbd.time_id = sf97.time_id
JOIN customer co on co.customer_id = sf97.customer_id
WHERE sf97.customer_id = 4873

ORDER BY sf97.customer_id,tbd.month_of_year

Now we Open Up our PDI,  Create a new Transformation,

  • Select Table Input as your Source Table
  • Next Select the Analytic Query


  • Open up Analytic query
  • Now from our requirement we want to fetch the PreviousPurchaseMonth he/she made a purchase for each customer (What month he/she previously made a purchase)
    so the unique item here is the customerID
  • Then in the Group Field Select the customerid
  • In the Analytic Functions
    Enter the name of your new field
    Subject will be the purchasemonth Column because our new column is dependent on that

Type: This can be Lag (Behind) or Lead (Forward)
Click ok

Now right click on Analytic Query Component and Select Preview -> Quick Lunch


There you go.

Note make Sure in your query you have and order by based on your requirement.

Ehizogie Sohail Izebhijie


Cassandra Installation and Cluster Setup on Ubuntu

Cassandra Installation and Cluster Setup

Cassandra Introduction

Apache Cassandra is a free and open-source distributed NoSQL database management system. It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Cassandra prerequisites

Minimum hardware requirements:

CPU: 2 cores


Software requirements:

Java 8

Installation and Cluster Setup Steps

To setup a cassandra cluster, we need to install cassandra on each server which will be the part of the cluster.

Here, we are setting up 4 node cassandra cluster.

Suppose we are having 4 nodes, namely,





Note: In order to set up a cluster, perform the following steps on each involved server.


  1. Install Java 8 on each server using following commands on the terminal. (Optional if already installed)
    1. To make the Open JDK package available, you’ll have to add a Personal Package Archives (PPA) using this command.
      sudo add-apt-repository ppa:openjdk-r/ppa
    2. Update the package database.
      sudo apt-get update
    3. Now install the Java 8.
      sudo apt-get install openjdk-8-jdk
  2. Install cassandra on each servers using the following list of commands on the terminal. Here we are installing Cassandra version 3.10.
    1. echo "deb 310x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list

      Note 310x means we are installing cassandra of version 3.10. You can change this value if you want to install different version of Cassandra.

    2. Add the Apache Cassandra repository keys using following command.
      curl | sudo apt-key add -
    3. Update the package database using following command.
      sudo apt-get update

      Note If you encounter any public key related error, such as “The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY A278B781FE4B2BDA”.

      Then add the public key A278B781FE4B2BDA as follows:

      sudo apt-key adv –keyserver –recv-key A278B781FE4B2BDA

      Now repeat ‘sudo apt-get update‘. The actual key may be different, you get it from the error message itself.

    4. Now install the Cassandra using following command.
      sudo apt-get install cassandra

      By now, you have installed cassandra on each involved server.
      Go through the following steps to setup cluster.

Cluster Setup Steps

  1. Stop cassandra service on each server (XXX.XX.XX.1,XXX.XX.XX.2,XXX.XX.XX.3,XXX.XX.XX.4) using following command.
    sudo service cassandra stop
  2. Delete default cassandra data from each server located in data directory of cassandra using following command.
    sudo rm -rf /var/lib/cassandra/data/system/*
  3. Now open the cassandra.yaml file located in /etc/cassandra directory on each server one after another to configure cluster.
    Note Following are the parameters from cassandra.yaml file that needs to be configured in order to setup cluster.
    cluster_name : You change the cluster name by setting this parameter. It is optional.
    -seeds : This is a comma-delimited list of the IP address of each node in the cluster.
    listen_address : This is IP address that other nodes in the cluster will use to connect to this one. It defaults to localhost and needs changed to the IP address of the node.
    rpc_address : This is the IP address for remote procedure calls. It defaults to localhost. If the server’s hostname is properly configured, leave this as is. Otherwise, change to server’s IP address or the loopback address (
    endpoint_snitch : Name of the snitch, which is what tells Cassandra about what its network looks like. This defaults to SimpleSnitch, which is used for networks in one datacenter. In our case, we’ll change it to GossipingPropertyFileSnitch, which is preferred for production setups.
    auto_bootstrap : This directive is not in the configuration file, so it has to be added and set to false. This makes new nodes automatically use the right data. It is optional if you’re adding nodes to an existing cluster, but required when you’re initializing a fresh cluster, that is, one with no data.

    1. Open cassandra.yaml on XXX.XX.XX.1 node.
      vi /etc/cassandra/cassandra.yaml

      Search and set the parameters as follows,

      cluster_name: ‘Helical’
       - seeds: "XXX.XX.XX.1,XXX.XX.XX.2,XXX.XX.XX.3,XXX.XX.XX.4"
       listen_address: XXX.XX.XX.1
       rpc_address: XXX.XX.XX.1
       endpoint_snitch: GossipingPropertyFileSnitch
       auto_bootstrap: true
    2. Open cassandra.yaml on XXX.XX.XX.2 node.
      vi /etc/cassandra/cassandra.yaml

      Search and set the parameters as follows,

      cluster_name: ‘Helical’
       - seeds: "XXX.XX.XX.2,XXX.XX.XX.1,XXX.XX.XX.3,XXX.XX.XX.4"
       listen_address: XXX.XX.XX.2
       rpc_address: XXX.XX.XX.2
       endpoint_snitch: GossipingPropertyFileSnitch
       auto_bootstrap: true
    3. Open cassandra.yaml on XXX.XX.XX.3 node.
      vi /etc/cassandra/cassandra.yaml

      Search and set the parameters as follows,

      cluster_name: ‘Helical’
       - seeds: "XXX.XX.XX.3,XXX.XX.XX.1,XXX.XX.XX.2,XXX.XX.XX.4"
       listen_address: XXX.XX.XX.3
       rpc_address: XXX.XX.XX.3
       endpoint_snitch: GossipingPropertyFileSnitch
       auto_bootstrap: true
    4. Open cassandra.yaml on XXX.XX.XX.4 node.
      vi /etc/cassandra/cassandra.yaml

      Search and set the parameters as follows,

      cluster_name: ‘Helical’
       - seeds: "XXX.XX.XX.4,XXX.XX.XX.1,XXX.XX.XX.2,XXX.XX.XX.3"
       listen_address: XXX.XX.XX.4
       rpc_address: XXX.XX.XX.4
       endpoint_snitch: GossipingPropertyFileSnitch
       auto_bootstrap: true
  4. Start the Cassandra daemon on each server on each server.
    sudo service cassandra start

    Note You can check the status of cassandra using following command.

    sudo service cassandra status
  5. Check the status of the cassandra cluster using nodetool utility command. Fire command on one node say XXX.XX.XX.1 as follows,
    sudo nodetool status

    You will get output like,

    Datacenter: DC1
    |/ State=Normal/Leaving/Joining/Moving
    -- Address     Load       Tokens Owns (effective) Host ID                              Rack
    UN XXX.XX.XX.1 30.94 GiB  256    100.0%           ae45c3c5-30e3-4d60-9e5c-3f0004bbfea4 rack1

    Note You’ll find that only the local node is listed, because it’s not yet able to communicate with the other nodes.

  6. Allow communication by opening the following network ports on each node using the following commands.
    sudo ufw allow 7000

    7000 is the TCP port for commands and data.

    sudo ufw allow 9042

    9042 is the TCP port for the native transport server. cqlsh, the Cassandra command line utility, will connect to the cluster through this port.
    Note You can change the port number as required by changing it in cassandra.yaml
    storage_port: 7000
    native_transport_port: 9042

  7. Now Check the status of the cassandra cluster again.
    sudo nodetool status

    Output would be like this,

    Datacenter: dc1
    |/ State=Normal/Leaving/Joining/Moving
    -- Address     Load       Tokens Owns (effective) Host ID                              Rack
    UN XXX.XX.XX.3 31.3 GiB   256    100.0%           20412cb7-b9bf-4373-bf6f-d0958a993c95 rack1
    UN XXX.XX.XX.4 30.78 GiB  256    100.0%           a49ee892-f816-4e0d-84b4-3b3bf1421f33 rack1
    UN XXX.XX.XX.1 30.94 GiB  256    100.0%           ae45c3c5-30e3-4d60-9e5c-3f0004bbfea4 rack1
    UN XXX.XX.XX.2 32.5 GiB   256    100.0%           f9084fd5-06f4-4408-ba90-32c9afd60ede rack1

You can see that all the nodes are up and normal.

In this way, you can install and setup cassandra cluster.


In this blog, we have discussed how can we install and setup the cassandra cluster.



Neo4j : Graph Database

Neo4j : Graph Database

  • NoSQL graph database
  • Usecases : include matchmaking, network management, software analytics, scientific research, routing, organizational and project management, recommendations, social networks, and more.
  • Neo4j implements the Property Graph Model efficiently down to the storage level.
  • Neo4j provides full database characteristics including ACID transaction compliance, cluster support, and runtime failover, making it suitable to use graph data in production scenarios.
  • Used for connected DB
  • Easy traversing of data( Ex.Google+ , Facebook linkedin etc)
  • These applications contain lots of structured, semi-structured and unstructured connected data. It is not easy to represent this kind of unstructured connected data in RDBMS Databases.
  • Graph DBMS stores this kind of more connected data very easily. It stores each Profile data as a Node internally, it’s connected Nodes with adjacent nodes and they connect each other with Relationships.


Property Graph Model in Graph Database

  • The property graph contains connected entities (the nodes) which can hold any number of attributes (key-value-pairs). Nodes can be tagged with labels representing their different roles in your domain. In addition to contextualizing node and relationship properties, labels may also serve to attach metadata—index or constraint information—to certain nodes.
  • Relationships provide directed, named semantically relevant connections between two node-entities. A relationship always has a direction, a type, a start node, and an end node.
  • Relationships have quantitative properties, such as weights, costs, distances, ratings, time intervals, or strengths.


Graph entities in Neo4j :

Nodes              : Acts as row

Label                : Acts as table

Relationships     : Acts as joins

Properties          : Acts as column name

Note : Nodes are directed but relationships can be navigated.


Properties of Graph database:

  • No broken links
  • A relationship always has a start and end node, you can’t delete a node without also deleting its associated relationships.
  • You can also always assume that an existing relationship will never point to a non-existing endpoint.


Installation steps of neo4j  for windows:

  • Download .exe or .zip file.
  • Install Neo4j from official website.
  • Run its exe file.
  • Set DB location for neo4j and start server.
  • You can access neo4j in browser using link : http://localhost:7474/browser/ with default
  •  User and Password:neo4j


Note : You need to set new password and then again you can connect to Neo4j.


To install neo4j  on ubuntu refer below links :


Following are some Cipher Query Language syntax used in Neo4j :



Following are some neo4j cypher queries compared with mysql queries for better understanding :


                                                                 Figure : 2

Hope you will get how cypher queries differs from mysql.



Thanks ,



Change Hive metastore from derby to MySQL


Change Hive metastore from derby to MySQL


Machine : UBUNTU-14.04 |  Hive :  HIve 1.2.1

To change Hive Metastore from Derby to MySQL we need to follow these 8 simple steps,


Step-1 :

First we need to install Mysql and its dependencies on system.

Command-1 : sudo apt-get install mysql-server

Note: Click Next > Next and set the password for MySQL.

Command-2 :  sudo apt-get install libmysql-java


Step-2 :

Create soft-link for connector in Hive lib directory.

Command-1 : ln -s /usr/share/java/mysql-connector-java.jar $HIVE_HOME/lib/mysql-connector-java.jar


Step-3 :

Access your MySQL and create one new database metastore for hive,

Command : $ mysql -u root -p

Enter password:

mysql> CREATE DATABASE metastore;


Step-4 :

Then we need one MySQL account for Hive to access the metastore. It is very important to prevent this user account from any type of change in schema.

Command :

mysql> CREATE USER ‘hive’@’%’ IDENTIFIED BY ‘password’;

mysql> GRANT all on *.* to ‘hive’@localhost identified by ‘password’;

mysql>  flush privileges;


Step-5 :

Now we need to configure Hive to access MySQL metastore, for this we need to update hive-site.xml file(If file does not exist then create a new one)





<description>metadata is stored in a MySQL server</description>





<description>MySQL JDBC driver class</description>





<description>user name for connecting to mysql server</description>





<description>password for connecting to mysql server</description>




Note: While updating please take all properties-tag only.


Step-6 :

Now we need to run the Hive schematool to initialization MySQL metastore.

For this we need to go to $HIVE_HOME>bin> folder

Command-1 : schematool -initSchema -dbType mysql

Note : When you have found that your metastore is corrupted, then we need to update metastore.

  • Before you run hive for the first time, run

Command : schematool -initSchema -dbType mysql

  • If you already ran hive and then tried to initSchema and if it’s failing:

Command : mv metastore_db metastore_db.tmp

You find your metasore_db file at $HIVE_HOME location.

  • Re run


Step-7 :

Start your Hive and access your tables.


Step-8 :

To validate it ,

Connect and open your hive


Then create a table in it and insert one record.

hive> create table saurzcode(id int, name string);

hive> insert into saurzcode(1, “Helical”);

Later access your MySQL and open metastore database

mysql -u root -p

Enter password:

mysql> use metastore;

And see your table as a record in TBLS table of metastore database.

mysql> show tables ;

mysql> select * from TBLS;


Hurray, Completed !!!!


Thanks for visiting , Have a great day.



Foreign Data Wrappers in PostgreSQL

Foreign Data Wrappers in PostgreSQL


Foreign Data Wrapper (FDW) is new concept which introduced in PostgreSQL 9.1. It allows us to access external non-Postgres data in the form of regular Postgres table. FDW allow us to make external data sources (text files, web services, etc.) look like tables and have easy access to them using SQL commands.

FDW is a C library that acts like an adapter. Each foreign data wrapper is responsible for interacting with a particular external data. For example, it can be used in Finance Department reports as a CSV that you need to reconcile to account or in any data migration.

Uses :

1) Urgent Data Migration

2) Data  Analysis

3) Testing Purpose

There are various uses for FDW and in many Cases. Below are some examples:

Urgent Data Migration

Imagine that you have a project which uses a database that costs a lot of money. At some time in the project life-cycle it may happen that the most expensive features that the database provides aren’t used and there is no need to put a lot of money into something that can be replaced by an open source database like PostgreSQL. You can use an appropriate FDW to migrate your data from your costly external DB to Postgres.

Data Analysis

You may find yourself in a situation where you have to perform some database analytics  for an external partner from PostgreSQL. Instead of writing scripts to update the data, We can use FDW, you can set up foreign tables and keep them updated. And since the new version for PostgreSQL 9.3 provides write capability, FDW for MySQL is write-able.

Testing Purpose

If we don’t want to affect real data we can retrieve the data and manage it locally without interaction in the main database. This can be useful in performing tests.

To use the FDW there’s four basic things you’ll want to do:

Create Extension

Create the Remote server

Create a user mapping for the Remote server

Create your foreign tables

Start querying on data

Example : Here is an example of creating a foreign table with postgres_fdw.

Create Extension  First install the extension:

CREATE EXTENSION postgres_fdw;

Create the Foreign server

Then create a foreign server using CREATE SERVER. In this example we wish to connect to a PostgreSQL server on host listening on port 5432. The database to which the connection is made is named foreign_db on the remote server:

CREATE SERVER foreign_server


OPTIONS (host ‘’, port ‘5432’, dbname ‘foreign_db’);

Create a user mapping for the Foreign server

A user mapping, defined with CREATE USER MAPPING, is needed as well to identify the role that will be used on the remote server:


SERVER foreign_server

OPTIONS (user ‘foreign_user’, password ‘password’);

Create your foreign tables

Now it is possible to create a foreign table with CREATE FOREIGN TABLE. In this example we wish to access the table named some_schema.some_table on the remote server. The local name for it will be foreign_table:

CREATE FOREIGN TABLE foreign_table (

id integer NOT NULL,

data text )

SERVER foreign_server

OPTIONS (schema_name ‘some_schema’, table_name ‘some_table’);

Start querying on data

SELECT * FROM foreign_table LIMIT 5;


Sample Implementation :



— Foreign Server: wserver

— DROP SERVER wserver



OPTIONS (hostaddr ‘’,dbname ‘Basket_one’,port ‘5434’);

ALTER SERVER backupserver OWNER TO Sachin;

— User Mappings


OPTIONS (username ‘foo’, password ‘bar’);

Create a table to hold the data in PostgreSQL as foreign table


Employee_id int,

Employee_name text )

SERVER wserver

OPTIONS (dbname ‘Senteno’, table_name ‘Employee’ );


Thanks & Regards

A Prasanth Kumar



Lateral join is a very incredible feature in Postgres (Postgres 9.3+) .The LATERAL key word can precede a sub-SELECT FROM item.  It allows the sub-SELECT to point to the  columns of FROM items that appear before it in the FROM list.  We can say it’s used for cross-referencing.It basically allows sub queries in the right-hand-side of the join to reference columns defined in the left-hand-side of the join.

It can be explained from a simple example :

Suppose there is a business scenario where you want to just check whether a set of people belong in one subclass or not. And accordingly we have do the manipulations in the outer query.

Select * from
(Select  empid , case when  x.flag = ‘1’then ‘YES’ else ‘NO’ end as check from getempdata (‘04/03/2017’))base   —– getempdata is the function

left join lateral

(select case when pbe.benefitsubclass in (‘6Z’,’6Y’) then ‘1’ end as flag , pbe.personid , pbe.benefitsubclass,pbe.monthlyamount as amt1, pbe.monthlyemployeramount as amt2

from person_bene_election pbe

where  base.empid = pbe.personid

and current_timestamp between pbe.createts and pbe.endts

and to_date(’04/03/2017′, ‘mm/dd/yyyy’) between pbe.effectivedate and  pbe.enddate

and pbe.benefitelection in (‘E’,’T’)

and pbe.benefitsubclass in (‘6Z’)

)x on x.personid = base.empid

Left join lateral allows us to use the empid from the function in left side of the lateral join to be used for reference in the right side.
Also with the left join lateral returns multiple rows , and also the flag column used in select  of lateral join is  used in the main FROM CLAUSE.

In a gist, lateral  allows you to perform a sub-query in the FROM clause and reference column values from other records in the query. It is similar to a correlated subquery in that it can reference values from another query.

It has some advantages over the the correlated subquery , it can return multiple rows and can be used in the From clause.


Java Naming and Directory Interface (JNDI)

Using JNDI, applications running in distributed environment, helps to look up services independently.

How JNDI is useful?

  1.  During deployment sequence, the same JNDI name can be used in different servers.
  2.  It helps to provide security of credential leakage to the user groups like developer, testers, stagers etc, because the credentials are configured in the server not in the application.
  3.  Connection Pooling can be achieved

How to use JNDI?
In order to use, JNDI classes and service providers should be there
The service provider may be :

  1. Lightweight Directory Access Protocol (LDAP)
  2. Common Object Request Broker Architecture (CORBA)
  3. Common Object Services (COS) name service
  4. Java Remote Method Invocation (RMI) RegistryThe objects needs to be created and registered so that , they can be further used for lookup.

How to configure JNDI in tomcat?
In tomcat environment we can configure JNDI in 3 different places
1. Inside the application using context.xml in META-INF directory.
2. Inside the tomcat server context.xml : Inside the apache-tomcat/conf directory we can configure. This can be helpful when multiple application needs the same database.
3. Both in server.xml and context.xml

NB: The database driver jar should be present in the tomcat lib directory

<Resource name="jdbc/lookupName" 


In order to use it in java program we have to use the followng
DataSource ds = (DataSource) ctx.lookup(“java:/comp/env/jdbc/lookupName”);


Loading CSV File BatchWise -Talend

Loading CSV File BatchWise -Talend

In a ETL-Job when the source is a flat file for example a CSV File and size of the file is large. To load large files your job has to read the whole file at a time. It puts load on the server which runs the job. Instead of reading the whole file, reading it in smaller batches and processing them is a better option. Following job explains how this is done using a java program which reads the CSV file with buffered reader and how they are processed later from the folder where the smaller-chunks file are present.

Following is the Java code which reads a file line wise according to the batch size as in how many lines.

package routines;


public class splitfile {

private File file;
private int numberOfLinesThreshold;
private String encoding;
private int readerBufferSize;
private String destinationFileName;

public splitfile(File file, int numberOfLinesThreshold, String encoding,
int readerBufferSize, String destinationFileName) {
this.file = file;
this.numberOfLinesThreshold = numberOfLinesThreshold;
this.encoding = encoding;
this.readerBufferSize = readerBufferSize;
this.destinationFileName = destinationFileName;

public void splitFile() {
String absolutePath = this.file.getAbsolutePath();
int index = absolutePath.lastIndexOf(“.”);

String path = absolutePath.substring(0, index);
String extension = absolutePath.substring(index + 1);

// Store the header for each file
String header = “”;
int fileNameCounter = 1;
int counter = 0;
StringBuilder stringBuilder = new StringBuilder();
String currentLine;
String lineSeparator = System.lineSeparator();

String fileName = path + “.” + extension;

try (BufferedReader bufferedReader = new BufferedReader(
new InputStreamReader(new FileInputStream(fileName),
this.encoding), this.readerBufferSize)) {
while ((currentLine = bufferedReader.readLine()) != null) {
if (“”.equals(header)) {
header = currentLine;


if (counter == this.numberOfLinesThreshold) {
// Clear counter
counter = 0;

// Utilize the existing string builder
writeFile(fileNameCounter, stringBuilder,
this.destinationFileName, extension);

// For next file name

// For next file content
stringBuilder = new StringBuilder();
// As you have already added one line to the file
} else {
// Write the last few remaining lines as another file
writeFile(fileNameCounter, stringBuilder, this.destinationFileName,
} catch (IOException ex) {

private void writeFile(int fileNameCounter, StringBuilder stringBuilder,
String fileName, String extension) {
// Create a file and write the string builder
String name = getFileName(fileNameCounter, fileName, extension);
try (BufferedWriter bufferedWriter = new BufferedWriter(
new OutputStreamWriter(new FileOutputStream(name),
this.encoding))) {
} catch (IOException ex) {

private String getFileName(int fileNameCounter, String fileName,
String extension) {
return fileName + “_” + fileNameCounter + “.” + extension;


Following is the main job which calls the child job in side the job and passes the path which has smaller chunk files.


Step 1: Create Split Files folder

Creates a Folder in order to place all the splitted files. And writes the folder path to the context variable “Src_Split”. This folder is created in the “Src_Dir” and with name Split_(File Name).


Step 2: Split File

Calls a Routine “splitfile” which reads the File in parts with the Help of a Buffered Reader. It takes 5 parameters.

1. Pass a File instance with an escape for the path where the source file is located,

2. The BatchSize which is assigned to context variable “BatchSize”.

3. Standard codes format of the file to the contextVariables “Enco4.des”.

4. Buffer Size to the context Variable “BufferSize”.

5. Destination Folder which is created is passed here followed by- Escape sequence in order to handle “\” and the file name. “context.Src_Split+context.Src_Csv”. (Src_Csv is just the name of the file with No extension.)


Step 3 Split Folder

Give the Directory As “context.Src_Split” and File Mask as “*.csv”

Step 4 CSV2Stg_Stg_Dim_Organization

“Calls the Job which reads the files and processes them to load Stg_Dim_Organization. Just set 2 context variables .
1. Src_Folder to “”((String)globalMap.get(“”tFileList_1_CURRENT_FILEPATH””))””

2. Src_Csv to “”((String)globalMap.get(“”tFileList_1_CURRENT_FILE””))””.”


Step 5 CSV2Stg_Stg_Dim_Department

“Calls the Job which reads the files and processes them to load Stg_Dim_Department. Just set 2 context variables .

1. Src_Folder to “”((String)globalMap.get(“”tFileList_1_CURRENT_FILEPATH””))””

2. Src_Csv to “”((String)globalMap.get(“”tFileList_1_CURRENT_FILE””))””.”