Neo4j : Graph Database

  • NoSQL graph database
  • Use cases include matchmaking, network management, software analytics, scientific research, routing, organizational and project management, recommendations, social networks, and more.
  • Neo4j implements the Property Graph Model efficiently down to the storage level.
  • Neo4j provides full database characteristics including ACID transaction compliance, cluster support, and runtime failover, making it suitable to use graph data in production scenarios.
  • Designed for highly connected data.
  • Easy traversal of data (e.g. Google+, Facebook, LinkedIn).
  • These applications contain lots of structured, semi-structured and unstructured connected data. It is not easy to represent this kind of unstructured connected data in RDBMS databases.
  • A graph DBMS stores this kind of highly connected data very easily. It stores each profile's data as a node internally; nodes are connected to adjacent nodes through relationships.


Property Graph Model in Graph Database

  • The property graph contains connected entities (the nodes) which can hold any number of attributes (key-value-pairs). Nodes can be tagged with labels representing their different roles in your domain. In addition to contextualizing node and relationship properties, labels may also serve to attach metadata—index or constraint information—to certain nodes.
  • Relationships provide directed, named semantically relevant connections between two node-entities. A relationship always has a direction, a type, a start node, and an end node.
  • Relationships can also have quantitative properties, such as weights, costs, distances, ratings, time intervals, or strengths.
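The model above can be sketched in Cypher; the labels, properties, and relationship type below are invented for illustration:

```cypher
// Two labeled nodes with key-value properties, and a directed,
// typed relationship that carries its own properties.
CREATE (a:Person {name: 'Alice', age: 30}),
       (b:Person {name: 'Bob'}),
       (a)-[:FRIENDS_WITH {since: 2015, strength: 0.9}]->(b)
```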


Graph entities in Neo4j :

Nodes             : act as rows

Labels            : act as tables

Relationships     : act as joins

Properties        : act as column names

Note : Relationships are directed, but they can be traversed in either direction.


Properties of Graph database:

  • No broken links
  • Because a relationship always has a start and an end node, you can't delete a node without also deleting its associated relationships.
  • You can also always assume that an existing relationship will never point to a non-existing endpoint.
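This rule shows up directly in Cypher: a plain DELETE fails while the node still has relationships, so they must be removed together (a sketch; the Person label and name property are invented):

```cypher
// DELETE alone fails if the node still has relationships
MATCH (p:Person {name: 'Alice'})
DELETE p

// DETACH DELETE removes the node together with its relationships
MATCH (p:Person {name: 'Alice'})
DETACH DELETE p
```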


Installation steps for Neo4j on Windows:

  • Download the .exe or .zip file from the official website.
  • Run the installer.
  • Set the DB location for Neo4j and start the server.
  • You can access Neo4j in the browser using the link http://localhost:7474/browser/ with the default user and password: neo4j / neo4j.


Note : On first login you need to set a new password, and then you can connect to Neo4j again.


To install Neo4j on Ubuntu, refer to the links below:


Following is some Cypher Query Language (CQL) syntax used in Neo4j :



Following are some Neo4j Cypher queries compared with MySQL queries for better understanding :


                                                                 Figure : 2

Hopefully this gives you an idea of how Cypher queries differ from MySQL.
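As the comparison figures are not reproduced here, one typical pair: in MySQL, finding whom a person works for requires joining person, works_for, and employer tables; in Cypher the relationship is traversed directly (a sketch over an invented schema):

```cypher
// Equivalent of: SELECT e.name FROM person p
//   JOIN works_for w ON w.person_id = p.id
//   JOIN employer e ON e.id = w.employer_id
//   WHERE p.name = 'John';
MATCH (p:Person {name: 'John'})-[:WORKS_FOR]->(e:Employer)
RETURN e.name
```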



Thanks ,



Change Hive metastore from derby to MySQL



Machine : Ubuntu 14.04  |  Hive : Hive 1.2.1

To change the Hive metastore from Derby to MySQL we need to follow these 8 simple steps.


Step-1 :

First we need to install MySQL and its dependencies on the system.

Command-1 : sudo apt-get install mysql-server

Note: Click Next > Next and set the password for MySQL.

Command-2 :  sudo apt-get install libmysql-java


Step-2 :

Create a soft link for the connector in the Hive lib directory.

Command-1 : ln -s /usr/share/java/mysql-connector-java.jar $HIVE_HOME/lib/mysql-connector-java.jar


Step-3 :

Access your MySQL and create a new database named metastore for Hive:

Command : $ mysql -u root -p

Enter password:

mysql> CREATE DATABASE metastore;


Step-4 :

Then we need a MySQL account for Hive to access the metastore. It is very important to prevent this user account from making any type of change to the schema.

Command :

mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'password';

mysql> GRANT ALL ON *.* TO 'hive'@'localhost' IDENTIFIED BY 'password';

mysql> FLUSH PRIVILEGES;


Step-5 :

Now we need to configure Hive to access the MySQL metastore. For this we need to update the hive-site.xml file (if the file does not exist, create a new one).





<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
  <description>metadata is stored in a MySQL server</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>MySQL JDBC driver class</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>user name for connecting to mysql server</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
  <description>password for connecting to mysql server</description>
</property>

Note : While updating, please take all the property tags only. (These are the standard javax.jdo.option settings; adjust the user name and password to the account created in Step-4.)


Step-6 :

Now we need to run the Hive schematool to initialize the MySQL metastore.

For this we need to go to the $HIVE_HOME/bin folder.

Command-1 : schematool -initSchema -dbType mysql

Note : If you find that your metastore is corrupted, then you need to reset the metastore.

  • Before you run Hive for the first time, run

Command : schematool -initSchema -dbType mysql

  • If you already ran Hive and then tried to initSchema and it's failing:

Command : mv metastore_db metastore_db.tmp

You will find your metastore_db directory at the $HIVE_HOME location.

  • Re-run schematool -initSchema -dbType mysql


Step-7 :

Start your Hive and access your tables.


Step-8 :

To validate it,

Connect and open your hive


Then create a table in it and insert one record.

hive> create table saurzcode(id int, name string);

hive> insert into table saurzcode values (1, "Helical");

Later, access your MySQL and open the metastore database:

mysql -u root -p

Enter password:

mysql> use metastore;

And see your table as a record in the TBLS table of the metastore database.

mysql> show tables;

mysql> select * from TBLS;


Hurray, Completed !!!!


Thanks for visiting, have a great day.



Foreign Data Wrappers in PostgreSQL


Foreign Data Wrapper (FDW) is a concept introduced in PostgreSQL 9.1. It allows us to access external, non-Postgres data in the form of a regular Postgres table. FDWs let us make external data sources (text files, web services, etc.) look like tables and give us easy access to them using SQL commands.

An FDW is a C library that acts like an adapter. Each foreign data wrapper is responsible for interacting with one particular kind of external data source. For example, it can be used for a finance department report delivered as a CSV that you need to reconcile against accounts, or in any data migration.

Uses :

1) Urgent Data Migration

2) Data  Analysis

3) Testing Purpose

There are various use cases for FDWs. Below are some examples:

Urgent Data Migration

Imagine that you have a project which uses a database that costs a lot of money. At some time in the project life-cycle it may happen that the most expensive features that the database provides aren’t used and there is no need to put a lot of money into something that can be replaced by an open source database like PostgreSQL. You can use an appropriate FDW to migrate your data from your costly external DB to Postgres.

Data Analysis

You may find yourself in a situation where you have to perform some database analytics for an external partner from PostgreSQL. Instead of writing scripts to update the data, you can use an FDW to set up foreign tables and keep them updated. And since PostgreSQL 9.3 provides write capability, the FDW for MySQL is writable.

Testing Purpose

If we don’t want to affect real data we can retrieve the data and manage it locally without interaction in the main database. This can be useful in performing tests.

To use an FDW, there are a few basic things you'll want to do:

Create Extension

Create the Remote server

Create a user mapping for the Remote server

Create your foreign tables

Start querying on data

Example : Here is an example of creating a foreign table with postgres_fdw.

Create Extension

First install the extension:

CREATE EXTENSION postgres_fdw;

Create the Foreign server

Then create a foreign server using CREATE SERVER. In this example we wish to connect to a PostgreSQL server listening on port 5432. The database to which the connection is made is named foreign_db on the remote server:

CREATE SERVER foreign_server

FOREIGN DATA WRAPPER postgres_fdw

OPTIONS (host '', port '5432', dbname 'foreign_db');

Create a user mapping for the Foreign server

A user mapping, defined with CREATE USER MAPPING, is needed as well to identify the role that will be used on the remote server:

CREATE USER MAPPING FOR local_user

SERVER foreign_server

OPTIONS (user 'foreign_user', password 'password');

Create your foreign tables

Now it is possible to create a foreign table with CREATE FOREIGN TABLE. In this example we wish to access the table named some_schema.some_table on the remote server. The local name for it will be foreign_table:

CREATE FOREIGN TABLE foreign_table (

  id integer NOT NULL,

  data text )

SERVER foreign_server

OPTIONS (schema_name 'some_schema', table_name 'some_table');

Start querying on data

SELECT * FROM foreign_table LIMIT 5;
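Since postgres_fdw foreign tables are also writable from PostgreSQL 9.3 onwards, the same foreign_table defined above accepts DML as well (a sketch):

```sql
-- These statements are pushed through to some_schema.some_table
-- on the remote server.
INSERT INTO foreign_table (id, data) VALUES (1, 'hello');
UPDATE foreign_table SET data = 'world' WHERE id = 1;
```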


Sample Implementation :



-- Foreign Server: wserver

-- DROP SERVER wserver

CREATE SERVER wserver

FOREIGN DATA WRAPPER postgres_fdw

OPTIONS (hostaddr '', dbname 'Basket_one', port '5434');

ALTER SERVER wserver OWNER TO Sachin;

-- User Mappings (the mapped role name was not preserved; CURRENT_USER is assumed)

CREATE USER MAPPING FOR CURRENT_USER

SERVER wserver

OPTIONS (user 'foo', password 'bar');

Create a table to hold the data in PostgreSQL as a foreign table:

-- local table name assumed to match the remote Employee table
CREATE FOREIGN TABLE Employee (

Employee_id int,

Employee_name text )

SERVER wserver

OPTIONS (dbname 'Senteno', table_name 'Employee' );


Thanks & Regards

A Prasanth Kumar



Lateral Join in PostgreSQL

Lateral join is a very useful feature in Postgres (Postgres 9.3+). The LATERAL key word can precede a sub-SELECT FROM item. It allows the sub-SELECT to refer to columns of FROM items that appear before it in the FROM list. We can say it's used for cross-referencing: it basically allows subqueries on the right-hand side of the join to reference columns defined on the left-hand side of the join.

It can be explained from a simple example :

Suppose there is a business scenario where you want to check whether a set of people belong to one subclass or not, and accordingly do the manipulations in the outer query.

select base.empid,
       case when x.flag = '1' then 'YES' else 'NO' end as check
from getempdata('04/03/2017') base   -- getempdata is the function

left join lateral

(select case when pbe.benefitsubclass in ('6Z','6Y') then '1' end as flag,
        pbe.personid, pbe.benefitsubclass,
        pbe.monthlyamount as amt1, pbe.monthlyemployeramount as amt2

from person_bene_election pbe

where base.empid = pbe.personid

and current_timestamp between pbe.createts and pbe.endts

and to_date('04/03/2017', 'mm/dd/yyyy') between pbe.effectivedate and pbe.enddate

and pbe.benefitelection in ('E','T')

and pbe.benefitsubclass in ('6Z')

) x on x.personid = base.empid

LEFT JOIN LATERAL allows us to use the empid from the function on the left side of the lateral join as a reference on the right side.
Also, LEFT JOIN LATERAL can return multiple rows, and the flag column computed in the select list of the lateral subquery can be used in the outer query.

In a gist, lateral  allows you to perform a sub-query in the FROM clause and reference column values from other records in the query. It is similar to a correlated subquery in that it can reference values from another query.

It has some advantages over the correlated subquery: it can return multiple rows and can be used in the FROM clause.


Java Naming and Directory Interface (JNDI)

Using JNDI, applications running in a distributed environment can look up services independently of the actual resource configuration.

How JNDI is useful?

  1.  During the deployment sequence, the same JNDI name can be used on different servers.
  2.  It helps prevent credential leakage to user groups like developers, testers, stagers etc., because the credentials are configured in the server, not in the application.
  3.  Connection pooling can be achieved.

How to use JNDI?
In order to use JNDI, the JNDI classes and one or more service providers should be present.
The service provider may be :

  1. Lightweight Directory Access Protocol (LDAP)
  2. Common Object Request Broker Architecture (CORBA)
  3. Common Object Services (COS) name service
  4. Java Remote Method Invocation (RMI) Registry

The objects need to be created and registered so that they can be further used for lookup.

How to configure JNDI in tomcat?
In tomcat environment we can configure JNDI in 3 different places
1. Inside the application using context.xml in META-INF directory.
2. Inside the Tomcat server context.xml : we can configure it inside the apache-tomcat/conf directory. This can be helpful when multiple applications need the same database.
3. Both in server.xml and context.xml

NB: The database driver jar should be present in the tomcat lib directory

For example (the driver class, URL, credentials, and pool sizes below are placeholder values):

<Resource name="jdbc/lookupName"
          auth="Container"
          type="javax.sql.DataSource"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://localhost:5432/mydb"
          username="dbuser"
          password="dbpassword"
          maxTotal="20"
          maxIdle="10"/>
In order to use it in a Java program we have to use the following:
DataSource ds = (DataSource) ctx.lookup("java:/comp/env/jdbc/lookupName");


Loading CSV File BatchWise -Talend

In an ETL job, when the source is a flat file, for example a CSV file, and the size of the file is large, the job has to read the whole file at a time, which puts load on the server that runs the job. Instead of reading the whole file, reading it in smaller batches and processing them is a better option. The following job explains how this is done using a Java routine which reads the CSV file with a buffered reader, and how the smaller chunk files are processed later from the folder where they are written.

Following is the Java code, which reads the file line by line according to the batch size (a number of lines).

package routines;

import java.io.*;

public class splitfile {

private File file;
private int numberOfLinesThreshold;
private String encoding;
private int readerBufferSize;
private String destinationFileName;

public splitfile(File file, int numberOfLinesThreshold, String encoding,
int readerBufferSize, String destinationFileName) {
this.file = file;
this.numberOfLinesThreshold = numberOfLinesThreshold;
this.encoding = encoding;
this.readerBufferSize = readerBufferSize;
this.destinationFileName = destinationFileName;
}

public void splitFile() {
String absolutePath = this.file.getAbsolutePath();
int index = absolutePath.lastIndexOf(".");

String path = absolutePath.substring(0, index);
String extension = absolutePath.substring(index + 1);

// Store the header for each file
String header = "";
int fileNameCounter = 1;
int counter = 0;
StringBuilder stringBuilder = new StringBuilder();
String currentLine;
String lineSeparator = System.lineSeparator();

String fileName = path + "." + extension;

try (BufferedReader bufferedReader = new BufferedReader(
new InputStreamReader(new FileInputStream(fileName),
this.encoding), this.readerBufferSize)) {
while ((currentLine = bufferedReader.readLine()) != null) {
if ("".equals(header)) {
// The first line is the header; repeat it in every chunk
header = currentLine;
stringBuilder.append(header).append(lineSeparator);
continue;
}

stringBuilder.append(currentLine).append(lineSeparator);
counter++;

if (counter == this.numberOfLinesThreshold) {
// Clear counter
counter = 0;

// Utilize the existing string builder
writeFile(fileNameCounter, stringBuilder,
this.destinationFileName, extension);

// For next file name
fileNameCounter++;

// For next file content, start again with the header
stringBuilder = new StringBuilder();
stringBuilder.append(header).append(lineSeparator);
}
}
if (counter > 0) {
// Write the last few remaining lines as another file
writeFile(fileNameCounter, stringBuilder, this.destinationFileName,
extension);
}
} catch (IOException ex) {
ex.printStackTrace();
}
}

private void writeFile(int fileNameCounter, StringBuilder stringBuilder,
String fileName, String extension) {
// Create a file and write the string builder
String name = getFileName(fileNameCounter, fileName, extension);
try (BufferedWriter bufferedWriter = new BufferedWriter(
new OutputStreamWriter(new FileOutputStream(name),
this.encoding))) {
bufferedWriter.write(stringBuilder.toString());
} catch (IOException ex) {
ex.printStackTrace();
}
}

private String getFileName(int fileNameCounter, String fileName,
String extension) {
return fileName + "_" + fileNameCounter + "." + extension;
}
}

Following is the main job, which calls the child job and passes the path which has the smaller chunk files.


Step 1: Create Split Files folder

Creates a folder in which to place all the split files, and writes the folder path to the context variable "Src_Split". This folder is created in "Src_Dir" with the name Split_(File Name).


Step 2: Split File

Calls the routine "splitfile", which reads the file in parts with the help of a buffered reader. It takes 5 parameters:

1. A File instance, with escaping for the path where the source file is located.

2. The batch size, which is assigned to the context variable "BatchSize".

3. The standard encoding format of the file, from the context variable "Enco4.des".

4. The buffer size, from the context variable "BufferSize".

5. The destination folder which was created, passed here followed by an escape sequence in order to handle "\", plus the file name: "context.Src_Split+context.Src_Csv". (Src_Csv is just the name of the file with no extension.)


Step 3 Split Folder

Give the directory as "context.Src_Split" and the file mask as "*.csv".

Step 4 CSV2Stg_Stg_Dim_Organization

Calls the job which reads the files and processes them to load Stg_Dim_Organization. Just set 2 context variables:

1. Src_Folder to ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))

2. Src_Csv to ((String)globalMap.get("tFileList_1_CURRENT_FILE"))


Step 5 CSV2Stg_Stg_Dim_Department

Calls the job which reads the files and processes them to load Stg_Dim_Department. Just set 2 context variables:

1. Src_Folder to ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))

2. Src_Csv to ((String)globalMap.get("tFileList_1_CURRENT_FILE"))

Guide to Slowly Changing Dimensions [Intro + Type 1]

Firstly what is a dimension?
A dimension is a structure that categorizes facts and measures which can be used to understand business requirements.

What is a Slowly Changing Dimension?
Slowly Changing Dimensions are dimensions that change slowly over time.

Why is it needed?
In Data Warehousing, which deals with historical data, tracking changes is important as it helps give a better understanding of the business.

An Example
Imagine having a customer dimension in a taxation department which holds records of addresses and so on.
When a customer changes his/her address, it is important to track the address change, i.e. from old to new.

Types of Slowly Changing Dimension.
1. Type 1
2. Type 2
3. Type 3
4. Type 4
5. Type 5

1. Type 1:
This follows the technique of overwriting the previous value. No historic data is kept; it is commonly used for the correction or updating of records.
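In SQL terms, a Type 1 load is an overwrite-or-insert against the dimension; the customer_dim and staging_customers tables below are invented for illustration:

```sql
-- Overwrite in place: the old value is lost, no history is kept.
UPDATE customer_dim d
SET    region = s.region
FROM   staging_customers s
WHERE  d.client_id = s.client_id
  AND  d.region <> s.region;

-- Rows whose business key is not in the dimension yet are inserted.
INSERT INTO customer_dim (client_id, name, region)
SELECT s.client_id, s.name, s.region
FROM   staging_customers s
WHERE  NOT EXISTS (SELECT 1 FROM customer_dim d
                   WHERE d.client_id = s.client_id);
```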

In this blog I will show you how to create a Type 1 Slowly Changing Dimension using Pentaho Data Integration:

1. Identify your source or create one if you have to.
CSV input

2. Load Data into the Database table
Load Data

Now Open a New transformation
3. Select an input component for your source, in my case a CSV input.
4. Select a Get System Info input component [Optional]
Double Click on it and
For Type: Select System Date (fixed)
For Name: load_data (Any Name)

5. If you are using a Get System Info step then this step is necessary:
Select the Join Rows (Cartesian product) input component.
What this does is make a cross join of the new load date column to the source table.
Select OK

6. Select a Select Values input component;
this is used to add, remove, or alter fields in the source before going to the target, if required.

Note: Make sure all the components are connected as shown below:
Before insert
Double Click on Select Values Component

Click on Get fields to select
Click on Meta-data and again select Get Fields to change
Click OK!

7. Select an Insert/Update input component
Select your target Schema
Select your target Table

Now select Get Fields
and remove the fields you don't require, while leaving the main key or primary key to look up values,
in my case client_id.


Select Get Update Fields,
and in the update column select N for the columns you don't want to update,
in my case again "client_id".

Now, if you selected the Get System Info component to have a load date column, then the step below is important.

Remember we added load_date; we want to add that column,

so select SQL,

look through the pre-defined query shown and you will see an ALTER TABLE to add the column load_date,
so all we need to do is select Execute, then run a query to see if the column has been added as shown below.

load_date added

Click OK!

8. Now go to your source, edit one record and add another, as shown below,

and then save it.

I have added a new row and also edited Sohail's region from NY to NJ.

Now your whole transformation should look something like this, with all the connections pointing to the right components.

after all

Now Run the Transformation

Now after running successfully

Check your data via query

final data

As you can see, the data has been replaced/overwritten for Sohail [NY to NJ],
and 1 new row of data has also been added.

In my next blog we will discuss Type 2 of Slowly Changing Dimensions.

Izebhijie Sohail Ehizogie

Handling Rollback and Database transactions in Pentaho Data Integration.



While inserting data into a database table, we often want the whole transaction to be rolled back if the job/transformation fails in the middle of the data load.

In general, the table output step in PDI will insert and commit any number of records until there is an error. Consider an example where the ETL process processes 5000 records and the 3999th record has a string instead of an integer (e.g. the DDL defines the field as integer); the job/transformation will come to a sudden halt.

The problem now, though, is that we have 3,999 records in the DB table from this current ETL run. In some situations this might not be an issue, e.g. if you have a CDC process set up which can deal with this (e.g. it can just start off from the highest id or timestamp), but in other scenarios this situation might be a problem.

Pentaho Data Integration provides a transactional feature which is really quite easy to use. Make sure your DB engine supports transactions! E.g. MySQL's MyISAM engine does not support transactions, so the transformation would run and insert records even though there was an error.

Transactions in transformations: simply enable the "Make the transformation database transactional" option in the "Miscellaneous" tab of the transformation settings dialog.


Note: keep the following caveats in mind:

  • This option disables the "Use batch update for insert" option in the Table Output step and ignores the Commit size setting, so it will effectively slow down the insert operation.
  • Database transactions will not work if you use database partitioning, where multiple connections per step copy need to be created. Performance-wise it doesn't make any sense in that case anyway.

Use of Lateral Subqueries in PostgreSQL

Lateral subqueries are subqueries appearing in FROM that are preceded by the key word LATERAL. This allows them to reference columns provided by preceding FROM items. (Without LATERAL, each subquery is evaluated independently and so cannot cross-reference any other FROM item.)

In simple words, the LATERAL keyword allows you to pass column values from the previous join into your subquery.

Note: This feature is available in Postgres version 9.3 and above.

For Example :

Let's take an example: suppose you are joining two tables (table1 and table2) which return 10 rows after joining, and you want to add 1 column which holds a YTD value. For this new column you have written a subquery; let's say 100 rows are returned by the subquery for the year 2015.

Table Name

employee = employee_id, company_id, emp_name
company = company_id, companyname
emp_sal = employee_id, month,year, salary

Subquery without Lateral Keyword.

select e.emp_name, c.companyname, subquery.ytdvalue, subquery.year
from employee e join company c on e.company_id = c.company_id
join ( select sum(salary) as ytdvalue, es.year, e1.employee_id
from employee e1 join emp_sal es on e1.employee_id = es.employee_id
group by e1.employee_id, es.year
) subquery on subquery.employee_id = e.employee_id
where e.employee_id in (1,2,3,4,5,6,7,8,9,10)

Explanation: The subquery above computes all 100 rows first; they are then joined on e.employee_id to return 10 rows as the final output.

Subquery with Lateral Keyword

select e.emp_name, c.companyname, lateralsubquery.ytdvalue, lateralsubquery.year
from employee e join company c on e.company_id = c.company_id
join lateral
( select sum(salary) as ytdvalue, es.year, e1.employee_id
from employee e1 join emp_sal es on e1.employee_id = es.employee_id
where e1.employee_id = e.employee_id
group by e1.employee_id, es.year
) lateralsubquery on lateralsubquery.employee_id = e.employee_id
where e.employee_id in (1,2,3,4,5,6,7,8,9,10)

Explanation: The lateral subquery above runs 10 times, once for each individual id, because we pass e.employee_id into the subquery.

For more information visit

Thanks and Regards
Fahad Anjum

Beginner’s Guide to E.T.L (Extract, Transform and Load) – Connection Set-Up

Connection Setup for connection type: PostgreSQL

[We are now setting up a connection to a database, i.e. if your source is a database.]
There are 3 ways of access provided for connections using PostgreSQL:
a. Native (JDBC)
b. JNDI
c. ODBC

a. Native (JDBC)
I. Enter the Connection Name: Anyname
II. Select the connection type: PostgreSQL
III. Host Name: localhost [This can be an ip address]
Database Name: LearnETL[Name of the database you are using]
Port Number: 5432 or your required Port Number
User Name: Your database user name
Password: Your database password
IV. Test Connections and OK.

b. JNDI

Here we need to go to the data-integration folder, open up the sub-folder "simple-jndi", and edit the jdbc.properties file.

Here we need to write the following code (the standard simple-jndi entry format; the values mirror the JDBC settings above):

ETL_Connection/type=javax.sql.DataSource
ETL_Connection/driver=org.postgresql.Driver
ETL_Connection/url=jdbc:postgresql://localhost:5432/LearnETL
ETL_Connection/user=user
ETL_Connection/password=password

ETL_Connection: name of the connection
localhost:5432/LearnETL: localhost is the host name, 5432 is the port number and LearnETL is the database name.
user: username
password: password

Save, and go back to the Database Connection dialog.
Restart your PDI.

Then in the setup, select JNDI, and for
JNDI Name : give the name of your connection [ETL_Connection]


c. ODBC

This is not commonly used, but what will be needed is as follows:

1. Install the PostgreSQL ODBC driver, which can be downloaded.
2. Select the PostgreSQL ODBC Unicode driver, and then
3. in the setup, enter the Data Source Name, Server, Username, Password and Port. Test and save if the connection is OK.
4. A bundle of JAR files has to be copied into your Java folder, as the JDBC-ODBC bridge has been discontinued in Java 8.

Thank You
Sohail Izebhijie