Recursive WITH Queries in Postgres

Sometimes we need to represent a hierarchical relation in our data.
This can be done with the help of recursive WITH queries (recursive CTEs) in Postgres.

Consider an example: locations are stored in a table location_hierarchy as (parent, child) rows. The query output should contain all the children of 'India' along with their depth relative to 'India'. The result should look somewhat similar to this:

Sample Data:

location data

In such cases, a self-JOIN query will not work, since the depth of the hierarchy is arbitrary and may change in the future.

For this, we use a special form of query called WITH RECURSIVE. It is also called a PostgreSQL hierarchical query, since it is usually used to query hierarchical data like the data we are discussing. WITH RECURSIVE is an extension of the WITH query, which is referred to as a Common Table Expression (CTE) in PostgreSQL. A WITH query can be seen as forming temporary table(s) whose scope is a single query, or as a named sub-query.

Query:

WITH RECURSIVE children AS (
    SELECT child, 1 AS depth            -- non-recursive part
    FROM location_hierarchy
    WHERE parent = 'India'

    UNION ALL

    SELECT a.child, b.depth + 1         -- recursive part
    FROM location_hierarchy a
    JOIN children b ON (a.parent = b.child)
)
SELECT * FROM children;


Query Output
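Since the output table itself is not reproduced here, the traversal the recursive CTE performs can be sketched in plain Python. The (parent, child) rows below are assumed sample data for illustration, not the actual location table:

```python
# (parent, child) rows standing in for the location_hierarchy table.
rows = [
    ("India", "Maharashtra"),
    ("India", "Telangana"),
    ("Maharashtra", "Mumbai"),
    ("Telangana", "Hyderabad"),
]

def descendants(rows, root):
    """Breadth-first walk mirroring the recursive CTE: the non-recursive
    part seeds the direct children at depth 1, and the recursive part
    joins each new level back to the table."""
    result = []
    frontier = [(child, 1) for parent, child in rows if parent == root]
    while frontier:
        result.extend(frontier)
        frontier = [
            (child, depth + 1)
            for node, depth in frontier
            for parent, child in rows
            if parent == node
        ]
    return result

print(descendants(rows, "India"))
# → [('Maharashtra', 1), ('Telangana', 1), ('Mumbai', 2), ('Hyderabad', 2)]
```

The loop stops when a level produces no children, just as the recursive part of the CTE stops when its join returns no rows.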

Prasad Kshirsagar




Cassandra Installation and Cluster Setup on Ubuntu


Cassandra Introduction

Apache Cassandra is a free and open-source distributed NoSQL database management system. It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Cassandra prerequisites

Minimum hardware requirements:

CPU: 2 cores


Software requirements:

Java 8

Installation and Cluster Setup Steps

To set up a Cassandra cluster, we need to install Cassandra on each server that will be part of the cluster.

Here, we are setting up a 4-node Cassandra cluster.

Suppose we have 4 nodes, namely XXX.XX.XX.1, XXX.XX.XX.2, XXX.XX.XX.3 and XXX.XX.XX.4.





Note: In order to set up a cluster, perform the following steps on each involved server.


  1. Install Java 8 on each server using the following commands on the terminal. (Optional if already installed.)
    1. To make the OpenJDK package available, you'll have to add a Personal Package Archive (PPA) using this command.
      sudo add-apt-repository ppa:openjdk-r/ppa
    2. Update the package database.
      sudo apt-get update
    3. Now install Java 8.
      sudo apt-get install openjdk-8-jdk
  2. Install Cassandra on each server using the following list of commands on the terminal. Here we are installing Cassandra version 3.10.
    1. echo "deb 310x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list

      Note: 310x means we are installing Cassandra version 3.10. You can change this value if you want to install a different version of Cassandra.

    2. Add the Apache Cassandra repository keys using the following command.
      curl | sudo apt-key add -
    3. Update the package database using the following command.
      sudo apt-get update

      Note: If you encounter a public-key error, such as "The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A278B781FE4B2BDA",

      then add the public key A278B781FE4B2BDA as follows:

      sudo apt-key adv --keyserver --recv-key A278B781FE4B2BDA

      Now repeat 'sudo apt-get update'. The actual key may be different; you can get it from the error message itself.

    4. Now install Cassandra using the following command.
      sudo apt-get install cassandra

      By now, you have installed Cassandra on each involved server.
      Go through the following steps to set up the cluster.

Cluster Setup Steps

  1. Stop the Cassandra service on each server (XXX.XX.XX.1, XXX.XX.XX.2, XXX.XX.XX.3, XXX.XX.XX.4) using the following command.
    sudo service cassandra stop
  2. Delete the default Cassandra data on each server, located in Cassandra's data directory, using the following command.
    sudo rm -rf /var/lib/cassandra/data/system/*
  3. Now open the cassandra.yaml file, located in the /etc/cassandra directory, on each server one after another to configure the cluster.
    Note: The following are the parameters from cassandra.yaml that need to be configured in order to set up the cluster.
    cluster_name : You can change the cluster name by setting this parameter. It is optional.
    -seeds : This is a comma-delimited list of the IP addresses of the nodes in the cluster.
    listen_address : This is the IP address that other nodes in the cluster will use to connect to this one. It defaults to localhost and needs to be changed to the IP address of the node.
    rpc_address : This is the IP address for remote procedure calls. It defaults to localhost. If the server's hostname is properly configured, leave this as is. Otherwise, change it to the server's IP address or the loopback address (127.0.0.1).
    endpoint_snitch : Name of the snitch, which is what tells Cassandra what its network looks like. This defaults to SimpleSnitch, which is used for networks in one datacenter. In our case, we'll change it to GossipingPropertyFileSnitch, which is preferred for production setups.
    auto_bootstrap : This directive is not in the configuration file, so it has to be added and set to false. This makes new nodes automatically use the right data. It is optional if you're adding nodes to an existing cluster, but required when you're initializing a fresh cluster, that is, one with no data.

    1. Open cassandra.yaml on XXX.XX.XX.1 node.
      vi /etc/cassandra/cassandra.yaml

      Search and set the parameters as follows,

      cluster_name: 'Helical'
       - seeds: "XXX.XX.XX.1,XXX.XX.XX.2,XXX.XX.XX.3,XXX.XX.XX.4"
       listen_address: XXX.XX.XX.1
       rpc_address: XXX.XX.XX.1
       endpoint_snitch: GossipingPropertyFileSnitch
       auto_bootstrap: false
    2. Open cassandra.yaml on XXX.XX.XX.2 node.
      vi /etc/cassandra/cassandra.yaml

      Search and set the parameters as follows,

      cluster_name: 'Helical'
       - seeds: "XXX.XX.XX.2,XXX.XX.XX.1,XXX.XX.XX.3,XXX.XX.XX.4"
       listen_address: XXX.XX.XX.2
       rpc_address: XXX.XX.XX.2
       endpoint_snitch: GossipingPropertyFileSnitch
       auto_bootstrap: false
    3. Open cassandra.yaml on XXX.XX.XX.3 node.
      vi /etc/cassandra/cassandra.yaml

      Search and set the parameters as follows,

      cluster_name: 'Helical'
       - seeds: "XXX.XX.XX.3,XXX.XX.XX.1,XXX.XX.XX.2,XXX.XX.XX.4"
       listen_address: XXX.XX.XX.3
       rpc_address: XXX.XX.XX.3
       endpoint_snitch: GossipingPropertyFileSnitch
       auto_bootstrap: false
    4. Open cassandra.yaml on XXX.XX.XX.4 node.
      vi /etc/cassandra/cassandra.yaml

      Search and set the parameters as follows,

      cluster_name: 'Helical'
       - seeds: "XXX.XX.XX.4,XXX.XX.XX.1,XXX.XX.XX.2,XXX.XX.XX.3"
       listen_address: XXX.XX.XX.4
       rpc_address: XXX.XX.XX.4
       endpoint_snitch: GossipingPropertyFileSnitch
       auto_bootstrap: false
  4. Start the Cassandra daemon on each server.
    sudo service cassandra start

    Note: You can check the status of Cassandra using the following command.

    sudo service cassandra status
  5. Check the status of the Cassandra cluster using the nodetool utility. Run the command on one node, say XXX.XX.XX.1, as follows.
    sudo nodetool status

    You will get output like,

    Datacenter: DC1
    ===============
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    -- Address     Load       Tokens Owns (effective) Host ID                              Rack
    UN XXX.XX.XX.1 30.94 GiB  256    100.0%           ae45c3c5-30e3-4d60-9e5c-3f0004bbfea4 rack1

    Note: You'll find that only the local node is listed, because it is not yet able to communicate with the other nodes.

  6. Allow communication by opening the following network ports on each node using the following commands.
    sudo ufw allow 7000

    7000 is the TCP port for commands and data.

    sudo ufw allow 9042

    9042 is the TCP port for the native transport server. cqlsh, the Cassandra command line utility, will connect to the cluster through this port.
    Note: You can change the port numbers as required by changing them in cassandra.yaml:
    storage_port: 7000
    native_transport_port: 9042

  7. Now check the status of the Cassandra cluster again.
    sudo nodetool status

    Output would be like this,

    Datacenter: dc1
    ===============
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    -- Address     Load       Tokens Owns (effective) Host ID                              Rack
    UN XXX.XX.XX.3 31.3 GiB   256    100.0%           20412cb7-b9bf-4373-bf6f-d0958a993c95 rack1
    UN XXX.XX.XX.4 30.78 GiB  256    100.0%           a49ee892-f816-4e0d-84b4-3b3bf1421f33 rack1
    UN XXX.XX.XX.1 30.94 GiB  256    100.0%           ae45c3c5-30e3-4d60-9e5c-3f0004bbfea4 rack1
    UN XXX.XX.XX.2 32.5 GiB   256    100.0%           f9084fd5-06f4-4408-ba90-32c9afd60ede rack1

You can see that all the nodes are up and normal.

In this way, you can install and set up a Cassandra cluster.
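The four per-node cassandra.yaml edits in step 3 follow one pattern, differing only in listen_address and rpc_address. As a sketch, that pattern can be rendered programmatically; the node IPs and cluster name below are placeholders exactly as in the steps above, and auto_bootstrap follows the fresh-cluster note in step 3:

```python
# Node IPs are placeholders, mirroring the XXX.XX.XX.N notation above.
NODES = ["XXX.XX.XX.1", "XXX.XX.XX.2", "XXX.XX.XX.3", "XXX.XX.XX.4"]

def node_settings(node_ip, all_nodes, cluster_name="Helical"):
    """Return the cassandra.yaml values for one node: most settings are
    identical on every node; only listen_address and rpc_address vary."""
    return {
        "cluster_name": cluster_name,
        "seeds": ",".join(all_nodes),   # same seed list on every node
        "listen_address": node_ip,      # this node's own IP
        "rpc_address": node_ip,
        "endpoint_snitch": "GossipingPropertyFileSnitch",
        "auto_bootstrap": False,        # fresh cluster, per the note in step 3
    }

for ip in NODES:
    print(node_settings(ip, NODES))
```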





MapReduce in Hadoop

MapReduce is designed for processing large volumes of data in parallel.

It is an execution model in the Hadoop framework which is sub-divided into two separate phases:

  1. Mapper phase,
  2. Reducer Phase

Mapper Phase: During this phase, the input data is split for analysis by map tasks running in parallel across the Hadoop cluster. Each map task extracts the required output key and output value and writes them to local disk.

Reducer Phase: It has two responsibilities:

  1. Grouping the data based on key
  2. Aggregation

Once the output is returned, the mapper output is immediately deleted.

If our query or job has no requirement for grouping and aggregation, we can skip the reducer (a map-only job). In that situation, the mapper output is permanent.

For both the mapper and the reducer, input and output should be in key-value pairs.

Identity Mapper: It is like the identity function in mathematics.

The Identity Mapper takes the input key/value pair and writes it out without any processing.

Identity Reducer:

With an Identity Reducer, the reduce step still takes place, and the related sorting and shuffling are also performed, but there is no aggregation.

So if we want to sort the data coming from the map phase, don't care about any grouping, and are fine with multiple reducer output files, we can use the Identity Reducer.

Combiner in MapReduce:

The combiner is an optimization for a MapReduce job. The combiner function runs on the output of the map phase and is used as a filtering or aggregating step to lessen the number of intermediate keys being passed to the reducer. In most cases, the reducer class is set as the combiner class. The output of the combiner is the intermediate data passed to the reducer, whereas the output of the reducer is written to the output file on disk.
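The map, combine and reduce steps described above can be sketched with a plain-Python word count. The function names are illustrative, not part of the Hadoop API:

```python
from collections import defaultdict

def map_phase(line):
    # Word-count mapper: emit a (word, 1) pair for every word.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combiner: pre-aggregate one mapper's local output so fewer
    # intermediate (key, value) pairs are passed to the reducer.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def reduce_phase(grouped):
    # Reducer: final aggregation over values grouped by key.
    return {key: sum(values) for key, values in grouped.items()}

splits = ["big data big", "data big"]                 # one split per mapper
combined = [combine(map_phase(s)) for s in splits]    # map + combine locally
grouped = defaultdict(list)
for pairs in combined:                                # shuffle: group by key
    for key, value in pairs:
        grouped[key].append(value)
print(reduce_phase(grouped))                          # → {'big': 3, 'data': 2}
```

Without the combiner, five (word, 1) pairs would cross the shuffle; with it, only three pre-summed pairs do, which is the whole point of the optimization.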


Rupam Bhardwaj


Different types of models used by databases

There are a lot of databases out there in the market; let's understand the model types and their corresponding usage.

Different types of models used by databases
• Relational DBMS
• Wide column stores
• Content stores
• Document stores
• Event stores
• Graph DBMS
• Key-value stores
• Multivalue DBMS
• Native XML DBMS
• Navigational DBMS
• Object Oriented DBMS
• RDF Stores
• Search Engines


Relational databases: A relational database is a collection of data items organized in sets, often called tables, which are related to each other. SQL statements are used to fetch, update, and delete information from these tables.
Popular relational databases are
• Oracle
• MySQL
• Microsoft SQL Server
• PostgreSQL


Wide column stores: This type of database stores data in the form of records, with the ability to hold any number of dynamic columns. These databases don't follow a fixed schema; you can think of the data as a two-dimensional key-value store.
Popular wide column store databases are
• Cassandra
• HBase
• Accumulo


Content stores: Also called content repositories, these are specialized in the management of digital content, such as text, pictures or videos, including their metadata. Some of the features included are full-text search, versioning, hierarchically structured content, and access control.
Popular content store databases are
• Jackrabbit
• ModeShape


Document stores: Also called document-oriented database systems, these are characterized by their schema-free organization of data. They store records, and each record may have different columns. Document stores often use internal notations that can be processed directly in applications, mostly JSON.
Popular document store databases are
• MongoDB
• CouchDB
• Couchbase


Event stores: They persist all state-changing events for an object, together with a timestamp, thereby creating a time series for each object. The current state of an object can be inferred by replaying all events for that object from time 0 until the current time.
Popular event stores are
• InfluxDB
• Event Store


Graph DBMS: Graph DBMS, also called graph-oriented DBMS or graph databases, represent data in graph structures as nodes and edges, where edges are relationships between nodes. They allow easy processing of data in that form, and simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another.
Popular graph databases are
• Neo4j
• Titan
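The "number of steps needed to get from one node to another" mentioned above is a shortest-path computation; here is a minimal breadth-first sketch over an invented adjacency list (the nodes and edges are made up for illustration):

```python
from collections import deque

# Toy undirected graph as adjacency lists; nodes/edges are illustrative.
graph = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol"],
}

def steps_between(graph, start, goal):
    """Breadth-first search returning the minimum number of edges
    between two nodes, or None if they are not connected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

print(steps_between(graph, "alice", "dave"))  # → 3
```

A graph DBMS exposes this kind of traversal as a first-class query, rather than forcing it through repeated relational joins.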


Key-value stores: They can only store pairs of keys and values, and retrieve a value when its key is known. An example of key-value data is shown below.

Key               Value
Student1_maths    100
Student1_english  80
Student2_maths    100

Popular key-value stores are
• DynamoDB
• Redis
• Memcached


Multivalue DBMS: Also called multidimensional databases, these are very similar to relational databases, but differ in that their attributes can hold lists of values, rather than all attributes being single-valued.
Popular multivalue databases are
• Adabas
• D3
• UniData/UniVerse
• jBASE


Native XML DBMS: The internal data model of these databases corresponds to XML documents. Native XML DBMS do not necessarily store data as XML documents; they can use other formats for better efficiency. Such a database generally defines a logical model for an XML document (as opposed to the data in that document) and stores and retrieves documents according to that model, with an XML document as its fundamental unit of storage.
Popular native XML DBMS are
• MarkLogic
• Sedna


Navigational DBMS: These databases allow access to data sets only via linked records. They were the first established systems able to manage large amounts of data.
Popular navigational DBMS are
• IMS
• IDMS


Object Oriented DBMS: These databases store information in the form of objects, as used in object-oriented programming, which makes them different from table-oriented relational databases.
An object-oriented DBMS follows an object-oriented data model with classes (the schema of objects), properties and methods. An object is always managed as a whole. This means, for example, that the insertion of an object, which in a relational system would probably be stored in multiple tables, is performed automatically as one atomic transaction, without any action by the application program. Reading an object can also be done as a single operation, without complex joins.
There are tools and architectures for storing objects in relational databases (such as Hibernate or JPA).
Popular object oriented databases are
• Caché
• ObjectDB


RDF Stores: The Resource Description Framework (RDF) is a methodology for describing information, originally developed for describing the metadata of IT resources. Today it is used much more generally, often in connection with the semantic web, but also in other applications.
The RDF model represents information as triples in the form subject-predicate-object.
Database management systems which are able to store and process such triples are called RDF stores or triple stores.
Popular RDF stores are
• AllegroGraph
• Jena


Search Engines: Search engines are NoSQL database management systems dedicated to searching data content. Beyond general optimization for this type of application, they typically offer the following features:
• Support for complex search expressions
• Full text search
• Stemming (reducing inflected words to their stem)
• Ranking and grouping of search results
• Geospatial search
• Distributed search for high scalability
Popular search engines are:
• Elasticsearch
• Solr
• Sphinx

Getting Started with MongoDB

Installation & Startup:

Download the MongoDB installer for the Windows platform and run it. This simply extracts the binaries to your Program Files.

#Create DBPATH and log directories

Allocate a folder in your system that can be used for holding the Mongo databases, and also allocate a log file.

Ex: we allocated "C:\mongo\data\db" for databases and "C:\mongo\logs\mongo.log" as the log file.

#Starting the mongo database

Below are different ways of starting the mongodb:

1.    From the command prompt

Execute mongod.exe, present in the bin folder, to start the database.

On the command prompt → mongod --dbpath c:\mongo\data\db

There are other options that can also be specified along with dbpath. If dbpath is not provided, mongod looks for the c:\data\db folder and gives an error if it is not found.

To shut down, press CTRL+C.


2.    Starting with a config file

You can create a configuration file to define settings for the MongoDB server, like the dbpath, logpath etc. Below is a sample file:

(This is the older format; version 2.6 introduced a new YAML-based format. The older format is supported for backward compatibility.)

#This is an example config file for MongoDB

dbpath = C:\Mongo\data\db

port = 27017

logpath = C:\Mongo\logs\mongo.log

Now you can use the command below:

C:\Program Files\MongoDB 2.6 Standard\bin>mongod --config mongo.conf

2014-04-15T10:27:18.883+0530 log file "C:\Mongo\logs\mongo.log" exists; moved to


As we haven't specified the "logappend" option in the config file, it allocates a new log file every time you start the db. You can check the log file if you are getting errors while connecting to the db.

To shut down, use the command "mongod --shutdown".


3.    Installing as a Windows service:

Start the command prompt as an administrator.

You can use the command below to create the service; edit it as per your settings:

sc create MongoDB binPath= "\"C:\Program Files\MongoDB 2.6 Standard\bin\mongod.exe\" --service --config=\"C:\Program Files\MongoDB 2.6 Standard\bin\mongo.conf\"" DisplayName= "MongoDB 2.6 Standard"

Please note that this is a single line of command.

You can now simply start/stop the service to start/shutdown the mongo database.


Using the Mongo command shell:

Run mongo.exe from the \bin folder and you will see the below:

MongoDB shell version: 2.6.0

connecting to: test      //This is the Default database

Welcome to the MongoDB shell.

For interactive help, type "help".

For more comprehensive documentation, see

Questions? Try the support group


Some basic commands to get you started

> show dbs                  // show databases

admin  (empty)
local  0.078GB

> use names               // switch to a particular database/creates one if it does not exist

switched to db names

> db.mynames.insert({name: 'shraddha', email: '[email protected]'})           // Inserting document
WriteResult({ "nInserted" : 1 })

//Note that 'db' points to the current database in use. Here, the collection "mynames" is automatically created when you insert a document

> show dbs

admin  (empty)
local  0.078GB
names  0.078GB

> db.mynames.find()               //query the db, select operation

{ "_id" : ObjectId("534cbfd03dfb3fbd86d8029d"), "name" : "shraddha", "email" : "[email protected]" }

//One more way of inserting……

> a={"name":"test3","email":"test3.other"}

{ "name" : "test3", "email" : "test3.other" }

> b={"name":"test4",email:"test4.other"}

{ "name" : "test4", "email" : "test4.other" }

> db.othernames.insert(a)

WriteResult({ "nInserted" : 1 })

> db.othernames.insert(b)

WriteResult({ "nInserted" : 1 })

> db.othernames.insert(c)

2014-04-15T19:40:24.798+0530 ReferenceError: c is not defined

//…In all the above inserts, the "_id" field, which holds the unique key, is auto-generated.


> coll=db.mynames


> coll.find()

{ "_id" : ObjectId("534cbfd03dfb3fbd86d8029d"), "name" : "shraddha", "email" : "[email protected]" }
{ "_id" : ObjectId("534d3b89f4d4b90697c205d6"), "name" : "test1", "email" : "test1.helical" }

> coll=db.othernames


> coll.find()

{ "_id" : ObjectId("534d3dc3f4d4b90697c205d7"), "name" : "test3", "email" : "test3.other" }
{ "_id" : ObjectId("534d3dcdf4d4b90697c205d8"), "name" : "test4", "email" : "test4.other" }


> coll.find({name:{$gt:"test3"}})                  //find documents where "name" is > "test3"

{ "_id" : ObjectId("534d3dcdf4d4b90697c205d8"), "name" : "test4", "email" : "test4.other" }

> coll.find({name:"test3"})

{ "_id" : ObjectId("534d3dc3f4d4b90697c205d7"), "name" : "test3", "email" : "test3.other" }


> coll.find({$or:[{name:{$gt:"test3"}},{name:"test3"}]})

{ "_id" : ObjectId("534d3dc3f4d4b90697c205d7"), "name" : "test3", "email" : "test3.other" }
{ "_id" : ObjectId("534d3dcdf4d4b90697c205d8"), "name" : "test4", "email" : "test4.other" }

> coll.find({$or:[{name:{$gt:"test3"}},{name:"test0"}]})

{ "_id" : ObjectId("534d3dcdf4d4b90697c205d8"), "name" : "test4", "email" : "test4.other" }
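The $gt and $or filters above can be mimicked over plain Python dictionaries, which makes the matching semantics explicit. This is a sketch of the behaviour, not MongoDB's implementation:

```python
# The two documents inserted into "othernames" above.
docs = [
    {"name": "test3", "email": "test3.other"},
    {"name": "test4", "email": "test4.other"},
]

def matches(doc, query):
    """Tiny evaluator covering only the operators used above:
    equality, $gt (string comparison), and $or."""
    for key, cond in query.items():
        if key == "$or":
            if not any(matches(doc, sub) for sub in cond):
                return False
        elif isinstance(cond, dict) and "$gt" in cond:
            if not doc.get(key, "") > cond["$gt"]:
                return False
        elif doc.get(key) != cond:
            return False
    return True

def find(docs, query):
    return [doc for doc in docs if matches(doc, query)]

print(find(docs, {"$or": [{"name": {"$gt": "test3"}}, {"name": "test0"}]}))
# → [{'name': 'test4', 'email': 'test4.other'}]
```

Note that, like the shell example, "test3" does not match {$gt: "test3"} because $gt is strictly greater than.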

//Example - manually setting the "_id" field (key value)
> coll=db.testobjs


> coll.insert({_id:1,fld1:"abc",fld2:123})

WriteResult({ "nInserted" : 1 })

> coll.insert({_id:2,fld1:"cde",fld2:345})

WriteResult({ "nInserted" : 1 })

> coll.insert({_id:2,fld1:"cde",fld2:345})       //trying to insert duplicate value in _id

WriteResult({
        "nInserted" : 0,
        "writeError" : {
                "code" : 11000,
                "errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: names.testobjs.$_id_  dup key: { : 2.0 }"
        }
})

> coll.find()

{ "_id" : 1, "fld1" : "abc", "fld2" : 123 }
{ "_id" : 2, "fld1" : "cde", "fld2" : 345 }


Importing a CSV file into MongoDB:

Alter the command below as per your requirement and execute:

C:\Program Files\MongoDB 2.6 Standard\bin>mongoimport --db northwind --collection orders --type csv --file C:\Shraddha\Official\MongoDB\northwind-mongo-master\orders.csv --headerline

connected to:
2014-04-17T18:24:22.603+0530 check 9 831
2014-04-17T18:24:22.604+0530 imported 830 objects


Options used:

--db : name of the database
--collection : name of the collection (orders)
--type : type of the input file (we can also import TSV, JSON)
--file : path of the input file
--headerline : signifies that the first line in the CSV file contains the column names
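What --headerline does can be illustrated in plain Python: the first CSV row supplies the field names, and each remaining row becomes one document. The rows below are invented sample data, not the real orders.csv:

```python
import csv
import io

# A tiny stand-in for orders.csv; the real file is given via --file.
raw = "OrderID,Customer\n10248,VINET\n10249,TOMSP\n"

# csv.DictReader does what --headerline asks for: the first line
# supplies the field names, and each later line becomes one document.
documents = [dict(row) for row in csv.DictReader(io.StringIO(raw))]
print(documents)
# → [{'OrderID': '10248', 'Customer': 'VINET'}, {'OrderID': '10249', 'Customer': 'TOMSP'}]
```

mongoimport additionally tries to infer numeric types and generates an _id for each document, which this sketch does not attempt.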


Shraddha Tambe

Helical IT Solutions