Analytic Query in Pentaho Data Integration

Analytic Query in Pentaho Data Integration [LAG & LEAD]

The could be a time we get a requirement as follows:

Fetch Every Purchases Sum for each Customer in a year, having the following columns

customerid, customername, productcost, purchasemonth

Now this looks easy, but the tricky part would be fetching the


he/she made a purchase

If your using a PostgresDB then this will be easy because there are various windows functions that will help you in acquiring your result with ease

But if your DB doesn’t support Windows Functions then it could be a problem.

So now Pentaho will help you reach your result in very less hustle with the help of the Analytic Query Component

Now lets build our query for one customer

sf97.customer_id as CustomerID,
co.fullname as CustomerName,
SUM(sf97.store_sales) as ProductCost,
tbd.the_month as PurchaseMonth
from sales_fact_1997 sf97
JOIN time_by_day tbd on tbd.time_id = sf97.time_id
JOIN customer co on co.customer_id = sf97.customer_id
WHERE sf97.customer_id = 4873

ORDER BY sf97.customer_id,tbd.month_of_year

Now we Open Up our PDI,  Create a new Transformation,

  • Select Table Input as your Source Table
  • Next Select the Analytic Query


  • Open up Analytic query
  • Now from our requirement we want to fetch the PreviousPurchaseMonth he/she made a purchase for each customer (What month he/she previously made a purchase)
    so the unique item here is the customerID
  • Then in the Group Field Select the customerid
  • In the Analytic Functions
    Enter the name of your new field
    Subject will be the purchasemonth Column because our new column is dependent on that

Type: This can be Lag (Behind) or Lead (Forward)
Click ok

Now right click on Analytic Query Component and Select Preview -> Quick Lunch


There you go.

Note make Sure in your query you have and order by based on your requirement.

Ehizogie Sohail Izebhijie

What is analytics?

What is analytics?

Global Data Analytics is the fastest growing sector in the word in terms of almost all types of industries be it in the field of finance, E-commerce, retail, sports, telecom or health. In today’s competitive world where everyone is so busy and impatient and wants wealth and success in no time, data analytics serves as boon.

Well, the way to take your business to the heights of success somewhere goes through analysis of company’s past records in every aspect. E.g In education institutes the admission cell always go through the past results of the candidate to judge his performance in the future. Similarly in retail store the selling count of product is useful in predicting the market growth relative to the particular product. So, no matter whatever is the industry we are analysing data in one or other way to increase the business and growth in market.

Discussing more technically we can categorise the analytics broadly in three categories:-

A) Descriptive Analysis: As the name suggests it has something to do with the description , the history like what has happened in the past. It deals with data records in files or store. The data is summarised, categorised, filtered and cleaned as per its application and targeted to the particular group of users.

B) Predictive Analysis: In this we use the statistics and calculations over old data to predict the future. Often the judgement given after observing the regular trend and pattern proves to be correct in 95% cases. You can actually play the role of astrologers, predicting future by being keen observant of data patterns and behaviours extracted from lakhs of records stored in data basis on the basis of reports generated after descriptive analysis.

By optimising and simulating the first two types of analysis the power to prescribe or suggest remedies comes in hand of Data Analysts. Now, not only prediction about what can happen but what is going to happen in different cases and what can be done as preventive measures is also known beforehand with the help of analytics. Risk analysis and risk management can be derived from same.
Market strategies and rise and fall of various industries are just the part and parcel of proper analysation of the available data.

Nisha Sahu

Data Analysis Techniques

Data Analysis Techniques    

The need and way to analyse data basically depends on the end and not on the source.That end is typically the need to perform analysis and decision making through the use of that source of data. Data analysis in these days typically include reporting, multidimensional analysis and data mining which relates to “Display”, “Analyse” and “Discover” respectively. Depending on the type of data analysis the source data’s requirement may vary.

     If reporting is required for analysis then simplest of the data source would give best results.Query and reporting capability primarily consists of selecting associated data elements, perhaps summarizing them and grouping them by some category, and presenting the results. Retrieving relevant data from the data warehouse, transforming it into the appropriate context, and displaying it in a readable format.Finally, the report is delivered to the end user in the desired output format be it graph, pie and table form in the required output medium.

     If the objective is to perform multidimensional data analysis, a dimensional data model would be more appropriate. This type of analysis requires that the data model support a structure that enables fast and easy access to the data on the basis of any of numerous combinations of analysis dimensions. For example,you may want to know how many of a specific product were sold on a specific day, in a specific store, in a specific price range.Multidimensional analysis enables users to look at a large number of interdependent factors involved in a business problem and to view the data in complex relationships. End users are interested in exploring the data at different levels of detail, which is determined dynamically. The complex relationships can be analyzed through an iterative process that includes drilling down to lower levels of detail or rolling up to higher levels of summarization and aggregation.This is a data analysis operation whereby the user takes a different viewpoint than is typical on the results of the analysis, changing the way the dimensions are arranged in the result. Like query and reporting,multidimensional analysis continues until no more drilling down or rolling up is performed.

     As said before Data mining is nothing but “Discovery”. This discovery could take the form of finding significance in relationships between certain data elements, a clustering together of specific data elements,or other patterns in the usage of specific sets of data elements. After finding these patterns, the algorithms can infer rules. These rules can then be used to generate a model that can predict a desired behavior, identify relationships among the data, discover patterns, and group clusters of records with similar attributes.Data mining is most typically used for statistical data analysis and knowledge discovery. Statistical data analysis detects unusual patterns in data and applies statistical and mathematical modeling techniques to explain the patterns. Data mining is data driven . There is a high level of complexity in stored data and data interrelations in the data warehouse that are difficult to discover without data mining. Data mining offers new insights into the business that may not be discovered with query and reporting or multidimensional analysis. Data mining can help discover new insights about the business by giving us answers to unasked questions .

     These data analysis techniques offers new insights into the business through keen look into the data for its analysis to fetching quality information, which can be be used for ultimate buisness intelligence through profitable buisness decisions in order to grace buisness upliftment and growth in different sectors.

Business Intelligence Vs Business Analytics



What’s the difference between Business Analytics and Business Intelligence? The correct answer is: everybody has an opinion, but nobody knows.

For example, when SAP says “business analytics” instead of “business intelligence”, it’s intended to indicate that business analytics is an umbrella term including data warehousing, business intelligence, enterprise information management, enterprise performance management, analytic applications, and governance, risk, and compliance.

But other vendors (such as SAS) use “business analytics” to indicate some level of vertical/horizontal domain knowledge tied with statistical or predictive analytics.

At the end of the day, there are two things worth differentiating:

  1. The first is the business aspect of BI — the need to get the most value out of information. This need hasn’t really changed in over fifty years (although the increasing complexity of the world economy means it’s ever harder to deliver). And the majority of real issues that stop us from getting value out of information (information culture, politics, lack of analytic competence, etc.) haven’t changed in decades either.
  2. The second is the IT aspect of BI — what technology is used to help provide the business need. This obviously does change over time —sometimes radically.


Quick Comparison


We know that Analytics refers to the skills, technologies, applications and practices for continuous iterative exploration and investigation of data to gain insight and drive business planning. Analytics consists of two major areas: Business Intelligence and Business Analytics.

What is often overlooked is how the two differ based on the questions they answer:

  1. Business Intelligence : traditionally focuses on using a consistent set of metrics to measure past performance and guide business planning. Business Intelligence consists of querying, reporting, OLAP (online analytical processing), and can answer questions including “what happened,” “how many,” and “how often.”
  2. Business Analytics : goes beyond Business Intelligence by using sophisticated modelling techniques to predict future events or discover patterns which cannot be detected otherwise. Advanced Analytics can answer questions including “why is this happening,” “what if these trends continue,” “what will happen next” (prediction), “what is the best that can happen” (optimization).


The Evolution of Business Intelligence vs Business Analytics

In the past, BI has been used to talk about the people, processes and applications used to access and extrapolate meaning from data, for the sake of improving decisions and understanding the effectiveness of targeted decisions. But this is where BI as a baseline failed; something that runs entirely off of static, historic data severely limits a user’s ability to make predictive decisions and forecast for the future market. When an emergent situation arises on a Friday afternoon, the user doesn’t greatly benefit from looking at metrics collected prior to the introduction of that situation.

The rapid growth and demand for BA comes from this failing, and is in a way the evolved form of BI solutions. In a business world whose speed is ever-increasing, the user needs to be able to interact with information at the speed of business, not looking back over his or her shoulder at what happened in the past. BI setups alone do not support the occurrence of users asking and answering questions in the face of marketplace events as they happen. A company that is data-driven sees their data as a resource, and uses it to hedge out competition. The more current the data the user has, the better jump he or she has on the competitor, who may or may not have become a threat in a time so recent that traditional BI data reporting wouldn’t even take them into consideration.

Many companies are commonly implementing advanced analytics on top of their data warehouses, to bridge the gap between BI and current day needs. Perhaps this is the origin of the confusion between terms, as organizations pick and choose from different combinations of services and have no real understanding of what to call these mashups.

Equally relevant is the fact that more and more people are being asked to interpret data in roles that are not strictly analytical. Product managers, marketers and researchers are moving towards data as a way to formulate strategies, and traditional BI platforms make it difficult to push data into real-time situations and what-if scenarios.  With the importance of data-driven decisions increasingly becoming a realization for less tech-savvy branches of company teams, the need for more user-friendly and faster producing platforms also grows. Moreover, delivering the data that supports these decisions to a broader company team demands a more visual form of modeling tool, to improve understanding across all departments. Charts and graphs showing BA findings are quicker and more impacting than written out statistics and excel sheets full of data.

Data interpretation and the manipulation method of choice change as the market demands. While having a set of established methods is important to the effectiveness of a company’s strategy, it’s understanding the need for flexibility in the face of these changes that can be a company’s most valuable asset.

To summarize,

WHAT is happening to your business = Business Intelligence (For Visibility)
WHY it is happening, WHAT WILL likely happen in future = Business Analytics (For Investigation, Prediction & Prescription).



About Analytical Databases

Analytics-iconAbout Analytical Databases 

Companies and Corporates that build Business Intelligence solutions for their business, often want to consolidate their data that is spread across various applications, for reporting and analysis purposes. For that, they build datawarehouses or data marts as the foundation of their enterprise-wide BI solutions. These DWs have been implemented using traditional relational databases and successfully so. But as organizations add more and more data and users to these systems, the scabiltity and performance become crucial. One of the useful contributors to solving the problems of scability / performance could be the use of Analytical Databases or Datawarehousing DBs.


What are these Analytical Databases ?

Analytics databases are optimized to be used for retrieval, aggregation and analytics. They deliver benefits in a BI environment because they are designed for high performance in this environment.

These Analytical databases, or analytical platforms, span a range of technology from appliances and columnar databases to shared nothing, massively parallel processing databases. The common thread among them is that most are read-only environments that deliver exceptional price/performance compared with general-purpose relational databases originally designed to run transaction processing applications. Analytical databases are now a central component in any BI ecosystem, providing the necessary performance and scalability to support large numbers of analytical users and growing volumes of both structured and unstructured data.

Below are a few differences between transactional and analytical DBs:

  • On-line transactional procesing (OLTP) databases are designed to host operational systems that run the business. Analytics databases step in when operational use of the data has ended, and the data are to be used for information and analysis.
  • OLTP-oriented databases, are designed for high performance in an on-line transactional environment, where additions, updates, and deletions are being continuously made to individual or small numbers of records in a random manner over time, to data scattered accross tables. In an analytics database, there are mostly bulk uploads of data, deletions and update are rare / infrequent.
  • Data integrity is critical to transactional data, OLTP databases are optimal for these.
  • OLTP databases support environments where many concurrent users are retrieving, adding, updating, and deleting data on realtime basis. Analytics databases, on the other hand, support environments where very many concurrent users are retrieving data but adds, updates, and deletes are done periodically, on scheduled basis.
  • In OLTP environments, retrievals are oriented towards single or few records where many or most columns from these records are required. In analystics databases, retrievals are oriented towards many or most records (statistical operations); few columns from these records are required.


What makes Analytical DBs faster and more scalable ?

Let’s look at some of the technologies I mentioned above to see how these make analytical DBs different. These are some but not all of the technologies that are used.

Columnar Databases :

A columnar database, also known as a column-oriented database, is a database management system (DBMS) that stores data in columns rather than in rows as relational DBMSs. The main difference between a columnar database and a traditional row-oriented database are centered around performance, storage necessities and schema modifying techniques. The goal of a columnar database is to efficiently write and read data to and from hard disk storage and speed up the time it takes to return a query. One of the main benefits of a columnar database is that data can be highly compressed allowing columnar operations — like MIN, MAX, SUM, COUNT and AVG— to be performed very rapidly. Another benefit is that because a column-based DBMS is self-indexing, it uses less disk space than a relational database management system (RDBMS) containing the same data.

Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.

Massively Parallel Processing:

MPP (massively parallel processing) is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory. An MPP system is also known as a “loosely coupled” or “shared nothing” system. MPP allows the warehouse to break up large data analytics jobs into smaller, more manageable chunks, which are then distributed to multiple processors. Analytics is run simultaneously – or in parallel – on each processor, then the results returned and synthesized.

In Memory analytics:

I don’t think I really need to elaborate on this one. They are databaes which load the source data into system memory in a compressed, non-relational format in an attempt to streamline the work involved in processing queries.

Online analytical processing (OLAP):

These are the databases, which store multidimensional “cubes” of aggregated data for analyzing information based on multiple data attributes.


This is a little bit about analytical databases and how they might add value to a DW/BI Solution.

Shraddha TambeHelical IT Solutions

Slide share – determine right analytic db

Anonymous Authentication in Pentaho

This blog will be talking about anonymous authentication in Pentaho. You can bypass the built-in security on the BA Server by giving all permissions to anonymous users. An “anonymousUser” is any user, either existing or newly created, that you specify as an all-permissions, no-login user, and to whom you grant the Anonymous role. The procedure below will grant full BA Server access to the Anonymous role and never require a login.

1. Stop the BA Server.
2. Open the /pentaho/server/biserver-ee/pentaho-solutions/system/applicationContext-spring-security.xml file and ensure that a default anonymous role is defined. Match your bean definition and property value to the example below.

<bean id=”anonymousProcessingFilter” class=””>

<!– omitted –>

   <property name=”userAttribute” value=”anonymousUser,Anonymous” />



3. Find these two beans in the same file .
o filterSecurityInterceptor
o filterInvocationInterceptorForWS
Locate the objectDefinitionSource properties inside the beans and match the contents to this code example.

<bean id=”filterInvocationInterceptor” class=””>
    <property name=”authenticationManager”>
        <ref local=”authenticationManager” />
    <property name=”accessDecisionManager”>
        <ref local=”httpRequestAccessDecisionManager” />
    <property name=”objectDefinitionSource”>
\A/.*\Z=Anonymous,Authenticated ]]> </value>


4. Save the file, then open pentaho.xml in the same directory.
5. Find the anonymous-authentication lines of the pentaho-system section, and define the anonymous user and role.

<!– omitted –>
    </anonymous-authentication> <!– omitted –>

6. Open the file in the same directory.

a) Find the singleTenantAdminUserName and replace the value with the anonymousUser name.
b) Find the singleTenantAdminAuthorityName and replace the value with Anonymous.
c) Save the file.

Open the pentahoObjects.spring.xml file.
Find all references to the bean id=”Mondrian-UserRoleMapper” and make sure that the only one that is uncommented (active) is this one:

<bean id=”Mondrian-UserRoleMapper”
    <property name=”sessionProperty” value=”MondrianUserRoles” /> </bean>

Save pentahoObjects.spring.xml and close the file.
Restart BA Server.
Enter http://localhost:8080/pentaho in browser address field. You will find that the pentaho home page opens without requiring login.

Archana Verma
Helical IT Solutions

How to solve pg_tblspc problem in postgres

This blog will talk about How to solve pg_tblspc problem in postgres

pg_tblspc problem postgres

Sometime when we are doing some operation on Postgres database, we faced this error.

I faced  it,  when I  was  trying  to  get  values  from  database.

This issue throws error that could not open directory “pg_tblspc/<1234…x>”: No such file or directory

To  solve this  issue  you  first  have  to  create  dump  of  your  database. Steps are ahead …


  • psql -h localhost -p 6666 -U postgres -d <yourDBname>-f “E:/xyz/abc.p”

(It will create dump file of your database)

  • Then delete your old database after dumping it.

(dump database <yourDBname>)

  • Then again restore it from your dump file…


Or, Its better to use your backup file while restoring..

If  problem  is  still  there then we need  to first restore that <abc.p/abc.sql> file  on  our  local  machine

  • then  again  create  (.p)  file  as  dump  file
  • And restore your postgres by using this new dump file…

You will not come across this problem again……..


Another way:-

1) Backup data with pg_dump


pg_dump -i -h localhost -p 5432 -U postgres -F c -b -v -f

“/usr/local/backup/” old_db


To list all of the available options of pg_dump, please issue following command.


  • Ø pg_dump -?


-p, –port=PORT database server port number

-i, –ignore-version proceed even when server version mismatches

-h, –host=HOSTNAME database server host or socket directory

-U, –username=NAME connect as specified database user

-W, –password force password prompt (should happen automatically)

-d, –dbname=NAME connect to database name

-v, –verbose verbose mode

-F, –format=c|t|p output file format (custom, tar, plain text)

-c, –clean clean (drop) schema prior to create

-b, –blobs include large objects in dump

-v, –verbose verbose mode

-f, –file=FILENAME output file name


2) Restore data with pg_restore


pg_restore -i -h localhost -p 5432 -U postgres -d old_db -v



To list all of the available options of pg_restore, please issue following command.


  • Ø pg_restore -?


-p, –port=PORT database server port number

-i, –ignore-version proceed even when server version mismatches

-h, –host=HOSTNAME database server host or socket directory

-U, –username=NAME connect as specified database user

-W, –password force password prompt (should happen automatically)

-d, –dbname=NAME connect to database name

-v, –verbose verbose mode


Pushpraj Kumar,

Helical IT Solutions