Get the Row Count in PDI Dynamically

People often use a data input step (such as Table Input) in Pentaho with a SELECT COUNT(*) query to get the row count, pass that value into a variable, and use it in further transformations. A more optimised way is to use one of the built-in steps Pentaho already provides. One such step is ‘Memory Group By’.

In this blog, I’ll discuss how to use the ‘Memory Group By’ step available in Pentaho to count rows.

Here is an example. Suppose you already have a query that pulls records from the database; you can then pass the rows coming from the input step to the ‘Memory Group By’ step, as shown below:

Memory_GRoup_by1

Get the ‘Memory Group by’ step from: Design -> Statistics -> Memory Group by

Memory_GRoup_by2

After that, pass the count coming out of the ‘Memory Group by’ step to a ‘Set Variables’ step.
Double-click the ‘Memory Group by’ step and, under ‘Aggregates’, set the name to ‘CR_count’.
In ‘Subject’, put the column on which the records should be grouped to produce the final count.
In ‘Type’, select ‘Number of rows (without field argument)’ from the drop-down list to get the record count.

Memory_GRoup_by3

Pass the output of the ‘Memory Group by’ step to the ‘Set Variables’ step and configure the variable there with the same field name used in ‘Memory Group by’, i.e. ‘CR_count’ in this example.

Memory_GRoup_by4

Finally, the record count can be used throughout the job by accessing this variable (e.g. ${CR_count}).

Thanks,
Nisha Sahu

CROSSTABS in Pentaho Report Designer (EXPERIMENTAL FEATURE)

Crosstabs are a relatively easy way to visualize tabular data along two or more axes. In Pentaho Report Designer they are still an experimental feature and, as such, not fully polished for BI reports. Although they can already be used to create some really useful reports, parts of the work still have to be handled in the query, because the crosstab expects normalized data.

1. Data has to be sorted to match the group structure.
2. NULL data must not be present.
3. Aggregations in cross tabs are not flexible.

1. Data has to be sorted:
The data returned by the crosstab query has to be sorted in the order Group, Rows, Columns. If it is not sorted, an error pops up saying the data is unsorted and the report is not generated.

2. NULL data must not be present:
No NULL values are allowed in the data. Otherwise the data of the following group is pulled into the current group, and the report shows wrong, inappropriate data. So NULLs have to be handled by replacing them, for example with 0.

3. Aggregations in crosstabs are not flexible:
The built-in aggregations (max, min, sum, etc.) cannot calculate percentages; for any special calculation a function has to be used. This cannot be handled in the attributes with regular expressions.

Thanks
Asha Bidiyasar

Scheduling Jobs in Pentaho Data Integration

The following are the steps to schedule a kettle job:

In the main Kettle job (.kjb) there is a START entry. Double-click its icon and a job scheduling window pops up, as follows:

By default, the type is No Scheduling and the time options are disabled.

1

If we select the type Interval, the interval-related time options are enabled, as shown in the image below:

2

If the type is Daily, then the respective ‘Time of day’ option is enabled, as shown in the image:

3

If the type is Weekly, then ‘Time of day’ and ‘Day of week’ are enabled.

4

If the type is Monthly, then both ‘Time of day’ and ‘Day of month’ are enabled, as shown in the image below:

5

If you want the schedule to repeat, enable the Repeat option in the job scheduling window.

Thank you

Lalitha

Merge Join Vs Stream lookup in Pentaho DI

Merge Join:

It joins two data sets coming from two input steps (for example, two Table Inputs). The following types of joins are available in this step:

  • FULL OUTER: all rows from both sources will be included in the result, with empty values for non-matching keys in both data streams
  • LEFT OUTER: all rows from the first source will be in the result, with empty values for non-matching keys in the second data stream
  • RIGHT OUTER: all rows from the second source will be in the result, with empty values for non-matching keys in the first data stream
  • INNER JOIN: only rows having the same key in both sources will be included in the result

Note: This step expects the incoming rows to be sorted on the specified key fields. Sorting them beforehand with a Sort rows step works fine.
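To make these concrete, here is a tiny worked example with illustrative data: stream A contains keys {1, 2} and stream B contains keys {2, 3}, joined on that key.

  • INNER JOIN: only key 2 appears in the output.
  • LEFT OUTER: keys 1 and 2 appear; key 1 gets empty values for stream B’s fields.
  • RIGHT OUTER: keys 2 and 3 appear; key 3 gets empty values for stream A’s fields.
  • FULL OUTER: keys 1, 2 and 3 all appear; non-matching keys get empty values from the other stream.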

 Stream Lookup:

The Stream lookup step type allows you to look up data using information coming from other steps in the transformation. The data coming from the Source step is first read into memory and is then used to look up data from the main stream.
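Conceptually, the step builds a map in memory from the lookup stream and then probes it for every row of the main stream. The sketch below only illustrates that idea (the field names and values are made up); it is not Kettle code:

import java.util.HashMap;
import java.util.Map;

public class StreamLookupSketch {
    public static void main(String[] args) {
        // 1. The lookup stream is read completely into memory first.
        Map<Integer, String> lookup = new HashMap<>();
        lookup.put(1, "Alice");
        lookup.put(2, "Bob");

        // 2. Each row of the main stream is then enriched from the in-memory map.
        int[] mainStreamKeys = {1, 2, 3};
        for (int key : mainStreamKeys) {
            // The step lets you configure a default value for non-matching keys; null here.
            String name = lookup.getOrDefault(key, null);
            System.out.println(key + " -> " + name);
        }
    }
}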

Preserve memory:  Encodes rows of data to preserve memory while sorting. (Technical background: Kettle will store the lookup data as raw bytes in a custom storage object that uses a hashcode of the bytes as the key. More CPU cost related to calculating the hashcode, less memory needed.)

Key and value are exactly one integer field: Preserves memory while executing a sort. Note: Works only when “Preserve memory” is checked. Cannot be combined with the “Use sorted list” option.
(Technical background: The lookup data is stored in a custom storage object that is similar to the byte array hashmap, but it doesn’t have to convert to raw bytes. It just takes a hashcode of the long.)

Use sorted list: Enable to store values using a sorted list; this provides better memory usage when working with data sets containing wide rows. Note: Works only when “Preserve memory” is checked. Cannot be combined with the “Key and value are exactly one integer field” option. (Technical background: the lookup data is put into a tuple and stored in a sorted list. Lookups are done via a binary tree search.)

Thank you

Lalitha

Memory Group By Vs Group By in Pentaho DI

Group By and Memory Group By: these steps are used to calculate aggregate values for groups of fields.

Even though both serve the same purpose, there are some differences and similarities, as follows:

1) Group By works correctly only when the input rows are sorted; otherwise it only gives accurate results for consecutive records with the same key. So the data should be sorted before using the Group By step. Memory Group By, on the other hand, also handles unsorted records: it works out the groups internally and gives the output based on the selected aggregate option.

2) Group By has an ‘Include all rows’ option. If you enable it, the step outputs all records, not just the aggregated ones; to distinguish the two types of rows in the output, you have to provide a boolean flag column. Memory Group By does not have this option.

3) Group By has a ‘Temporary files directory’ option, which becomes available when ‘Include all rows’ is enabled. It selects the directory in which temporary files are stored when the number of grouped rows exceeds 5000; the default is the system’s standard temporary directory. Because this option allows the step to spill data to disk, Group By is the preferred choice for large amounts of data. Memory Group By cannot handle huge amounts of data, because it keeps everything in memory, which becomes a performance issue; it works fine for small amounts of data only. So when the number of rows is too large to fit into memory, use the combination of the Sort rows and Group By steps.

4) Group By has one more option, ‘Add line number, restart in each group’: enable it to add a line number that restarts at 1 in each group.

5) Always give back a result row: this option is the same in both the Group By and Memory Group By steps. If you enable it, the step always returns a result row, even if there is no input row. This can be useful if you want to count the number of rows; without this option you would never get a count of zero (0).

6) The fields that make up the group: this option also works the same in both steps. After retrieving fields using the Get Fields button, designate the fields to include in the group.

7) Aggregates: both the Group By and Memory Group By steps offer the same aggregate functions. Specify the fields that must be aggregated, the method, and the name of the resulting new field.
Here are the available aggregation methods:

  • Sum
  • Average (Mean)
  • Median
  • Percentile
  • Minimum
  • Maximum
  • Number of values (N)
  • Concatenate strings separated by , (comma)
  • First non-null value
  • Last non-null value
  • First value (including null)
  • Last value (including null)
  • Cumulative sum (all rows option only!)
  • Cumulative average (all rows option only!)
  • Standard deviation
  • Concatenate strings separated by <Value>: specify the separator in the Value column
  • Number of distinct values
  • Number of rows (without field argument)

Thank You

Lalitha

Implementation of Security in Pentaho using Dynamic Schema Processor

In my earlier post, I explained what a Dynamic Schema Processor (DSP) is and why we need it. Now suppose you have a Mondrian schema uploaded to the BA Server and you would like to modify it at run time. For instance, in our project we change it for every user authenticated to the platform in order to apply security restrictions. This modification limits the records that users are able to see in reports based on that Mondrian schema. So, basically, a DSP is used to implement security on the uploaded Mondrian schema.

1. How to implement DSP?

The solution is to create a JAR containing a class that extends the Dynamic Schema Processor from the Mondrian library. I will explain the implementation of a DSP using an example:

  • Pentaho Server: Pentaho BA Server 6
  • Database Name: MySQL/sampledata
  • Schema Name: Sales

The aim is to show data based on the country to which the logged-in user belongs. Below is the result set of users belonging to various countries.

Resultset

Now, on top of it, a Mondrian schema is created using Schema Workbench.

schema

  • Original Schema:

Originalschema

Let’s upload the schema to the BA Server. Click the Manage Data Sources button on the Home page, then choose Import Analysis from the drop-down menu next to the New Data Source button. Select the file with our Mondrian schema and the data source that links to our database. Alternatively, we can publish it from Schema Workbench using the Publish option in the File menu.

Now the schema is uploaded on the server. The Mondrian cube can be accessed using the Pentaho Analyzer, jPivot or Saiku plug-ins, and at this point the schema displays the data for all dimensions and facts.

  • Modifying the schema:

Now let’s modify the schema slightly by adding a SQL element to the XML where the CUSTOMER_W_TER table is referenced (you could also inject this from the DSP itself, but for now let’s keep it in the schema).

 

modifiedSchema

Note: The ‘%PLACE_HOLDER%’ token is replaced at run time with a WHERE condition built from the query defined in the properties file. It is basically the column on which you want to filter the data; here it is the country to which the user belongs. Make sure there is a user table and that it has some relationship with the column on which we are filtering the data.

2. Creating Dynamic Schema Processor:

Let’s create the Dynamic Schema Processor that will modify the Mondrian schema file so that when we use the Market dimension, only the records for the user’s country are displayed. To implement this, we need to create a class that implements the DynamicSchemaProcessor interface (in practice we extend FilterDynamicSchemaProcessor or LocalizingDynamicSchemaProcessor, which already implement it). Below is the code that extends the DynamicSchemaProcessor interface:

 

DSPCode
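Since the code is only visible as a screenshot, here is a minimal sketch of such a class. It reuses the names from this post (ChangeDynamicSchemaProcessor, Connectiontest, getdetails, %PLACE_HOLDER%) and assumes a properties file called dsp.properties and a session attribute for the role; adjust the names, property keys and role lookup to your own setup, and check the filter() signature against the Mondrian version shipped with your BA Server:

import java.io.InputStream;
import java.util.Properties;

import mondrian.olap.Util;
import mondrian.spi.impl.LocalizingDynamicSchemaProcessor;
import org.pentaho.platform.api.engine.IPentahoSession;
import org.pentaho.platform.engine.core.system.PentahoSessionHolder;

public class ChangeDynamicSchemaProcessor extends LocalizingDynamicSchemaProcessor {

    @Override
    public String filter(String schemaUrl, Util.PropertyList connectInfo, InputStream stream)
            throws Exception {
        // Let the parent class read the schema file into a string first.
        String originalschema = super.filter(schemaUrl, connectInfo, stream);

        // The logged-in user, taken from the Pentaho session.
        IPentahoSession session = PentahoSessionHolder.getSession();
        String userName = session.getName();

        // Load the properties file kept in WEB-INF/classes (db connection + filter query).
        Properties props = new Properties();
        try (InputStream in = getClass().getClassLoader().getResourceAsStream("dsp.properties")) {
            props.load(in);
        }

        // How you obtain the user's role is environment specific (e.g. via Spring Security);
        // a session attribute is used here only to keep the sketch simple.
        Object roleAttr = session.getAttribute("role");
        String role = roleAttr == null ? "" : roleAttr.toString();

        // Ask the helper class for the WHERE condition for this user.
        String filter = new Connectiontest().getdetails(props, userName, role);

        // Swap the placeholder in the schema for the actual condition.
        return originalschema.replace("%PLACE_HOLDER%", filter);
    }
}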

The variable filter receives the WHERE condition built from the properties file, which is also kept in a server folder. The function that fetches these details is defined in another class:

DSPCode1

DSPCode3
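Again, only as a rough sketch of what the screenshots show; the property keys (db.url, db.user, db.password, filter.query), the role names and the CUSTOMER_W_TER.COUNTRY column are illustrative assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Properties;

public class Connectiontest {

    // Returns the WHERE condition that the DSP splices into the schema
    // in place of %PLACE_HOLDER% for the given user.
    public String getdetails(Properties props, String userName, String role) throws Exception {
        // Administrators see everything: "1=1" keeps the SQL valid but filters nothing.
        if ("Administrator".equalsIgnoreCase(role) || "Admin".equalsIgnoreCase(role)) {
            return "1=1";
        }

        // e.g. filter.query = SELECT country FROM users WHERE username = ?
        String query = props.getProperty("filter.query");

        try (Connection con = DriverManager.getConnection(
                     props.getProperty("db.url"),
                     props.getProperty("db.user"),
                     props.getProperty("db.password"));
             PreparedStatement ps = con.prepareStatement(query)) {
            ps.setString(1, userName);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    // Condition injected into the schema for this user's country.
                    return "CUSTOMER_W_TER.COUNTRY = '" + rs.getString(1) + "'";
                }
            }
        }
        // Unknown user: match nothing rather than expose all rows.
        return "1=0";
    }
}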

 

3. How does the code work?

  • Once a user logs into the server, the class “ChangeDynamicSchemaProcessor” is invoked; the schema is passed into the code and assigned to the variable “originalschema”.
  • All the properties are loaded from a properties file.
  • The properties file configures the DB connection, the WHERE clause to be added to the schema, and the query that returns the country for that particular user.
  • Once the properties are loaded, the class calls the function “getdetails” (defined in the Connectiontest class, which is instantiated first), passing the properties along with the username and the role of that particular user.
  • Inside the Connectiontest class, the user’s role is checked. If the role is Admin, “1=1” is returned; otherwise the query defined in the properties file is executed and returns the country name for that particular user.
  • Once the country name is returned, control goes back to the main class, where the schema is modified and the WHERE condition is substituted for the “%PLACE_HOLDER%” token.
  • With the schema modified, only the data for that particular country is displayed.

 

4. Configuring Dynamic Schema Processor on BA Server:

Once the code is complete, package the class into a JAR so it can be deployed on the server. To use the newly created Dynamic Schema Processor, add the JAR to /biserver-ee/tomcat/webapps/pentaho/WEB-INF/lib/. Then edit the configuration of the schema in the BA Server to specify that this particular DSP should be used for it: go to Manage Data Sources again, select the schema, click Edit in the drop-down menu, and add a new parameter called DynamicSchemaProcessor whose value is the fully qualified name of your class (see the picture below). The properties file also needs to be added to /biserver-ee/tomcat/webapps/pentaho/WEB-INF/classes. Once complete, restart the server.

DatasourceLocation

Add the class name to the parameter list so that the class is called whenever the cube is queried. Please see the image below:

datasourceConfig

Now if we create the same report we will get different results than in the beginning.

 

5. Configuring the Saiku analytics to work with DSP:

The above configuration works with Pentaho Analyzer. In the Community Edition, however, the Saiku plugin is required to run the analysis, and since Saiku has its own Mondrian engine, we need to configure it as well.

To make it work, add the JAR file to Saiku’s lib folder at pentaho-solutions/system/saiku/lib. Once the JAR is in place, one more thing is needed to make the plugin work.

A file named plugin.spring.xml in the pentaho-solutions/system/saiku folder contains the configuration property for the dynamic schema processor, commented out. We just need to uncomment it so that Saiku activates the DSP in its engine, as shown below:

SaikuConfig

Once this is done, restart the BA Server and Saiku Analytics will work with the DSP.

6. Checking on the errors and logging:

The problem with a DSP is that if there is any error, the cube will not be displayed on the BA Server at all. To verify and check the errors, look directly into the log files in the Pentaho Server’s log folder. If there is an issue in the code, the cube simply will not appear; if the code is working properly, the modified schema is written to the pentaho.log file.

7. Disadvantages and things to be taken care of:

The Dynamic Schema Processor is an added capability that Pentaho provides so that we can have enhanced security on reports. However, there are some disadvantages to a DSP:

  • DSP doesn’t work with jPivot: in the Pentaho Server Community Edition, jPivot is used for analysis reports, but the DSP does not apply to jPivot reports; despite a lot of research we could not get it to work. Instead, configure the Saiku plugin from the Marketplace and use the DSP with it on the CE server.
  • Performance issues: if the data is huge, the report may run slowly, since every request goes through the underlying class.
  • Specify the class name correctly: even a minor mistake and the cube will not be displayed on the BA Server.
  • It is advisable to extend the LocalizingDynamicSchemaProcessor class rather than FilterDynamicSchemaProcessor, because in some cases the latter does not work.

 

8. Conclusion:

The DSP is a great facility that Pentaho provides for reports. It helps us maintain enhanced security and can be applied to cubes, PRPT reports and interactive reports. The reports are modified dynamically when a user logs into the BA Server.

Introduction to Dynamic Schema Processor: Modifying a Mondrian Schema at Runtime

What is Mondrian?

Mondrian is an open source OLAP server written in Java. It supports the MDX (MultiDimensional eXpressions) query language and the XML for Analysis (XMLA) and olap4j interface specifications. It reads from SQL and other data sources and aggregates data in a memory cache.

What is Dynamic Schema Processor?

A Dynamic Schema Processor is used to change a Mondrian schema dynamically at runtime. If a Mondrian schema is uploaded to the BA Server and you would like to modify it at runtime, you need to implement a Dynamic Schema Processor. For instance, say we have to show only the data related to the user logging into the server; then we have to modify the schema dynamically to achieve this. This modification limits the records that users are able to see in reports based on that Mondrian schema.

Why do we need Dynamic Schema Processor(DSP)?

Mondrian allows us to define roles in the schema itself: we can add various roles and restrict the data based on them. But the restriction cannot be added dynamically, the process is lengthy, and it is not advisable when there are many users. Hence Mondrian provides the Dynamic Schema Processor, which changes the schema at runtime. To use it, we create a class that implements the DynamicSchemaProcessor interface (in practice we extend FilterDynamicSchemaProcessor or LocalizingDynamicSchemaProcessor, which already implement it), upload the JAR to the server, and reference the class in the configuration of the uploaded schema.

Architecture of Dynamic Schema Processor:

DSP_Architecture

 

Above is the basic flow of how the Dynamic Schema Processor works in the Pentaho Server:

  1. The user logs in to the Pentaho BA Server.
  2. The server internally loads the JAR that was deployed to it and invokes the DSP.
  3. The DSP class is called.
  4. The session variable provides the username, which is passed into the SQL query.
  5. The username is passed to the SQL query defined in the properties file (uploaded to the classes folder), which fetches the corresponding ID from the user table.
  6. The query is executed and the ID of the corresponding user is fetched.
  7. The DSP returns the ID to the schema.
  8. The user then accesses the Analyzer report inside the BA Server.
  9. The cube calls the DSP class (the class name is defined inside the data source configuration).
  10. The DSP modifies the cube by adding a WHERE condition to the schema, filtering the data based on the ID of the corresponding user.
  11. The data for that particular ID is generated.
  12. The Analyzer report is generated for the particular user who is logged in.

Exception: for the Admin user, the WHERE clause is not added to the schema, so the user has access to all the data.

Conclusion

So this is how the Dynamic Schema Processor works in a Pentaho server. It is an added feature that provides enhanced security on the data shown in reports: the reports are modified dynamically based on the user who logs into the server.

Regards,

Nitish Kumar Mishra

Issues while implementing loops in Pentaho Data Integration

Generally, to implement batch processing we use the looping concept provided by Pentaho in its ETL jobs. Loops in PDI are supported only in jobs (.kjb); they are not supported in transformations (.ktr).

While implementing loops in PDI, we came across many blogs suggesting the use of a “Wait for” step with its output hop joined back to a previous step. Look at the screenshot below for clarification.

However, the limitation of this kind of looping is that it causes recursive stack allocation in the JVM during job execution, and the system may run out of memory after a high number of iterations (depending on the available system memory). When that happens, the JVM runs out of memory and the program crashes. So it is not advisable to run a high number of iterations when implementing loops in PDI.

Possible Solutions:

1. The first thing to do is minimize the number of iterations. The looping works properly up to about 500 iterations, so try to keep it below that.

2. Never use loops for scheduling. If the looping concept is used for scheduling, it turns into an infinite loop, which eventually crashes the whole program.

3. Increase your batch size so that the number of iterations is smaller. Take this into consideration when implementing external batch processing.

4. For incrementing the counter, it is advisable to use a separate transformation instead of a JavaScript step, because the JavaScript step consumes more memory. Create a new transformation, use the Formula step to increment the value, and then set the variables.

5. Suggested approach for infinite looping: one possible way is to use the settings of the ‘Start’ step. Set the ‘Repeat’ flag and add an interval configuration. This causes the job to be completely re-initialized as a new instance and does not cause any memory issues.

 

Thanks,

Nitish Kumar Mishra

Handling Null values in Talend and Pentaho

Handling Null values in Talend:

Make all the columns nullable in tMap.

Then, for the columns that can contain null values, you can assign default values by writing a ternary operator, as shown in the screenshot below.

talend_tmap
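Talend generates Java, so a tMap output expression is just a Java ternary. Here is a minimal sketch of the pattern (the row and column names, such as row1.amount, are illustrative):

public class NullDefaultSketch {

    // Same idea as tMap output expressions like:
    //   row1.amount == null ? 0.0 : row1.amount
    //   row1.customerName == null ? "UNKNOWN" : row1.customerName
    static Double defaultIfNull(Double value, double fallback) {
        return value == null ? fallback : value;
    }

    public static void main(String[] args) {
        System.out.println(defaultIfNull(null, 0.0));  // prints 0.0
        System.out.println(defaultIfNull(42.5, 0.0));  // prints 42.5
    }
}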

Handling Null values in Pentaho:

In Pentaho, null values are handled with a step that provides a default value for them:

The step is called ‘If field value is null’.

You can assign a default value for fields that are null, either per field or per data type.

You can configure the step as shown below:

pentaho_null

Thank you.

Lalitha

Implementing Loops and batch processing in Kettle

What is batch processing?
Batch processing is basically the execution of a series of jobs that move data from a source to a destination. The data is divided into smaller batches and transferred batch by batch. This helps with faster execution as well as with memory management, and it is a popular technique in ETL processes where the data is generally very large.

Creating loops in PDI:
Let’s say you want to implement a for loop in PDI to send 1 million (10 lakh) records in batches of 100. Looping is tricky in PDI because it can only be implemented in jobs, not in transformations, as Kettle does not allow loops in transformations.
So, to implement loops in PDI, you need a main job that calls a sub-job where the loop is designed, plus a transformation that contains the logic to send data from source to destination.

Below is the overall design of the sub-job that implements the batch processing and the looping.

Pic1

 

NrBatches: This transformation prepares the batches: it counts the total number of rows in the source table and, based on that count, works out how many batches are needed. Kindly check the screenshots below for more details.

NrBatches Transformation design:

Pic2

 

The Table Input step generates the row count and the minimum value of the primary key (ID) from the source table. The Modified Java Script Value step assigns the batch size and computes the number of batches from the batch count (the table’s row count) and the batch size. Kindly look at the screenshot below for better understanding; a small sketch of the arithmetic follows it.

Pic3
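For reference, the arithmetic behind this screenshot is just a ceiling division; the post does it in a Modified Java Script Value step, and the sketch below shows the same calculation in Java with illustrative values:

public class NrBatchesSketch {
    public static void main(String[] args) {
        long batchCount = 1_000_000; // row count returned by the Table Input
        long minId      = 1;         // min(ID) returned by the Table Input
        long batchSize  = 100;       // chosen batch size

        // Number of batches = ceiling(batchCount / batchSize).
        long nrBatches = (batchCount + batchSize - 1) / batchSize;

        System.out.println("batch_size = " + batchSize);
        System.out.println("nr_batches = " + nrBatches);
        System.out.println("min_id     = " + minId);
    }
}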

 

The Set Variables step in the NrBatches transformation sets the values of the batch size, batch count and number of batches so that they can be used in the other transformations.

Table Exists: This step checks whether the destination table is present. If it is not, an SQL statement is executed to create the table before moving on to the next step; if it already exists, the job simply goes to the next step.

Set variables: The Set Variables entry in the job sets the incrementer value (just like the i counter in a for loop). Its initial value is set here. Kindly refer to the screenshot below for clarification.

Pic4

 

Simple Evaluation:

This step is the equivalent of a conditional check where the iterator is tested against a continue condition (much like if (i < 50) or while (i < 50)). Here, the variable set in the previous step is checked against the total number of batches (computed in the NrBatches transformation). Kindly check the screenshot below for clarification.

Pic5

 Transformation Step:

Inside the transformation, set the flags, i.e. the start and end flags, using a JavaScript step. Please see the screenshot below for how to set the flags, and the sketch after it for the underlying arithmetic. Alternatively, you can use the Formula step or the Calculator step, which is a better option if you have a large amount of data.


Pic6
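A rough sketch of what the flags amount to for one iteration; the variable names and the WHERE clause are illustrative (in the job they come from the Set variables and JavaScript/Formula steps and are referenced in the Table Input as ${...} variables):

public class BatchWindowSketch {
    public static void main(String[] args) {
        long minId     = 1;   // set by the NrBatches transformation
        long batchSize = 100; // set by the NrBatches transformation
        long batchNo   = 0;   // the loop counter, incremented on every iteration

        // Start and end flags for the current batch.
        long startFlag = minId + batchNo * batchSize;
        long endFlag   = startFlag + batchSize;

        // The inner transformation's Table Input would then use something like:
        System.out.println("SELECT * FROM source_table WHERE id >= " + startFlag
                + " AND id < " + endFlag);
    }
}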


Increment Step:

Since we are using a loop, there has to be an increment. Here we used a JavaScript step to increment batch_no (see the Set variables step), which completes the looping mechanism. Here too you can use another transformation instead of the JavaScript step, which processes faster and consumes less memory. Please see the screenshots below.

-Javascript Step:

Pic7

-Using Another Transformation:

BatchProcessIncrement

The Formula step is used to increment the batch number. See the screenshot below for clarification.

Formula

So this is how looping works in Kettle. There are some limitations, though: this technique consumes a lot of memory if the number of iterations is very high, so use it only where the number of loop executions is low. That will help keep performance acceptable.

Thanks,

Nitish