Types of Dimensions in Data warehouse

What is Dimension?

A dimension table contains descriptive data about the business. The primary keys of dimension tables are referenced in fact tables through foreign key relationships, and the remaining columns of the dimension hold the descriptive attributes of the business objects.

E.g.: Product, Customer, Orders, Company, Date, etc.

Below are the different types of Dimensions:

1) Slowly Changing Dimensions (SCD): Dimensions that change slowly over time rather than on a regular schedule.

Below are some popular approaches to handling SCDs:

Type 0: This is the passive method. Records, once inserted, are never changed.

Type 1: This method overwrites the old record with new values without tracking any historical data.


Type 2: This method tracks historical data by adding a version number or start-date/end-date columns. The drawback is that the table grows vertically and requires more space.

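As a rough illustration (a minimal sketch with a hypothetical customer dimension and column names), a Type 2 change is typically applied by closing the current row and inserting a new version:

  -- Close the current version of the changed customer
  UPDATE dim_customer
  SET end_date = CURRENT_DATE, current_flag = 'N'
  WHERE customer_id = 101 AND current_flag = 'Y';

  -- Insert the new version with an open-ended end date
  INSERT INTO dim_customer (customer_sk, customer_id, city, start_date, end_date, current_flag)
  VALUES (2001, 101, 'Hyderabad', CURRENT_DATE, '9999-12-31', 'Y');

A Type 1 change, by contrast, would simply be an UPDATE of the existing row with no new version.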

Type 3: This method also tracks historical data, by adding a column that holds the previous value. It preserves only limited history.


Type 4: This method uses a separate history table to store all the historical data, while the main table holds only the current data. Both tables are linked by the same surrogate key.


Type 6: This is called the hybrid method and combines Type 1 + Type 2 + Type 3. It uses flag values and effective dates to differentiate between old and new records.

2) Rapidly Changing Dimensions: Dimensions that contain rapidly changing attributes. If we maintain historical data directly in such tables, we will certainly run into storage and performance issues. The solution is to move the volatile attributes into mini dimension tables for the historical data, similar to a Type 4 SCD: the main table holds the current values and the mini dimensions hold the history.

3) Junk Dimensions: In data warehouse design we often come across low-cardinality flag and indicator values. We can combine them into one single junk dimension table, so that the fact table does not need a separate column to store the primary key of each flag table.

E.g.: [Image: a Gender_Marital_Status junk dimension listing every combination of the Gender and Marital Status flags]

As shown above, if we use a combined Gender_Marital_Status junk dimension, the fact table needs only one column to reference it.
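
A minimal sketch of such a junk dimension (hypothetical names and values):

  CREATE TABLE dim_gender_marital_status (
    junk_sk        INT PRIMARY KEY,
    gender         CHAR(1),       -- 'M' / 'F'
    marital_status VARCHAR(10)    -- 'Single' / 'Married'
  );

  -- Pre-populate every combination once; the fact table then stores only junk_sk
  INSERT INTO dim_gender_marital_status VALUES
    (1, 'M', 'Single'), (2, 'M', 'Married'),
    (3, 'F', 'Single'), (4, 'F', 'Married');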

4) Inferred Dimensions: When a dimension record needed by a fact row is not yet available, we can insert a placeholder row with dummy details for that ID and use its key in the fact table. Once the real details arrive, we update the dimension record.
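
A minimal sketch of the inferred-member pattern (hypothetical table and column names):

  -- Insert a placeholder (inferred) customer so the fact row can be loaded now
  INSERT INTO dim_customer (customer_sk, customer_id, customer_name, inferred_flag)
  VALUES (999, 'C-1042', 'UNKNOWN', 'Y');

  -- Later, when the source provides the full record, update it in place
  UPDATE dim_customer
  SET customer_name = 'Acme Traders', inferred_flag = 'N'
  WHERE customer_sk = 999;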

5) Conformed Dimensions: Dimensions that are shared by multiple fact tables are known as conformed dimensions.

E.g.: the Customer and Product dimensions are required by the Shipment fact, Sales fact and Service Request fact.

6) Degenerate Dimensions: A degenerate dimension does not have its own dimension table; it is a column that is part of the fact table, dimensional in nature, but does not map to any dimension table.

E.g.: the Invoice_Number column in a fact table.

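A minimal sketch (hypothetical schema): the invoice number sits directly on the fact table, and there is no dim_invoice table to join to.

  CREATE TABLE fact_sales (
    date_sk        INT,
    product_sk     INT,
    customer_sk    INT,
    invoice_number VARCHAR(20),   -- degenerate dimension
    sales_amount   DECIMAL(12,2)
  );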

7) Role-Playing Dimensions: The same physical dimension used for multiple purposes.

E.g.: the Date dimension can be used as Date of Sale, Date of Hire, etc.
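
A minimal sketch of one dim_date playing two roles through table aliases (hypothetical fact and column names):

  SELECT f.sales_amount,
         order_date.calendar_date AS date_of_sale,
         ship_date.calendar_date  AS date_of_shipment
  FROM fact_sales f
  JOIN dim_date order_date ON f.sale_date_sk = order_date.date_sk
  JOIN dim_date ship_date  ON f.ship_date_sk = ship_date.date_sk;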

8) Shrunken Dimensions: A shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the shrunken dimension.

9) Static Dimensions: Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with Status codes — or it can be generated by a procedure, such as a Date or Time dimension.

Thank You

Lalitha

Helical Insight for ETL Tester (QA)

Let’s take an example: your ETL developer has just written an ETL script which loaded data into the data warehouse, and a few million records got inserted.

For QA, it is extremely difficult and time consuming to test all (millions) of the records. Moreover, there are many other problems, such as:

  • Some testers might not be well versed in writing complex SQL queries.
  • There might be multiple data sources involved in generating the data warehouse.
  • Writing multiple queries on multiple data sources and then combining them into one single result set would be a sweat-breaking task.

Now think of Helical Insight as a friend that helps you with most of these tasks, allowing you to focus on critical thinking rather than just doing stare-and-test the whole day. Wondering how Helical Insight can help you achieve this?

Here is your answer,

  • Helical Insight is capable of communicating with any of your data source(s), such as CSV, Excel, MySQL, Oracle, SQL Server, Postgres, etc.
  • Helical Insight has a drag-and-drop feature to make your test cases ready without prior knowledge of SQL. It generates the SQL queries, including the joins between multiple entities. Because it understands your data structure, it can generate the query even when you work with different data source(s), a DW or a normalized database.
  • Using this drag-and-drop facility you can build all your test cases in the form of reports. For example, a report that counts the records in a dimension: save this report and compare it with a similar report created against the source system (see the sketch after this list). If it is a migration project the counts should definitely match; otherwise we know something is wrong with the ETL script.
  • Once you have saved these reports (test cases) you can re-execute them again and again to verify your data. By now you can see that Helical Insight is a very useful tool, with features like generating the SQL query for any database or data source(s), executing queries on any database or data source(s), and saving the reports / test cases / SQL queries for future use.
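
For illustration, the kind of reconciliation such a test case boils down to is a count comparison between source and warehouse; a rough SQL sketch (hypothetical schema and table names; in practice Helical Insight generates the queries for you through drag and drop):

  SELECT 'source' AS side, COUNT(*) AS row_count FROM crm.customers
  UNION ALL
  SELECT 'warehouse' AS side, COUNT(*) AS row_count FROM dw.dim_customer;
  -- In a migration project the two counts should match exactly.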

Processing Multi schema files in Talend

Sometimes we get input files that contain data in different schema formats within a single file. We can process such files and split the data schema-wise by using the tFileInputMSDelimited component.

Raw Data:

Here is a sample raw data file, which contains 3 different schema types.

[Screenshot: sample raw data file]

If you observe the first column, it represents the type of each record:

ORD means it is Order details

CUS means it is Customer Details

and TRS means it is Transactions.

All these 3 types exist in the same file, but our requirement is to get each of the 3 types of data in a separate output.
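
Purely as a hypothetical illustration of such a multi-schema layout (the actual content of the file is in the screenshot above), the rows might look something like this, with the record-type code in the first column:

  ORD;1001;2014-05-12;CUS01;450.00
  CUS;CUS01;John Mathew;Hyderabad
  TRS;T-9001;1001;CARD;450.00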

Job:

Create the below job to get separate files/outputs for each schema type

In the palette you will find the tFileInputMSDelimited component; drag it onto the work area.

Now open the Component tab to configure it as shown below:

[Screenshot: tFileInputMSDelimited component settings]

Click on the Multi Schema Editor and configure it as shown in the screenshot below:

[Screenshot: Multi Schema Editor configuration]

If the column separators are different for each schema, you can enable the Use Multiple Separators option.

After getting the preview, click on Fetch Codes.

[Screenshot: Fetch Codes result]

Now take the 3 output components you need. I selected tLogRow components to display the results on the console.

Execute the job:

[Screenshot: job execution]

Now check the output: you will get three different tables based on the schema type.

Output:

[Screenshot: console output showing the three schema types]

Thanks and Regards,

Lalitha

Staging DB in Datawarehousing – The WHYs and WHY NOTs

Whenever we talk about designing data warehouses, a staging database is one consideration we have to make, and whether you choose to have one depends on many factors. I feel that when we start small, for example implementing small data marts, we may not see the need for a staging database, but as DW implementations grow in size and complexity, staging DBs have a lot of benefits to offer. Staging DBs are often the place to dump source data, the workplace to create temporary intermediate tables for processing the transformation steps of ETLs, to keep what is good, to reject what is bad, and so on.

Let's quickly look through the PROs and CONs of having staging DBs in DW implementations:

 
The WHYs

Source data accumulation: If data is being pulled from multiple source systems and not all data is available at the same time, we may need to accumulate all the raw data pulled at different times and then integrate/merge and load it.

Data transformation, cleansing, data alignment: Staging DB may be the workplace for the ETL to do the required cleansing, apply transformation rules and perform alignment of data from various systems.

Decoupling of Source from Datawarehouse: Staging DB helps decouple the Source systems from the DW. Source systems are accessed only by ETL jobs which pull data into staging, or they may even choose to push data there. This greatly reduces the amount of time that source systems are accessed by DW loading jobs, reducing contention too.

ETL Performance: Loading staging tables is simple, as they are mostly exact replicas of the source tables, minus the indexes / foreign key relationships. As most of the joins, cleansing and calculations are done on staging tables and may be stored in intermediate temporary tables, loading the DW from staging tables is also simpler. This greatly increases ETL performance as well as the quality of the DW data.

Handling failure, debugging, restart, recovery: As staging DBs have snapshots of source data, can have extensive logging, hold the intermediate tables used for data transformations, etc., they make it easier to debug errors and failures. In case of restarts after failures, the step of pulling the source data may not have to be redone.

Data reconciliation / pre- and post-load validations: This is very similar to the point above. Staging DBs are excellent for capturing auditing information, which can be used to reconcile data, validate data and reject bad records, and make sure the data we loaded is correct and consistent with the source, with nothing missed out.

Change Data Capture: The staging DB is also used as one of the mechanisms for CDC. For this, the copy of the source data tables from the previous load is preserved until the next execution. The current and previous source data are then compared to identify changes. This is mostly used in scenarios where the source systems have no means of supporting change detection themselves.
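
A minimal sketch of this snapshot-comparison style of CDC (hypothetical staging tables stg_customer and stg_customer_prev holding the current and previous extracts; compare whichever columns you track):

  SELECT cur.customer_id
  FROM stg_customer cur
  LEFT JOIN stg_customer_prev prev ON prev.customer_id = cur.customer_id
  WHERE prev.customer_id IS NULL               -- new record
     OR prev.city <> cur.city                  -- changed attribute
     OR prev.customer_name <> cur.customer_name;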

Simpler, more maintainable ETLs: With staging, we divide the Extract-Transform-Load process into separate steps. Combining the extraction, cleansing, transformation logic, pre-calculations and summarizations into one single ETL makes it very complex and also a single point of failure. With a staging step, the ETLs are smaller, simpler and easier to maintain.

Prevent orphan records / race conditions: When we fetch data from the source directly and load it into the DW, there is a chance we pull partial updates happening in the transactional system, or new data may be getting inserted as we pull, causing race conditions. With staging, the data from the source is pulled in one single go, preventing such problems.

Backup of source: Staging DBs, if archived/retained for some specific period of time, may act as a backup of the source data which was loaded into the DW.

 

The WHY NOTs

Increases processing time: data is first loaded into staging tables, transformed and then loaded into the DW, so the additional step causes a slight increase in processing time.
Increases latency: source changes take more time to reflect in the DW. In real-time / near-real-time DWs, staging is usually not included.
Environment requirements: additional database servers / storage space are required.

The overhead that staging imposes is quite small compared to the benefits we derive. But then again, it completely depends on what suits a given requirement. The above points may, however, help us in making a choice.

 
Shraddha Tambe | Helical IT Solutions

 

Issues while implementing loops in Pentaho Data Integration

Generally, for implementing batch processing we use the looping concept provided by Pentaho in its ETL jobs. Loops in PDI are supported only in jobs (.kjb); they are not supported in transformations (.ktr).

While implementing loops in PDI, you will come across many blogs suggesting the use of a 'Wait for' step whose output hop is joined back to a previous step, forming the loop.

However, the limitation of this kind of looping is that it causes recursive stack allocation by the JVM during job execution, and the system may run out of memory after a high number of iterations (depending on the available memory), crashing the program. So it is not advisable to run a high number of iterations when implementing loops this way in PDI.

Possible Solutions:

1. The first thing to do is minimize the number of iterations. This style of looping works properly up to about 500 iterations, so try to keep it below that.

2. Never use loops for scheduling. If we use the looping concept for scheduling purposes, it goes into an infinite loop, which crashes the whole program.

3. Increase your batch size so that the number of iterations is lower. Take this into consideration when implementing external batch processing.

4. For incrementing the loop variable, it is advisable to use a separate transformation instead of a JavaScript step, because the JavaScript step consumes more memory than a separate transformation. Create a new transformation, use the Formula step to increment the values and then set those variables.

5. Suggested approach for infinite looping: one possible way is to use the settings of the 'Start' step. Set the 'Repeat' flag and add an interval configuration. This causes the job to be completely re-initialized as a new instance and does not cause any memory issue.

 

Thanks,

Nitish Kumar Mishra

Handling Null values in Talend and Pentaho

Handling Null values in Talend:

Mark all the columns as Nullable in tMap.

Now, for the columns that can contain null values, you can assign default values by writing a ternary operator (for example, something like row1.city == null ? "Unknown" : row1.city, with hypothetical column names), as shown in the screenshot below.

[Screenshot: tMap expressions using ternary operators to replace nulls]

Handling Null values in Pentaho:

In Pentaho, there is a step that handles null values by providing a default value in their place:

‘If field value is null’ is the component name.

You can assign default values for nulls per field, and also per data type.

You can configure the component as shown below:

[Screenshot: 'If field value is null' step configuration]

Thank you.

Lalitha

Slowly Changing Dimension in Pentaho Data Integration(Kettle)

Slowly changing dimensions (SCD) are a common mechanism in data warehousing. An SCD is a dimension that changes slowly over time rather than on a regular schedule. In a data warehouse there is a need to track changes in dimension attributes in order to report historical data; in other words, implementing one of the SCD types should enable users to determine the correct dimension attribute value for a given date. There are various approaches to dealing with data using SCD. The most commonly used approaches are:

1. SCD Type I: update the existing record (overwrite).

2. SCD Type II: create a new record and flag it as the current version, keeping the previous record as history.

3. SCD Type III: add a new column which keeps the previous value; here the history is limited.

In Kettle, there are steps through which we can implement SCD on a dimension. One such step is Dimension lookup/update.


The Dimension lookup/update step allows you to perform the Type I and Type II approaches of SCD.


Keys: The key fields are used to look up values in the destination (dimension) table.

Technical key: This is basically the surrogate key, which is generated when a new record is inserted.


Fields: The Fields tab lists the columns of the dimension table on which you want to perform operations. A number of optional fields (in the Fields tab) are automatically managed by the step; you can specify the table field name in the 'Dimension Field' column.
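
As a rough sketch (a hypothetical customer dimension; the actual DDL can be generated from the step's SQL button), a dimension table managed by this step typically carries a technical key plus version and date-range columns alongside the business attributes:

  CREATE TABLE dim_customer (
    customer_tk   BIGINT,         -- technical (surrogate) key managed by the step
    version       INT,            -- incremented for every Type II change
    date_from     DATETIME,       -- validity start of this version
    date_to       DATETIME,       -- validity end of this version
    customer_id   VARCHAR(20),    -- natural/business key used for the lookup
    customer_name VARCHAR(100),
    city          VARCHAR(50)
  );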

Type of Dimension update:

Insert: This is the SCD Type II mechanism, where a new row is inserted when changes are detected by the lookup. If the record coming from the source is not found in the dimension, it is inserted; if it is found but the looked-up values have changed, a new version of the record is inserted.

Update: This is the conventional SCD Type I approach: the attributes in the last version of the dimension record are updated in place. If we set all the fields to Update, the step performs pure Type I. If some fields are Insert and some are Update, it applies Type II, and the fields marked Update only update the data in the latest version.

Punch through: This mechanism is applied to fields where the data changes very rarely, and when it does change it is just a correction (for example, product names or null values in price columns); the corrected value is written through to all versions of the record. It can also be used for Type I, but it is slower than the Update mechanism.

Note: If you mix Insert, Update and Punch through in one dimension, it works like a hybrid slowly changing mechanism, which is Type 6.

Date of last insert or update (without stream field as source): adds and manages a Date field.

Date of last insert (without stream field as source): adds and manages a Date field.

Date of last update (without stream field as source): adds and manages a Date field.

Last version (without stream field as source): adds and manages a Boolean field (converted into Char(1) or a Boolean database data type, depending on your database connection settings and the availability of such a data type). This acts as the current-valid-entry indicator for the last version: when a Type II attribute changes and a new version is created (to keep track of the history), the 'Last version' attribute of the previous version is set to 'False/N' and the new record with the latest version is set to 'True/Y'.

Note: An empty dimension entry is added automatically to the dimension table when the step is first run. If you have NOT NULL fields in your table, adding this empty row will fail, and with it the entire step. So make sure that you have a record with the ID field = 0 or 1 in your table if you don't want PDI to insert a potentially invalid empty record.

This step works faster if you apply the caching mechanism. A cache size of 0 caches as many rows as possible, until your JVM runs out of memory; use this option wisely, and only with dimensions that can't grow too large. A cache size of -1 means that caching is disabled.

Note: There are various other steps which can perform SCD in Kettle. The Insert/Update step can also be used, which performs an SCD Type II-like mechanism: if a new record is found it inserts it, otherwise it updates the dimension. Similarly, the Update step performs the SCD Type I mechanism. However, these steps are very slow compared to Dimension lookup/update. You can also use Merge Rows (diff) followed by Synchronize after merge, which also performs an SCD mechanism. This is the fastest of all the above; however, the number of records in the dimension and in staging should be the same, otherwise it won't work.

Thanks,

Nitish

Full Outer Join by using tmap in Talend

In tMap, the default Join Model is Left Outer Join, and Inner Join is available as another Join Model; there is no Full Outer Join model. If you need the result of a full outer join, you can follow the process below.

I have Customers.csv and City.csv as my two delimited files, where City is the common field in both files.

First, please find the data in the two csv files as below:

Customers.csv:

[Screenshot: Customers.csv data]

City.csv:

[Screenshot: City.csv data]

Following is the Talend Job :

[Screenshot: Talend job design]

First, create the File Delimited metadata in the repository, as shown in the picture above.

Take Customers as the main link and City as the lookup, and join both in a tMap component. In this tMap, select Inner Join as the Join Model, catch the inner join output reject records, and configure it as shown in the screenshot below:

[Screenshot: tMap1 configuration]

Now, as shown in the job image, take the same City and Customers inputs again, but this time make City the main link and Customers the lookup.

In this tMap, select Left Outer Join as the Join Model and configure the tMap2 component as shown below:

[Screenshot: tMap2 configuration]

Now take a tUnite component to combine the inner join reject rows from tMap1 with the left outer join rows from tMap2. Then take a tSortRow to sort the result in ascending/descending order (I sorted on the custid column), and store the final result in the fulljoin.csv file.
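
Conceptually, this two-tMap-plus-tUnite pattern emulates what a full outer join would produce in SQL; a rough equivalent (hypothetical column names) is:

  -- Rows driven by City with their matching customers (tMap2: left outer join)
  SELECT c.custid, c.customer_name, ci.city
  FROM city ci
  LEFT JOIN customers c ON c.city = ci.city
  UNION ALL
  -- Customers whose city has no match (tMap1: inner join rejects)
  SELECT c.custid, c.customer_name, c.city
  FROM customers c
  WHERE NOT EXISTS (SELECT 1 FROM city ci WHERE ci.city = c.city);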

And the output is as follows:

[Screenshot: fulljoin.csv output]

Thanks and Regards,

Lalitha

Looping and batch processing in PDI

Implementing Loops and batch processing in Kettle

What is batch processing?
Batch processing is basically a series of executions of a job from source to destination, where the data is divided into smaller batches and transferred to the destination. This helps in faster execution and also helps with memory management. It is a popular technique in ETL processes, where the data is generally very large.

Creating loops in PDI:
Let's say you want to implement a for loop in PDI to send 10 lakh (1 million) records in batches of 100. The looping technique is complicated in PDI because it can only be implemented in jobs, not in transformations, as Kettle doesn't allow loops in transformations.
So, in order to implement loops in PDI, you need a main job which calls the subjob where the loop is designed, and the transformation which contains the logic to send data from source to destination.

Below is the overall design of the subjob which implements the batch processing and the looping.

[Screenshot: subjob design]

 

NrBatches: This is the transformation which prepares the batches: it counts the total number of rows in the source table and, based on that count, works out how many batches are needed. Kindly check the screenshots below for more details.

NrBatches Transformation design:

[Screenshot: NrBatches transformation design]

 

The Table input step is used to fetch the row count and the minimum value of the primary key (ID is the primary key) from the source table. The Modified JavaScript Value step is used to assign the batch size and to compute the number of batches from the batch count (the row count of the table) and the batch size, as sketched below. Kindly look at the screenshot for a better understanding.

[Screenshot: Modified JavaScript Value step]
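
As a rough sketch (hypothetical source table and column names; the variable names follow the description above), the Table input query and the batch calculation look like this:

  -- Table input: row count and minimum primary key of the source table
  SELECT COUNT(*) AS batch_count, MIN(id) AS min_id FROM source_table;

  -- Modified JavaScript Value / Formula step, conceptually:
  --   batch_size    = 100
  --   nr_of_batches = CEIL(batch_count / batch_size)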

 

The Set Variables step inside the NrBatches transformation sets the batch size, batch count and number of batches as variables, so that they can be used in the other transformations and job entries.

Table Exists: This step checks whether the destination table is present. If it is not present, the SQL statement to create the table is executed and the job then moves to the next step; if it exists, the job goes straight to the next step.

Set variables: The Set Variables entry in the job is used to set the incrementer value (just like the i variable in a for loop). The initial value of the incrementer is set here. Kindly refer to the screenshot below for more clarification.

[Screenshot: Set Variables job entry]

 

Simple Evaluation:

This step is the equivalent of a conditional statement, where the iterator is checked against the loop condition (much like if (i < 50) or while (i < 50)). Here, the variable set in the previous step is checked against the total number of batches (computed in the NrBatches transformation). Kindly check the screenshot below for more clarification.

[Screenshot: Simple Evaluation step]

 Transformation Step:

Inside the transformation, set the start and end flags for the current batch using a JavaScript step; the screenshot after the sketch below shows how the flags are set. Alternatively, you can use the Formula step or the Calculator step, which is a better option if you have a large amount of data.
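
Assuming the batch transformation reads the source by primary-key range (hypothetical table, column and variable names), its Table input would use these flags roughly like this:

  -- Read only the current batch, using the flags set for this iteration
  -- (requires 'Replace variables in script' to be enabled on the Table input step)
  SELECT *
  FROM source_table
  WHERE id >= ${START_FLAG}
    AND id <  ${END_FLAG};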

[Screenshot: JavaScript step setting the start and end flags]

Increment Step:

Now, since we are using a loop, there has to be an increment. Here we used a JavaScript step which increments batch_no (see the Set variables step), and thereby the looping mechanism is completed. Here also you can use another transformation instead of the JavaScript step, which processes faster and consumes less memory. Please see the screenshots below for more understanding.

-Javascript Step:

[Screenshot: JavaScript increment step]

-Using Another Transformation:

[Screenshot: increment transformation]

The Formula step is used to increment the batch number. See the screenshot below for more clarification.

[Screenshot: Formula step configuration]

So this is how looping works in Kettle. But there are some limitations: this looping technique consumes a lot of memory if the number of iterations is very high. Use this technique only for problems where the number of loop executions is low; that will help keep the performance up.

Thanks,

Nitish

 

Audit Logs in Pentaho Data Integration (PDI)

Audit logs at the job level and transformation level are very useful in ETL projects to track details such as job name, start date, end date, transformation name, errors, number of lines read, number of lines written, number of lines input, number of lines output, etc.

I have created a simple PDI job to track audit logs at the job level, job entry level and transformation level in a MySQL database. The process is as follows:

I created a Kettle job FileToDB.kjb, which contains two job entries (Start and a Transformation entry) and one database connection (PSQL_Log). I created a separate database 'PDI_Audit_Logs' in the MySQL DB and configured its details as that database connection.

[Screenshot: FileToDB.kjb job]

Right-click on the work area and select Job settings, as shown in the screenshot below.

[Screenshot: Job settings menu]

This pops up the Job properties window. Click on the Log tab and select the 'Job entry log table' option; it displays the fields available at this log level. You can enable all the fields, or the default selection is also fine. Now configure the log table connection as shown in the screenshot below:

[Screenshot: Job properties, Log tab]

Log Connection –> the connection we have already created (in this case the MySQL DB).

Log Schema –> the existing database name.

**Log Table –> the table name in which you want to store the tracked details. It can be an existing table name; we will see the process for a new table in the later steps.

Log line timeout –> the number of days for which you want to store the log details.

**If you are creating the log table for the first time, you have to click the SQL button at the bottom; otherwise click the OK button. The SQL button pops up the SQL statement that creates the table with the column names shown above. I gave the log table the name "FileToDB_JobEntryLog": my job name plus the level of logging.

[Screenshot: generated SQL statement]
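
For a rough idea of what the generated statement contains (the exact columns depend on the fields you enabled in the dialog; the DDL below is only a representative sketch, not the exact output of the SQL button):

  CREATE TABLE FileToDB_JobEntryLog (
    ID_BATCH       INT,
    CHANNEL_ID     VARCHAR(255),
    LOG_DATE       DATETIME,
    JOBNAME        VARCHAR(255),
    JOBENTRYNAME   VARCHAR(255),
    LINES_READ     BIGINT,
    LINES_WRITTEN  BIGINT,
    LINES_UPDATED  BIGINT,
    LINES_INPUT    BIGINT,
    LINES_OUTPUT   BIGINT,
    LINES_REJECTED BIGINT,
    ERRORS         BIGINT,
    RESULT         CHAR(1)
  );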

Click on the Execute button; it creates the table in the given schema/database and prompts the window below. Click OK and then close the SQL editor.

[Screenshot: SQL execution confirmation]

The job entry level log configuration is now done. Next, select the 'Job log table' section to configure the log connection at the job level, and configure the values as shown in the screenshot below.

For the Log Table Name, the same steps have to be followed as explained above.

[Screenshot: Job log table settings]

The job audit log settings are now done. Open the transformation: it has a system info step, a file input and a DB output step connected with hops. The DB output schema is different from the audit log schema, so we have created two connections, PSQL and PSQL_Log.

[Screenshot: transformation design]

Right click on the work area and select Transformation Settings.

[Screenshot: Transformation settings menu]

This pops up the Transformation properties window. Select the Logging tab and then click on the Transformation option; it displays the fields that can be tracked from here.

[Screenshot: Transformation properties, Logging tab]

For the fields LINES_READ, LINES_WRITTEN, LINES_UPDATED, LINES_INPUT, LINES_OUTPUT and LINES_REJECTED, you should specify the step name from which each metric should be tracked.

As with the job settings, here too you have to configure the log connection and click on the SQL button if you are creating the log table for the first time; otherwise click OK.

Audit logs are now configured for the transformation as well. Execute the job and check the database for the log tables: three tables (job entry level, job level and transformation level) are created under the given schema.

Job Entry Level Table:

As there are two job entries, details are generated for both of them:

[Screenshot: job entry level log table]

Job Level Log Table:

[Screenshot: job level log table]

Transformation Level Log Table:

[Screenshot: transformation level log table]

Hope this topic is useful.

Thank you.