Data Analysis Techniques


How data needs to be analysed depends on the end goal rather than on the source: that goal is typically to support analysis and decision making using the data. Data analysis these days typically includes reporting, multidimensional analysis and data mining, which relate to “Display”, “Analyse” and “Discover” respectively. Depending on the type of analysis, the requirements on the source data may vary.

     If reporting is the objective, the simplest data source gives the best results. Query and reporting capability primarily consists of selecting the associated data elements, perhaps summarizing them and grouping them by some category, and presenting the results. This involves retrieving the relevant data from the data warehouse, transforming it into the appropriate context, and displaying it in a readable format. Finally, the report is delivered to the end user in the desired output format, be it a graph, pie chart or table, on the required output medium.
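
As a hedged illustration, a typical report query just selects, groups and summarizes data; the table and column names below are assumptions, not a fixed schema:

SELECT s.store_name, p.product_category, SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN store_dim s   ON f.store_key   = s.store_key
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY s.store_name, p.product_category
ORDER BY total_sales DESC;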

     If the objective is to perform multidimensional data analysis, a dimensional data model is more appropriate. This type of analysis requires a data model that enables fast and easy access to the data on the basis of any of numerous combinations of analysis dimensions. For example, you may want to know how many units of a specific product were sold on a specific day, in a specific store, in a specific price range. Multidimensional analysis enables users to look at a large number of interdependent factors involved in a business problem and to view the data in complex relationships. End users are interested in exploring the data at different levels of detail, which is determined dynamically. The complex relationships can be analyzed through an iterative process that includes drilling down to lower levels of detail or rolling up to higher levels of summarization and aggregation. Another such operation lets the user take a different viewpoint on the results by changing the way the dimensions are arranged in the result (commonly called pivoting). Like query and reporting, multidimensional analysis continues until no more drilling down or rolling up is needed.
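
Rolling up is just a change of grouping level; as a hedged sketch (hypothetical names again, and assuming the database supports GROUP BY ROLLUP), the same fact can be summarized with subtotals at the region level:

SELECT s.region, p.product_category, SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN store_dim s   ON f.store_key   = s.store_key
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY ROLLUP (s.region, p.product_category);   -- detail rows plus region-level subtotals and a grand total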

     As said before, data mining is essentially “Discovery”. This discovery could take the form of finding significance in relationships between certain data elements, a clustering together of specific data elements, or other patterns in the usage of specific sets of data elements. After finding these patterns, the algorithms can infer rules. These rules can then be used to generate a model that can predict a desired behavior, identify relationships among the data, discover patterns, and group clusters of records with similar attributes. Data mining is most typically used for statistical data analysis and knowledge discovery. Statistical data analysis detects unusual patterns in data and applies statistical and mathematical modeling techniques to explain the patterns. Data mining is data driven. There is a high level of complexity in stored data and data interrelations in the data warehouse that is difficult to discover without data mining. Data mining therefore offers new insights into the business that may not be discovered with query and reporting or multidimensional analysis, by giving us answers to questions we had not thought to ask.

     Together, these data analysis techniques offer new insight into the business by turning a close look at the data into quality information, which in turn supports business intelligence and profitable business decisions that drive growth across different sectors.

Types of Dimensions in Data warehouse


What is Dimension?

A dimension table contains descriptive data about the business. The primary keys of the dimension tables are used in fact tables through foreign key relationships, and the remaining columns in the dimension hold descriptive information about the business objects.

Eg: Product,Customer,Orders,Company,Date etc.

Below are the different types of Dimensions:

1) Slowly Changing Dimensions (SCD): Dimensions that change slowly over time rather than on a regular schedule.

Below are some popular approaches of SCDs:

Type 0: This is the passive method. Records, once inserted, are never changed.

Type 1: This method overwrites the old record with the new values without tracking any historical data.

[Screenshots: SCD Type 1 example]
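
A minimal SQL sketch of a Type 1 change, assuming a hypothetical Customer_Dim table keyed on Customer_Id:

UPDATE Customer_Dim
SET City = 'Hyderabad'        -- the new value simply overwrites the old one
WHERE Customer_Id = 1001;     -- no history of the previous city is kept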

Type 2: This method tracks historical data by using a version number or by using start-date and end-date columns. The drawback is that the table grows vertically and requires more space.

[Screenshot: SCD Type 2 example]
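
A minimal Type 2 sketch for the same hypothetical table, using Start_Date/End_Date columns: the current version is closed and a new version is inserted.

-- close the current version of the record
UPDATE Customer_Dim
SET End_Date = CURRENT_DATE
WHERE Customer_Id = 1001
  AND End_Date IS NULL;

-- insert the new version with an open end date
INSERT INTO Customer_Dim (Customer_Id, City, Start_Date, End_Date)
VALUES (1001, 'Hyderabad', CURRENT_DATE, NULL);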

Type 3: This method also tracks historical data, by adding an extra column that holds the previous value. It preserves only limited history.

[Screenshot: SCD Type 3 example]

Type 4: This method uses a separate history table for storing all the historical data, while the main table holds only the current data. Both tables are linked by the same surrogate key.

[Screenshot: SCD Type 4 example]

Type 6: This is called the hybrid method because it combines types 1, 2 and 3. It uses flag values and effective dates to distinguish between old and new records.

2) Rapidly Changing Dimensions: Dimensions that contain rapidly changing attributes. If we maintain full history for these tables, we will quickly run into memory and performance issues. The solution is to move the rapidly changing attributes into mini dimension tables, similar to a Type 4 SCD: the main table holds the current values and the mini dimension holds the historical data.

3) Junk Dimensions: In a data warehouse design we often come across flag and indicator values. We can combine them into one single junk dimension table, so that the fact table does not need multiple columns to store the primary keys of several small flag tables.

Eg: [Screenshot: junk dimension example]

As shown above, if we combine Gender and Marital_Status into a single Gender_Marital_Status junk dimension, we only need one column in the fact table. A minimal SQL sketch of such a table follows.
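
The sketch below assumes hypothetical names and two single-character flags; the fact table then carries only the Junk_Key column.

CREATE TABLE Gender_Marital_Status_Dim (
    Junk_Key       INT PRIMARY KEY,   -- the one surrogate key referenced by the fact table
    Gender         CHAR(1),           -- 'M' / 'F'
    Marital_Status CHAR(1)            -- 'Y' / 'N'
);

-- pre-populate every combination of the flags
INSERT INTO Gender_Marital_Status_Dim VALUES (1, 'M', 'Y');
INSERT INTO Gender_Marital_Status_Dim VALUES (2, 'M', 'N');
INSERT INTO Gender_Marital_Status_Dim VALUES (3, 'F', 'Y');
INSERT INTO Gender_Marital_Status_Dim VALUES (4, 'F', 'N');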

4) Inferred Dimensions: When a dimension row that a fact needs is not yet available, we can insert a dummy (inferred) record for that ID and use it in the fact table. Once the real details arrive, we update the dimension record.

5) Conformed Dimensions: Dimensions that are shared by multiple fact tables are known as conformed dimensions.

Eg: Customer Dimension and Product Dimension are required in the Shipment fact, Sales fact and Service Request fact.

6) Degenerate Dimensions: A degenerate dimension does not have its own dimension table; it is a column that is part of the fact table but does not map to any dimension.

Eg: Invoice_Number column in Fact table

[Screenshot: degenerate dimension example]

7) Role Playing Dimensions: The same dimension used for multiple purposes in the same fact table.

Eg: Date Dimension can be used as Date of Sale, Date of Hire etc.

8) Shrunken Dimensions: A shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table, but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the shrunken dimension.

9) Static Dimensions: Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with Status codes — or it can be generated by a procedure, such as a Date or Time dimension.

Thank You

Lalitha

Helical Insight for ETL Tester (QA)


Let’s take an example: your ETL developer has just written an ETL script which loaded data into the Data Warehouse, and a few million records got inserted.

For QA, it is extremely difficult and time consuming to test all (millions) of those records. Moreover, there are other problems:

  • Some testers might not be well versed with writing complex SQL queries.
  • There might be multiple data sources involved in generating the Data Warehouse.
  • Writing multiple queries on multiple data sources and then combining them into one single result set would be a painstaking task.

Now think of Helical Insight as a friend which takes over most of these tasks, allowing you to focus on critical thinking rather than on stare-and-compare testing all day. Wondering how Helical Insight can help you achieve this?

Here is your answer,

  • Helical Insight is capable of communicating with any of your data source(s), such as CSV, Excel, MySQL, Oracle, SQL Server, Postgres etc.
  • Helical Insight has a drag and drop feature to build your test cases without prior knowledge of SQL. It generates the SQL queries for you, including the joins between multiple entities. Because it understands your data structure, it can generate the query whether you are working with different data source(s), a DW or a normalized database.
  • Using this drag and drop facility you can create all your test cases in the form of reports. For example, a report that counts the records in a dimension can be saved and compared against a similar report built on the source system (see the reconciliation sketch below). In a migration project the counts should match exactly; otherwise we know something is wrong with the ETL script.
  • Once you have saved these reports (test cases), you can re-execute them again and again to verify your data. By now you have probably realised that Helical Insight is a very useful tool, with features such as generating the SQL query for any database or data source(s), executing queries on any database or data source(s), and saving the reports / test cases / SQL queries for future use.
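
For reference, a typical reconciliation test case boils down to a pair of count queries like the hedged sketch below (schema and table names are hypothetical), run against the source system and the data warehouse and then compared:

-- run against the source system
SELECT COUNT(*) AS src_customer_count FROM source_db.customers;

-- run against the data warehouse
SELECT COUNT(*) AS dw_customer_count FROM dw.customer_dim;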

Processing Multi schema files in Talend


Sometimes we get input files that contain data in several different schema formats within a single file. We can process such files and split the data schema-wise by using the tFileInputMSDelimited component.

Raw Data:

Here is a sample raw data file in which 3 different schema types exist.

[Screenshot: sample raw data file]

If you observe the first column, it represents the type of each row:

ORD means it is Order details

CUS means it is Customer Details

and TRS means it is Transactions.

All these 3 types exist in the same file, but our requirement is to write each of the 3 types of data to a separate file/output.

Job:

Create the below job to get separate files/outputs for each schema type

In the palette you will find the tFileInputMSDelimited component; drag it on to the work area.

Now open component tab to configure it as shown below:

[Screenshot: tFileInputMSDelimited component settings]

Click on the Multi Schema Editor and configure it as shown in the below screenshot:

[Screenshot: Multi Schema Editor configuration]

If the column separators are different for each schema you can enable Use Multiple Separators option.

After getting the preview, click on Fetch codes.

[Screenshot: Fetch codes result]

Now take whichever 3 output components you need. I selected tLogRow components to display the result on the console.

Execute the job:

[Screenshot: job execution]

Now check the output: you will get three different tables based on the schema type.

Output:

[Screenshot: console output]

Thanks and Regards,

Lalitha

Staging DB in Datawarehousing – The WHYs and WHY NOTs

 

Whenever we talk about designing data warehouses, a staging database is one consideration we have to make, and it depends on many factors whether you choose to have one or not. When we start small, for example implementing small datamarts, we may not see the need for a staging database, but as DW implementations grow in size and complexity, staging DBs have a lot of benefits to offer. Staging DBs are often the place to dump source data, the workplace for creating temporary intermediate tables while processing the transformation steps of ETLs, the place to keep what is good and reject what is bad, and so on.

Let’s quickly look through the PROs and CONs of having a staging DB in a DW implementation:

 
The WHYs

Source data accumulation: If data is being pulled from multiple source systems and not all of it is available at the same time, we may need to accumulate the raw data pulled at different times and then integrate/merge and load it.

Data transformation, cleansing, data alignment: The staging DB may be the workplace where the ETL does the required cleansing, applies transformation rules and aligns data coming from various systems.

Decoupling of Source from Datawarehouse: Staging DB helps decouple the Source systems from the DW. Source systems are accessed only by ETL jobs which pull data into staging, or they may even choose to push data there. This greatly reduces the amount of time that source systems are accessed by DW loading jobs, reducing contention too.

ETL Performance: Loading staging tables is simple as they are mostly exact replicas of the source tables, minus the indexes and foreign key relationships. As most of the joins, cleansing and calculations are done on staging tables, and may be stored in intermediate temporary tables, loading the DW from staging tables is also simpler. This greatly improves ETL performance as well as the quality of the DW data.
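
As a hedged illustration, a staging table is often just a structural copy of the source table, with no constraints or indexes (names are hypothetical and the exact syntax varies by database):

-- create the staging table as an empty structural copy of the source table
CREATE TABLE stg_orders AS
SELECT * FROM src.orders WHERE 1 = 0;

-- each load then simply truncates and refills it
TRUNCATE TABLE stg_orders;
INSERT INTO stg_orders SELECT * FROM src.orders;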

Handle failure, debugging, restart, recovery: As staging DBs hold snapshots of source data, can have extensive logging, and keep the intermediate tables used for data transformations, they make it easier to debug errors and failures. In case of restarts after failure, the step of pulling the source data may not have to be redone.

Data reconciliation / pre/post-load validations: This is very similar to the point above. Staging DBs are excellent for capturing auditing information which can be used for reconciling data, validating and rejecting bad data, and making sure the data we loaded is correct and consistent with the source, with nothing missed out.

Change Data Capture: The staging DB is also used as one of the mechanisms for CDC. For this, the copies of the source data tables from the previous load are preserved until the next execution. The current and previous source data are compared to identify changes. This is mostly used in scenarios where the source systems have no means of supporting change detection.
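
A minimal sketch of that comparison, assuming hypothetical stg_customers_current and stg_customers_previous snapshots keyed on customer_id:

-- new or changed rows since the previous load
SELECT cur.*
FROM stg_customers_current cur
LEFT JOIN stg_customers_previous prev
       ON cur.customer_id = prev.customer_id
WHERE prev.customer_id IS NULL        -- brand new row
   OR cur.city   <> prev.city         -- changed attribute
   OR cur.status <> prev.status;      -- (nullable columns need a NULL-safe comparison)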

Simpler, maintainable ETLs: With staging, we divide the Extract-Transform-Load process into separate steps. Combining the extraction, cleansing, transformation logic, pre-calculations and summarizations into one single ETL makes it very complex and a single point of failure. With a staging step, the ETLs are smaller, simpler and easier to maintain.

Prevent orphan records / race conditions: When we fetch data from the source directly and load it into the DW, there is a chance that we pull partial updates happening in the transactional system, or that new data is being inserted as we pull, causing race conditions. With staging, the data from the source is pulled in one single go, preventing such problems.

Backup of source: If the staging DB is archived / retained for some specific period of time, it can act as a backup of the source data loaded into the DW.

 

The WHY NOTs

Increased processing time: as data is first loaded into staging tables, transformed and then loaded into the DW, the additional step slightly increases processing time.
Increased latency: source changes take more time to reflect in the DW. In real-time / near-real-time DW implementations, staging is not usually included.
Environment requirements: additional database servers / storage space are required.

The overhead that staging imposes is quite small compared to the benefits we derive. But then again, it completely depends on what suits a given requirement. The above points may however help us make a choice.

 
Shraddha Tambe | Helical IT Solutions

 

Issues while implementing loops in Pentaho Data Integration


Generally, for implementing batch processing we use the looping concept provided by Pentaho in ETL jobs. Loops in PDI are supported only in jobs (kjb); they are not supported in transformations (ktr).

While implementing loops in PDI, we came across many blogs suggesting the use of a “Wait for” step with its output hop joined back to the previous step.

However, the limitation of this kind of looping is that it causes recursive stack allocation by the JVM during job execution, and the system may run out of memory after a large number of iterations (depending on the memory available to the system). When that happens, the JVM runs out of memory and the program crashes. So it is not advisable to use a high number of iterations when implementing loops in PDI.

Possible Solutions:

1. The first thing to do is to minimize the number of iterations. In our experience the looping works properly up to about 500 iterations, so try to keep it below that.

2. Never use loops for scheduling. If the looping concept is used for scheduling purposes, it goes into an infinite loop which eventually crashes the whole program.

3. Increase your batch size so that the number of iterations is smaller. Take this into consideration when implementing external batch processing.

4. For incrementing the loop variable, it is advisable to use a separate transformation instead of a JavaScript step, because JavaScript consumes more memory than a separate transformation. Create a new transformation, use the Formula step to increment the values and then set those variables.

5. Suggested approach for infinite looping: one possible way is to use the settings of the ‘Start’ step. Set the ‘Repeat’ flag and add an interval configuration. This causes the job to be completely re-initialized as a new instance and does not cause any memory issues.

 

Thanks,

Nitish Kumar Mishra

Handling Null values in Talend and Pentaho


Handling Null values in Talend:

Make all the columns Nullable in tMap.

Now, for the columns which may contain null values, you can assign default values by writing a ternary operator expression, as shown in the below screenshot.

[Screenshot: tMap expression using a ternary operator to handle nulls]
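
For example (with a hypothetical column name), an expression such as row1.City == null ? "Unknown" : row1.City in the output column returns a default value whenever the incoming City field is null.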

Handling Null values in Pentaho:

In Pentaho, to handle null values we have a component to provide a default value for the null values:

‘If field value is null’ is the component name.

You can assign a default value for nulls per field and also per data type.

You can configure the component as shown below:

[Screenshot: ‘If field value is null’ step configuration]

Thank you.

Lalitha

Slowly Changing Dimension in Pentaho Data Integration(Kettle)


A slowly changing dimension (SCD) is a common mechanism in data warehousing. By definition, it is a dimension that changes slowly over time rather than on a regular schedule. In a data warehouse there is a need to track changes in dimension attributes in order to report historical data; in other words, implementing one of the SCD types should enable users to obtain the correct attribute value of a dimension for any given date. There are various approaches to dealing with such data. The most commonly used approaches are:

1. SCD Type-I: Updates the existing record (overwrite).

2. SCD Type-II: Creates a new record for each change and flags the older versions as historical.

3. SCD Type-III: Adds a new column which keeps the last value of the attribute. Here the history is limited.

In Kettle, there are steps through which we can implement SCD on a dimension. One such step is Dimension lookup/update.

[Screenshot: Dimension lookup/update step]

The Dimension lookup/update step allows you to implement the Type-I and Type-II approaches of SCD.

[Screenshot: Dimension lookup/update – Keys tab]

Keys: The keys are used to look up the values against the destination (dimension) table.

Technical key: This is basically the surrogate key, which is created whenever a new record is found.

[Screenshot: Dimension lookup/update – Fields tab]

Fields: These are the fields of the dimension table on which you want to perform operations. A number of optional fields (in the “Fields” tab) are automatically managed by the step; you can specify the table field name in the “Dimension Field” column.
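
To make this concrete, a Type II dimension managed by this step typically looks like the hedged sketch below; column names such as customer_tk, version, date_from and date_to are assumptions and map to whatever you configure in the step:

CREATE TABLE customer_dim (
    customer_tk BIGINT PRIMARY KEY,   -- technical (surrogate) key generated by the step
    version     INT,                  -- incremented on every Type II change
    date_from   TIMESTAMP,            -- start of validity of this version
    date_to     TIMESTAMP,            -- end of validity of this version
    customer_id INT,                  -- natural/business key used in the lookup
    name        VARCHAR(100),
    city        VARCHAR(100)
);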

Type of Dimension update:

Insert: This is the SCD-II mechanism, where a new row is inserted whenever changes are found based on the lookup. If the record coming from the source is not found in the dimension, it is inserted; if it is found but its values have changed, a new version of the record is inserted.

Update: This is the conventional SCD-I approach; the attributes in the last version of the dimension record are updated in place. If all the fields are set to Update, the step behaves as pure SCD-I. If some fields are set to Insert and others to Update, the step behaves as SCD-II, and the fields marked Update are only updated on the latest version.

Punch Through: This mechanism is applied to fields whose data changes very rarely, and when it does change it is just a correction, for example names of products or null values in price columns. It can also be used for SCD-I, but it is slower than the Update mechanism.

Note: If you mix Insert, Update and Punch through in one dimension, it works like a hybrid slowly changing mechanism, which is Type 6.

Date of last insert or update (without stream field as source): adds and manages a Date field.

Date of last insert (without stream field as source): adds and manages a Date field.

Date of last update (without stream field as source): adds and manages a Date field.

Last version (without stream field as source): adds and manages a Boolean field (converted into Char(1) or a boolean database data type, depending on your database connection settings and the availability of such a data type). This acts as the current-valid-entry indicator for the last version: when a Type II attribute changes and a new version is created (to keep track of the history), the ‘Last version’ attribute of the previous version is set to ‘False/N’ and the new record, with the latest version, is set to ‘True/Y’.

Note: A special dimension entry (the empty/unknown row) is added automatically to the dimension table when the update is first run. If you have “NOT NULL” fields in your table, adding this empty row will fail and the entire step will fail with it. So make sure that you have a record with the ID field = 0 or 1 in your table if you don’t want PDI to insert a potentially invalid empty record.

This step works faster if you use its caching mechanism. A cache size of 0 caches as many rows as possible, until your JVM runs out of memory; use this option wisely with dimensions that can’t grow too large. A cache size of -1 means that caching is disabled.

Note: There are various other steps which can implement SCD in Kettle. The Insert/Update step can also be used, which performs the SCD-II mechanism: if a new record is found it inserts it, otherwise it updates the dimension. Similarly, the Update step performs the SCD-I mechanism. However, these steps are very slow compared to Dimension lookup/update. You can also use Merge rows (diff) followed by Synchronize after merge, which likewise implements an SCD mechanism; this is the fastest of all the above, but the number of records in the dimension and in staging should match, otherwise it won’t work.

Thanks,

Nitish

Full Outer Join by using tmap in Talend


In tMap, the default Join Model is Left Outer Join, and Inner Join is available as another Join Model. There is no built-in Full Outer Join, so if you need the result of a full outer join you can follow the process below.
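
For reference, the result we are going to emulate is the SQL full outer join below (a hedged sketch: custid and city come from the sample files, the other column names are assumptions, and it requires a database that supports FULL OUTER JOIN):

SELECT c.custid, c.custname, c.city, ct.state
FROM customers c
FULL OUTER JOIN city ct ON c.city = ct.city
ORDER BY c.custid;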

I have Customers.csv and City.csv as my two delimited files, where City is the common field in both files.

First, please find the data in the two csv files as below:

Customers.csv:

[Screenshot: Customers.csv data]

City.csv:

[Screenshot: City.csv data]

Following is the Talend Job :

[Screenshot: Talend job design]

First create a FileDelimited Metadata in the repository as shown in above picture.

Consider Customers as the main link and City as the lookup, and join both in a tMap component. In tMap, select Inner Join as the Join Model, catch the inner join reject records as the output, and configure it as shown in the below screenshot:

[Screenshot: tMap1 configuration]

Now, as shown in the job image, take the same City and Customers inputs again, but this time use City as the main link and Customers as the lookup.

In tMap, select Left Outer Join as the Join Model and configure the tMap2 component as shown below:

[Screenshot: tMap2 configuration]

Now take a tUnite component to combine the inner join reject rows from tMap1 with the left outer join rows from tMap2. Then use tSortRow to sort the result in ascending/descending order (I sort on the custid column), and finally store the result in the fulljoin.csv file.

And the output is as follows:

[Screenshot: fulljoin.csv output]

Thanks and Regards,

Lalitha

Looping and batch processing in PDI


What is batch processing?
Batch processing is the execution of a series of jobs that move data from source to destination in smaller chunks: the data is divided into batches and transferred batch by batch. This helps in faster execution and also in memory management, and it is a popular technique in ETL processes where the data volume is generally very large.

Creating loops in PDI:
Let’s say you want to implement a for loop in PDI to send 10 lakh (one million) records in batches of 100. Looping is complicated in PDI because it can only be implemented in jobs, not in transformations: Kettle doesn’t allow loops in transformations.
So in order to implement loops in PDI, you need a main job which calls a subjob where the loop is designed, plus the transformation which contains the logic to send data from source to destination.

Below is the overall design of the subjob that implements the batch processing and the looping.

[Screenshot: overall subjob design]

 

NrBatches: This transformation prepares the batches: it counts the total number of rows in the source table and, based on that count, works out how many batches are needed. Kindly check the below screenshots for more details.

NrBatches Transformation design:

[Screenshot: NrBatches transformation design]

 

The Table input component is used to fetch the row count and the minimum value of the primary key (ID) from the source table. The Modified Java Script Value component is used to assign the batch size and to calculate the number of batches from the row count and the batch size. Kindly look at the below screenshot for a better understanding; a small SQL sketch of the same logic follows it.

[Screenshot: Modified Java Script Value step computing the number of batches]
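
A hedged sketch of what the Table input query and the batch calculation boil down to (the table name, column name and batch size of 1000 are assumptions):

-- Table input: row count and minimum primary key value
SELECT COUNT(*) AS batch_count, MIN(id) AS min_id
FROM source_table;

-- number of batches for a batch size of 1000 (done in the Modified Java Script Value step)
SELECT CEIL(COUNT(*) / 1000.0) AS no_of_batches
FROM source_table;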

 

The Set Variables component present in the NrBatches transformation is used to set the values of the batch size, the batch count and the number of batches, so that they can be used in the other transformations.

Table Exists: This step checks whether the destination table is present or not. If it is not present, an SQL statement is executed to create the table before moving on to the next step; if it exists, the job goes straight to the next step.

Set variables: The Set Variables component in the job is used to set the incrementer value (just like the i counter in a for loop). The initial value of the incrementer is set here. Kindly refer to the below screenshot for more clarification.

[Screenshot: Set Variables step in the job]

 

Simple Evaluation:

This step is the equivalent of a conditional statement: the iterator is checked against the success condition, much like if (i < 50) or while (i < 50). Here, the variable set in the previous step is checked against the total number of batches (calculated in the NrBatches transformation). Kindly check the below screenshot for more clarification.

[Screenshot: Simple Evaluation step configuration]

Transformation Step:

Inside the transformation, set the flags, i.e. the start and end flags, using a JavaScript component. Please see the below screenshot for more clarification on how to set the flags; a small SQL sketch of the resulting batch window follows it. Alternatively you can also use the Formula step or the Calculator step, which is a better option if you have a large amount of data.

[Screenshots: JavaScript step setting the start and end flags]
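
As a hedged illustration, the start and end flags simply define the ID window that the batch transformation reads, roughly like this (table, column and variable names are assumptions):

-- Table input inside the batch transformation, reading one batch per iteration
SELECT *
FROM source_table
WHERE id >= ${START_FLAG}
  AND id <  ${END_FLAG};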

Increment Step:

Now, since we are using a loop, there has to be an increment. Here we used a JavaScript step to increment the batch_no (see the Set variables step), which completes the looping mechanism. Here too you can use another transformation instead of the JavaScript step, which processes faster and consumes less memory. Please see the below screenshots for more understanding.

-Javascript Step:

[Screenshot: JavaScript increment step]

-Using Another Transformation:

[Screenshot: increment implemented as a separate transformation]

The formula step is used to increment the batch no. See the below screenshot for more clarification.

[Screenshot: Formula step configuration]

So this is how looping works in Kettle. But there are some limitations: this looping technique consumes a lot of memory if the number of iterations is very high. Use it only for problems where the number of loop executions is low; that will help keep the performance up.

Thanks,

Nitish