What is an UPSERT?
UPSERT combines two operations on a table, UPDATE and INSERT, driven by a unique key (unique ID). Relational databases typically expose it through a MERGE statement (or a vendor-specific equivalent): if a row with the same unique ID already exists and its data has changed, the row is updated; if no such row exists, the new data is inserted.
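As a minimal sketch of the idea, the snippet below uses Python's built-in sqlite3 module and SQLite's `INSERT ... ON CONFLICT ... DO UPDATE` syntax, which is one vendor-specific form of UPSERT. The `customers` table, its columns, and the sample rows are made up for illustration, and other databases spell the statement differently.

```python
import sqlite3


def upsert_customers(conn, rows):
    """Insert each (id, name) pair, or update the name if the id already exists."""
    conn.executemany(
        """
        INSERT INTO customers (id, name) VALUES (?, ?)
        ON CONFLICT(id) DO UPDATE SET name = excluded.name
        """,
        rows,
    )
    conn.commit()


# Hypothetical target table with one existing row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Alice')")

# id 1 already exists (update); id 2 is new (insert).
upsert_customers(conn, [(1, "Alicia"), (2, "Bob")])

result = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(result)
```

Here the primary key `id` plays the role of the unique ID: the conflict on `id` is what turns the insert into an update.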
How does UPSERT work in ETL?
When loading data from a source to a target, two approaches are commonly used:
- Truncate and load: Flush out the old data and reload the new data.
- Insert/Update (Upsert): Update existing records and insert new ones. The load matches incoming rows against the target on a key or a set of keys: if the keys don't match any existing row, the row is inserted; otherwise the existing row is updated. Slowly changing dimensions (SCD) work in a similar way (SCD Type 1 is an UPSERT, whereas SCD Type 2 is an INSERT).
Truncate and load is not recommended when the source holds only current data but the target must also keep historical information, because the reload wipes out everything that is no longer in the source. In that case, we use the UPSERT mechanism.
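To make the difference concrete, here is a small sketch that models the target as a Python dictionary keyed by the unique ID. The order numbers and statuses are hypothetical, and a real load would of course run against database tables rather than dictionaries.

```python
def truncate_and_load(target, source):
    """Discard everything in the target and reload it from the source."""
    target.clear()
    target.update(source)


def upsert(target, source):
    """Update rows whose key exists, insert the rest, leave other rows alone."""
    for key, value in source.items():
        target[key] = value


# Hypothetical data: the source carries only current rows, so order 101 is gone.
source = {102: "shipped", 103: "new"}

target_a = {101: "delivered", 102: "pending"}
truncate_and_load(target_a, source)

target_b = {101: "delivered", 102: "pending"}
upsert(target_b, source)

print(target_a)  # order 101 is lost
print(target_b)  # order 101 is kept, 102 is updated, 103 is inserted
```

With truncate and load, the historical order disappears from the target as soon as it disappears from the source. With UPSERT it stays.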
UPSERT in Pentaho Data Integration (PDI):
PDI provides several steps for implementing UPSERT. These steps use the unique key to look up rows in the target table and perform the appropriate operation depending on the result. The following steps are used for UPSERT:
- Insert/Update (inserts when there is no match and updates when there is a match)
- Dimension Lookup/Update (used for SCD operations)
- Synchronize after Merge (this step needs a flag field on each row; based on the flag value, it inserts or updates)
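The flag-driven behaviour of the last step can be sketched roughly as follows. This is plain Python with sqlite3, not PDI code. The flag values "new", "changed", "deleted" and "identical" are assumed here to mirror what a preceding Merge Rows (diff) step would produce, and the `target` table and rows are invented for the example.

```python
import sqlite3


def synchronize(conn, flagged_rows):
    """Apply each (id, name, flag) row to the target according to its flag."""
    for row_id, name, flag in flagged_rows:
        if flag == "new":
            conn.execute("INSERT INTO target (id, name) VALUES (?, ?)", (row_id, name))
        elif flag == "changed":
            conn.execute("UPDATE target SET name = ? WHERE id = ?", (name, row_id))
        elif flag == "deleted":
            conn.execute("DELETE FROM target WHERE id = ?", (row_id,))
        # Rows flagged "identical" need no action.
    conn.commit()


# Hypothetical target table and flagged input rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "Alice"), (2, "Bob"), (3, "Carol")])

synchronize(conn, [
    (1, "Alice", "identical"),
    (2, "Robert", "changed"),
    (3, "Carol", "deleted"),
    (4, "Dave", "new"),
])

result = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
print(result)
```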
The basic flow of an UPSERT in PDI is as follows:
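A minimal sketch of this lookup-then-write flow, written in Python with sqlite3 rather than as a PDI transformation. The `target` table, the `insert_update` function name, and the sample rows are assumptions made for illustration; the point is that every incoming row costs one lookup before anything is written.

```python
import sqlite3


def insert_update(conn, rows):
    """Look up each (id, name) row by key, then insert, update, or skip it."""
    stats = {"inserted": 0, "updated": 0, "unchanged": 0}
    for row_id, name in rows:
        # One lookup per incoming row, using the unique key.
        found = conn.execute(
            "SELECT name FROM target WHERE id = ?", (row_id,)
        ).fetchone()
        if found is None:
            # No match on the key: insert the new row.
            conn.execute("INSERT INTO target (id, name) VALUES (?, ?)", (row_id, name))
            stats["inserted"] += 1
        elif found[0] != name:
            # Key matches but the data differs: update the existing row.
            conn.execute("UPDATE target SET name = ? WHERE id = ?", (name, row_id))
            stats["updated"] += 1
        else:
            # Key matches and nothing changed: leave the row alone.
            stats["unchanged"] += 1
    conn.commit()
    return stats


# Hypothetical target table and incoming rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

stats = insert_update(conn, [(1, "Alice"), (2, "Robert"), (3, "Carol")])
print(stats)
```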
Disadvantages of the above steps:
The steps that ship with PDI, and the algorithms they use to implement UPSERT, are very slow when handling large amounts of data.
- The Insert/Update step does a lookup against the target table for every incoming row and compares the values one by one to decide whether to insert or update. With a large amount of data, the step runs very slowly and slows down the whole process.
- The same applies to the Dimension Lookup/Update and Synchronize after Merge steps. Because they also check the incoming rows against the target table row by row, the process slows down.
Alternative approach for the UPSERT technique:
In some scenarios the ETL runs on a regular schedule, say every 12 hours. In that case, the steps listed above do not help: they are slow, and with huge amounts of data each run takes too long. So the load needs to be optimized, and optimizing an ETL process is one of the most important tasks we undertake, since it makes processing faster. The following diagram illustrates the alternative approach to the UPSERT mechanism:
The steps shown in the above diagram work just like the Insert/Update step, with some minor tweaks. Each step is described below:
- Table Input: Reads the source data that needs to be loaded into the target.
- Add Sequence: Adds a sequence number that serves as a business key for the table (this can change according to your requirements).
- Update: This step succeeds only when there is an existing row to update. For new data it fails, because no row in the target matches the key.
- Table Output: PDI supports error handling on each step. The error-handling hop of the Update step routes the failed rows to the Table Output step, which INSERTs the new data into the target table.
Please see the diagram below for a better understanding:
This approach is faster than the usual Insert/Update step because it does not need a separate lookup to detect new data. It attempts the update directly and inserts the row when the Update step fails.
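The same update-first idea can be sketched outside PDI in Python with sqlite3. This is an illustration under assumptions, not the transformation itself: a `rowcount` of zero after the UPDATE stands in for the Update step failing, and the fallback INSERT stands in for the error-handling hop to Table Output. The `target` table and the sample rows are hypothetical.

```python
import sqlite3


def update_then_insert(conn, rows):
    """Try to update each (id, name) row; insert it only if the update hit nothing."""
    stats = {"updated": 0, "inserted": 0}
    for row_id, name in rows:
        cur = conn.execute("UPDATE target SET name = ? WHERE id = ?", (name, row_id))
        if cur.rowcount == 0:
            # No row matched the key: treat this as the failed update
            # and route the row to the insert path.
            conn.execute("INSERT INTO target (id, name) VALUES (?, ?)", (row_id, name))
            stats["inserted"] += 1
        else:
            stats["updated"] += 1
    conn.commit()
    return stats


# Hypothetical target table and incoming rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

stats = update_then_insert(conn, [(1, "Alice"), (2, "Robert"), (3, "Carol")])
result = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
print(stats)
print(result)
```

Note that no SELECT is issued per row here: existing rows cost a single UPDATE, and only new rows cost a second statement.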
Try it out and share your feedback! 🙂