Merge Join vs. Stream Lookup in Pentaho DI
The Merge Join step joins two data sets coming from two Table Input steps. The following join types are available in this step:
- FULL OUTER: all rows from both sources will be included in the result, with empty values for non-matching keys in both data streams
- LEFT OUTER: all rows from the first source will be in the result, with empty values for non-matching keys in the second data stream
- RIGHT OUTER: all rows from the second source will be in the result, with empty values for non-matching keys in the first data stream
- INNER JOIN: only rows having the same key in both sources will be included in the result
Note: This step expects the incoming rows to be sorted on the specified key fields. Sorting each input with the Sort rows step satisfies this requirement.
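To see why sorted input matters, the join types above can be sketched as a plain sort-merge join. This is only an illustration in Python, not Kettle's actual implementation; the function name merge_join, the dict-based rows, and the sample field names are assumptions made for the example. Because both inputs are sorted on the key, a single forward pass over each stream is enough.

```python
def merge_join(left, right, key, join_type="INNER"):
    """Join two lists of dict rows that are ALREADY sorted on `key`.

    Returns (left_row, right_row) pairs; None stands for the empty
    values on the non-matching side of an outer join.
    """
    keep_left = join_type in ("LEFT OUTER", "FULL OUTER")
    keep_right = join_type in ("RIGHT OUTER", "FULL OUTER")
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            # Left key has no partner on the right.
            if keep_left:
                out.append((left[i], None))
            i += 1
        elif lk > rk:
            # Right key has no partner on the left.
            if keep_right:
                out.append((None, right[j]))
            j += 1
        else:
            # Find the run of equal keys on each side and pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][key] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            for l_row in left[i:i_end]:
                for r_row in right[j:j_end]:
                    out.append((l_row, r_row))
            i, j = i_end, j_end
    # Whatever remains on either side has no match.
    if keep_left:
        out.extend((l_row, None) for l_row in left[i:])
    if keep_right:
        out.extend((None, r_row) for r_row in right[j:])
    return out


# Hypothetical sample data, sorted on "id".
customers = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 4, "name": "d"}]
cities = [{"id": 2, "city": "x"}, {"id": 3, "city": "y"}, {"id": 4, "city": "z"}]

for kind in ("INNER", "LEFT OUTER", "RIGHT OUTER", "FULL OUTER"):
    print(kind, len(merge_join(customers, cities, "id", kind)))
```

If either input were unsorted, the comparison of the two current keys would skip over matches, which is why the step requires sorted rows.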
The Stream Lookup step lets you look up data using information coming from other steps in the transformation. The data coming from the source (lookup) step is first read into memory and is then used to look up values for the rows in the main stream.
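The behaviour described above can be sketched as an in-memory hash lookup. This is an illustrative Python sketch, not Kettle code; the name stream_lookup, its parameters, and the sample fields are assumptions. Unlike the merge join, neither input has to be sorted, but the whole lookup stream must fit in memory.

```python
def stream_lookup(main_rows, lookup_rows, key, value_field, default=None):
    """Enrich each main-stream row with a value from the lookup stream."""
    # Step 1: read the complete lookup stream into memory.
    # In this sketch a later duplicate key simply overwrites an earlier one.
    table = {}
    for row in lookup_rows:
        table[row[key]] = row[value_field]

    # Step 2: stream the main rows through and attach the looked-up value,
    # falling back to `default` when the key is not found.
    for row in main_rows:
        enriched = dict(row)
        enriched[value_field] = table.get(row[key], default)
        yield enriched


# Hypothetical sample data; note that neither list is sorted.
orders = [{"order": 10, "cust_id": 4}, {"order": 11, "cust_id": 1}, {"order": 12, "cust_id": 9}]
names = [{"cust_id": 1, "name": "a"}, {"cust_id": 4, "name": "d"}]

for row in stream_lookup(orders, names, "cust_id", "name", default="unknown"):
    print(row)
```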
Preserve memory: Encodes rows of data to preserve memory while sorting. (Technical background: Kettle will store the lookup data as raw bytes in a custom storage object that uses a hashcode of the bytes as the key. This costs more CPU for calculating the hashcode but needs less memory.)
Key and value are exactly one integer field: Preserves memory while executing a sort. Note: Works only when “Preserve memory” is checked. Cannot be combined with the “Use sorted list” option. (Technical background: The lookup data is stored in a custom storage object that is similar to the byte array hashmap, but it doesn’t have to convert to raw bytes. It just takes a hashcode of the long.)
Use sorted list: Enable to store values using a sorted list; this provides better memory usage when working with data sets containing wide rows. Note: Works only when “Preserve memory” is checked. Cannot be combined with the “Key and value are exactly one integer field” option. (Technical background: The lookup data is put into a tuple and stored in a sorted list. Lookups are done via a binary search.)
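The idea behind the “Use sorted list” option can be sketched as follows: keep (key, value) tuples in a list sorted by key and find entries by binary search instead of hashing. This is a minimal Python sketch of the general technique using the standard bisect module; the class name SortedLookup and the sample data are assumptions, and Kettle's internal storage object may differ in detail.

```python
import bisect


class SortedLookup:
    """Lookup table backed by a sorted list of (key, value) tuples."""

    def __init__(self, pairs):
        # Sort once on the key; no per-entry hash buckets are kept.
        self._entries = sorted(pairs, key=lambda pair: pair[0])
        self._keys = [k for k, _ in self._entries]

    def get(self, key, default=None):
        # Binary search: O(log n) comparisons per lookup.
        pos = bisect.bisect_left(self._keys, key)
        if pos < len(self._keys) and self._keys[pos] == key:
            return self._entries[pos][1]
        return default


# Hypothetical lookup data with "wide" values.
table = SortedLookup([(4, ("d", "Berlin", "DE")), (1, ("a", "Paris", "FR"))])
print(table.get(4))
print(table.get(9, "not found"))
```

The trade-off in this sketch is slower lookups (logarithmic instead of roughly constant time) in exchange for a more compact structure.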