Multi valued column processing using Talend

Introduction

Now days, source data files which are exported from the different database structures are having multiple fields in single column. These fields need to be extracted and processed independently and need to be identified with the key column for corresponding row. This specific scenario can’t be handled independently without the use of ETL tool required in data migration activities. Talend provides a component called as tNormalize. This component normalizes input flow by following SQL standard.  In order to understand the business use case and its importance let us take an example of Patient records in hospital.

Sample Data

Following is the sample data extract in which last column is Percentile having two columns separated by “|” pipe operator. These columns (Percentile_1~Percentile_2|Percentile _1~Percentile_2|Percentile_1~ Percentile_2) need to be extracted to another table with respect to PlanID. This combination could be repeated for n number of times.

Measurement Year,Payer,Plan ID,Plan Name,Domain,Sub Domain,Measure Description,Measure,Method of Data Collection,Eligible Population,Denominator,Numerator,Rate,Level of Significance,Percentile

2012,Child Health Plus, 1080382,Statewide Average,Child and Adolescent Health,Caring for Children and Adolescents with Illnesses,Follow-Up Care for Children Prescribed ADHD Medication:Continuation Phase,ADHD-Continuation,Administrative Review,633,633,355,56,,63~63|53~47

2012,Child Health Plus,1080383,MVP,Child and Adolescent Health,Caring for Children and Adolescents with Illnesses,Follow-Up Care for Children Prescribed ADHD Medication:Continuation Phase,ADHD-Continuation,Administrative Review,,,,,Sample size too small to report,~|~

2012,Child Health Plus,1090384,CDPHP,Child and Adolescent Health,Caring for Children and Adolescents with Illnesses,Follow-Up Care for Children Prescribed ADHD Medication:Continuation Phase,ADHD-Continuation,Administrative Review,54,54,32,59,,61~61|52~48

Job Design

In above data PlanID is a unique primary key in addition to couple of other columns… when extracted to another table we would be needing all the percentile for each PlanID. This problem could be solved by using the tNormalize component. Following is the job flow design to extract PlanID and multi-valued Percentile column

Multi valued column talend job

Multi valued column talend job

Pipe separator configuration

Pipe separator configuration

Separator needs to be escaped by “\\” for pipe operator or having similar special character. After executing the job, you will get records as follows

Expected output

PlanID, Percentile_1,Percentile_2

1080382,63,63

1080382,53,47

1080283,,

1080283,,

1080384,61,61

1080384,52,48

– Vaibhav Waghmare

Leave a Reply