Remove Duplicates Stage Example
The Remove Duplicates stage is a processing stage. It can have a single input link and a single output link. The stage takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set. It offers more control over how duplicates are removed: for example, the ‘Duplicate to retain’ option lets us choose whether the First or the Last row of each group of duplicates is kept.
Duplicates can also be removed with the Sort stage, by using the Unique option provided in that stage.
However, that approach has the limitations below, which is where the Remove Duplicates stage comes into the picture.
· No choice on which duplicate to keep
· Stable sort always retains the first row in the group
· Non stable sort is indeterminate
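The stable-sort limitation above can be imitated outside DataStage. The following is a minimal Python sketch, not DataStage code; the rows and the function name sort_unique are hypothetical and are not taken from employee.txt. Python's sorted() is a stable sort, so rows with equal keys keep their input order, and a unique filter applied afterwards always keeps the row that appeared first in the input.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of (employee ID, label); illustrative data only.
rows = [
    (102, "B-first"),
    (101, "A-first"),
    (102, "B-second"),
    (101, "A-second"),
]

def sort_unique(records, key_index=0):
    """Stable sort on the key column, then keep the first row of each key group."""
    key = itemgetter(key_index)
    ordered = sorted(records, key=key)  # sorted() is guaranteed to be stable
    return [next(group) for _, group in groupby(ordered, key=key)]

unique_rows = sort_unique(rows)
print(unique_rows)
```

With this approach there is no way to ask for the second row of a group without changing the sort itself, which is the gap the Remove Duplicates stage fills.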
In the example below, the input is employee.txt, which contains an employee ID column. We apply the Remove Duplicates stage on this column to get unique employees in the target.
Design the job as shown below:
Open the Properties window of the Remove Duplicates stage by double-clicking it, or right-click → select Properties from the drop-down menu. Select the key column on which duplicates are to be removed.
We can also choose whether to keep the Last or the First duplicate with the ‘Duplicate to retain’ option, as below.
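The effect of the First and Last settings can be sketched in a few lines of Python. This is only an illustration of the behaviour described here, not how DataStage implements the stage; the function remove_duplicates, its retain parameter, and the sample rows are all hypothetical. Like the stage, the sketch assumes its input is already sorted on the key.

```python
from itertools import groupby

def remove_duplicates(sorted_rows, key, retain="first"):
    """Keep one row per key from input that is already sorted on that key."""
    if retain not in ("first", "last"):
        raise ValueError("retain must be 'first' or 'last'")
    result = []
    for _, group in groupby(sorted_rows, key=key):
        members = list(group)
        # Pick the first or the last row of each group of duplicates.
        result.append(members[0] if retain == "first" else members[-1])
    return result

# Hypothetical employee rows, already sorted on emp_id.
employees = [
    {"emp_id": 1, "dept": "HR"},
    {"emp_id": 1, "dept": "Finance"},
    {"emp_id": 2, "dept": "IT"},
]

def by_id(row):
    return row["emp_id"]

first_kept = remove_duplicates(employees, by_id, retain="first")
last_kept = remove_duplicates(employees, by_id, retain="last")
print(first_kept)
print(last_kept)
```

For employee 1, the first call keeps the HR row and the second keeps the Finance row; employee 2 has no duplicate, so its row is returned either way.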
Under the Output → Mapping tab, drag and drop the columns from the input to the output, as in the snapshot below, to map the input columns to the output columns.
Here, we can see that the output contains employee data with unique employee IDs.