Remove Duplicates Stage Example

The Remove Duplicates stage is a processing stage. It can have a single input link and a single output link. It takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set. It also gives more control over how duplicates are removed: for example, we can choose whether to retain the first or the last row of each set of duplicates.

Duplicates can also be removed with the Sort stage, by using the Unique option that the stage provides.

However, that approach has the limitations below, which is where the Remove Duplicates stage comes into the picture.

      · No choice on which duplicate to keep

      · Stable sort always retains the first row in the group

      · Non-stable sort is indeterminate
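The stable-sort limitation above can be illustrated outside DataStage. The following is a minimal Python sketch, not DataStage code: the function name sort_unique and the sample rows are hypothetical and are not taken from employee.txt. It relies on the fact that Python's sorted() is stable, so keeping one row per key after sorting always keeps the row that came first in the input.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of (employee_id, label); illustrative only.
rows = [(2, "b-first"), (1, "a-first"), (2, "b-second"), (1, "a-second")]

def sort_unique(rows, key_index=0):
    """Stable sort on the key, then keep the first row of each key group."""
    key = itemgetter(key_index)
    ordered = sorted(rows, key=key)  # sorted() is stable: ties keep input order
    # groupby yields one group per run of equal keys; take its first row.
    return [next(group) for _, group in groupby(ordered, key=key)]

result = sort_unique(rows)
print(result)
```

With this approach there is no way to ask for the last row of each group instead, which mirrors the first limitation in the list.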

Example:

In the example below, the input is employee.txt, which contains employee IDs. We apply the Remove Duplicates stage on the employee ID to get unique employees in the target.

Design the job as shown below:

r1 - Remove Duplicates Stage Example

Input Data:

r2 - Remove Duplicates Stage Example

Open the Properties window of the Remove Duplicates stage by double-clicking it, or by right-clicking it → selecting Properties from the drop-down menu. Select the key column on which duplicates are to be removed.

We can also choose whether to keep the first or the last duplicate with the ‘Duplicate to retain’ option, as below.

r3 - Remove Duplicates Stage Example
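As a rough illustration of what the key column and ‘Duplicate to retain’ settings mean, here is a minimal Python sketch. It is not how DataStage implements the stage; the function remove_duplicates, its retain parameter, and the sample rows are hypothetical names chosen for this example. Like the stage, it assumes the input is already sorted on the key.

```python
from itertools import groupby

def remove_duplicates(sorted_rows, key, retain="first"):
    """Keep one row per key from input that is already sorted on that key.

    retain="first" keeps the first row of each group, "last" keeps the last.
    """
    if retain not in ("first", "last"):
        raise ValueError("retain must be 'first' or 'last'")
    output = []
    for _, group in groupby(sorted_rows, key=key):
        group_rows = list(group)
        output.append(group_rows[0] if retain == "first" else group_rows[-1])
    return output

# Hypothetical sample rows, already sorted on employee_id.
rows = [
    {"employee_id": 101, "note": "first 101"},
    {"employee_id": 101, "note": "second 101"},
    {"employee_id": 102, "note": "only 102"},
]

def by_id(row):
    return row["employee_id"]

kept_first = remove_duplicates(rows, key=by_id, retain="first")
kept_last = remove_duplicates(rows, key=by_id, retain="last")
print([r["note"] for r in kept_first])
print([r["note"] for r in kept_last])
```

Both calls return one row per employee_id; they differ only in which row of a duplicate group survives.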

Under the Output → Mapping tab, drag the columns from the input and drop them on the output, as in the snapshot below, to map the input columns to the output.

r4 - Remove Duplicates Stage Example
Save, compile, and run the job. After a successful run, we can see that 6 of the 11 input records have been written to the target data set.
r5 - Remove Duplicates Stage Example

Output Data:

Here we can see that the output contains employee data with a unique employeeID in each row.

r6 - Remove Duplicates Stage Example