Wednesday, 19 September 2012

Selecting duplicates in a database table

Selecting duplicates in a database table can be accomplished in various ways and all these approaches involve various trade-offs with respect to efficiency and speed but before we start examining some of these approaches we need a table with duplicates so here is one. The Primary Key in this table is the ID and the Emp_Id is meant to be Unique and will be used for filtering duplicates.


Listing the number of times each record occurs with duplicate rows listed once.

(1) select *, count(*) from employee group by Emp_Id;



Listing the number of times each record occurs with a listing of the first row in the group.

(2) select *, count(*) from employee group by Emp_Id having count(*) > 1;



To list all duplicate rows, the following query would work;


(3) select   a.* from
    employee as a
inner join
    (select Emp_Id
     from employee
     group by Emp_Id
     having count(*) = 2) as b
  on a.Emp_Id = b.Emp_Id;

Now our table has a Tele_No column and some of these numbers are incorrect

Selecting the duplicates using a self-join.

In this case, we need to select all records that have the same Emp_Id but a different Tele_No. We can use a simple self-join with the Tele_No and the column as the distinguishing key to separate our records.
Option 1

(4) select e1.* from employee e1, employee e2 where e1.Emp_Id = e2.Emp_Id and e1.Tele_No != e2.Tele_No order by e1.Id asc;




The exact same result could also be obtained with a sub-query as follows:
Option 2

(5) select e1.* from
employee e1 where e1.Emp_Id IN (
select e2.Emp_Id from employee e2
where e1.Emp_Id = e2.Emp_Id
and e1.Tele_No != e2.Tele_No
order by e1.Id asc;

In Option 1 & 2, we have the condition e1.Tele_No != e2.Tele_No that prevents us from getting a cartesian product of the records. If such a condition is not available then just use the Primary Key column as follows:


(6) select e1.* from
employee e1 where e1.Emp_Id IN (
select e2.Emp_Id from employee e2
where e1.Emp_Id = e2.Emp_Id
and e1.ID <> e2.ID
)
order by e1.Id asc;


If you are wondering whether a join or a sub-query would work best, it would ofcourse depend upon your query optimizer and the size of your tables but generally Joins are the way to go.

So, we have 6 records where the Emp_Id are the same but the Tele_Nos are different. To view these records in a tabular format we could modify (6) as follows.

(7) select e1.ID, e1.Tele_No,e2.ID, e2.Tele_no
from
employee e1, employee e2
         where e1.Emp_Id = e2.Emp_Id
               and e1.Tele_No != e2.Tele_No
               and e1.Id < e2.id
        order by e1.Id asc;



The change in the condition comparing the Primary Keys in the query above e1.Id < e2.id prevents a cartesian product.

No comments: