ColdFusion Muse: Removing Duplicate Records in a Database

Removing Duplicate Records in a Database

Mark Kruger February 2, 2006 5:57 PM SQL tips Comments (10)

SQL provides many ways of grouping data. It also has many ways to select data. But if you've ever tried to remove duplicates from a database you might have needed 2 or 3 trips to the liquor store to figure it out. I've seen routines that match and compare and order and update and delete willy nilly - all just to find and fix duplicate rows in a database. I've never had to do this to any of my databases of course, because all my constraints and applications are perfect (gah!). Still, for those of you with imperfect databases I'm happy to report there is a magic bullet. Here's the big secret.

Self Joining

Unless you are Narcissus it's not something you think about often, but it is possible to join a table with itself. This trick can be used to weed out duplicates - as in the query below.

<cfquery name="getDups" datsource="#mydsn#">
   SELECT    distinct a.user_id
   FROM    users a, users b
   WHERE    a.user_id > b.user_id
   AND    a.email = b.email
   ORDER BY a.user_id
</cfquery>

This little beauty would return all the rows with the same email that had been entered after the first record. So if I had 3 records:

userid	email
100	myemail@myemail.com
135	myemail@myemail.com
210	myemail@myemail.com

It would return the 135 and the 210 row. It would NOT return the 100 row.

You can easily use this to your advantage. For example, you could delete the duplicates from the table.

<cfquery name="getDups" datsource="#mydsn#">
   DELETE FROM users
   WHERE   user_id IN
      (SELECT    distinct a.user_id
      FROM    users a, users b
      WHERE    a.user_id > b.user_id
      AND    a.email = b.email)
</cfquery>

That's pretty nifty. It seems so simple, now that I look at it, I wonder why I've never seen it or thought of it before. I guess I just haven't been looking hard enough. I'm sure my readers will be happy to inform me (ha).

Related Blog Entries

Duplicates and Sub Selects (November 16, 2006)

10 Comments

Posted By
antony | 2/2/06 4:47 PM

Very clever.
I think people (maybe just me) sometimes forget how powerful sql is. I'm sure most people (maybe just me) would take the cfquery result and do some sort of loop with lots of tests and delete statements to achieve the same result.
Posted By
mkruger | 2/2/06 5:00 PM

Antony, Yes, you and me both. I've done my share of mucking around with a utility script written in CF. Good for routines where there are a lot of variable possibilities, but this is a nice trick if it suites your purpose.
Posted By
PaulH | 2/2/06 6:44 PM

another approach would be to create an empty table duplicating the design of that table except to hang a unique index off the email column using the "ignore duplicate key" option. sql server will silently throw away the duplicates when you pump the rows from the original table to this new table.
Posted By
foxjazz | 2/3/06 12:11 AM

Well I have a suggestion about dup records.

I mean finding dups in a field is easy, but true dup records a little harder.
What you can do is use the sql hash function to create a hash value for every row in the database (listing the rows by number of course).
Then put that has value in its own table, search on dupes on the hash value, and walla, you have your dup records.

Just a thought,

Fox
Posted By
Michael | 7/10/08 6:43 PM

Oh my God I love you!
Posted By
Charles | 10/8/08 9:06 AM

This has been tremendously helpful. Thank YOU!
Posted By
katie A | 2/20/09 11:13 AM

this is AMAZING!!!!!
Posted By
Will Swain | 3/3/09 6:11 AM

This is great. Thanks very much. Made the process of deduping 90k records much quicker.
Posted By
Göransv | 1/3/10 4:21 PM

Sorry to say, but it does not work with mySQL server (openBD, mySQL).

Type   Database
Query Error   You can't specify target table 'payoutinfotype' for update in FROM clause
Datasource test
Native Error Code   1093
SQL State   HY000
Posted By
Argh | 8/29/12 7:43 PM

you mofo, i ran this script and it deleted all duplicates...instead of deleting the 2nd or 3rd repeated record it deleted everything including the 1st appearance of the dup....!@#$%^&

ColdFusion Muse

Removing Duplicate Records in a Database

Self Joining

Related Blog Entries

10 Comments

Posted By

Posted By

Posted By

Posted By

Posted By

Posted By

Posted By

Posted By

Posted By

Posted By

Recent Entries

Google Search

Sites to Watch

Other Guru Blogs

CF Webtools Blogs

Fan Pages & Support Groups

Subscribe

Recent Comments

Archives By Subject

Twitter

Facebook

Aggregators

RSS

inquiries