ColdFusion Muse

Removing Duplicate Records in a Database

SQL provides many ways of grouping data. It also has many ways to select data. But if you've ever tried to remove duplicates from a database you might have needed 2 or 3 trips to the liquor store to figure it out. I've seen routines that match and compare and order and update and delete willy nilly - all just to find and fix duplicate rows in a database. I've never had to do this to any of my databases of course, because all my constraints and applications are perfect (gah!). Still, for those of you with imperfect databases I'm happy to report there is a magic bullet. Here's the big secret.

Self Joining

Unless you are Narcissus it's not something you think about often, but it is possible to join a table with itself. This trick can be used to weed out duplicates - as in the query below.

<cfquery name="getDups" datsource="#mydsn#">
   SELECT    distinct a.user_id
   FROM    users a, users b
   WHERE    a.user_id > b.user_id
   AND    a.email = b.email
   ORDER BY a.user_id
</cfquery>
This little beauty would return all the rows with the same email that had been entered after the first record. So if I had 3 records:

   user_id   email
   100       myemail@myemail.com
   135       myemail@myemail.com
   210       myemail@myemail.com

It would return the 135 and 210 rows. It would NOT return the 100 row.

You can easily use this to your advantage. For example, you could delete the duplicates from the table.

<cfquery name="getDups" datsource="#mydsn#">
   DELETE FROM users
   WHERE   user_id IN
      (SELECT    distinct a.user_id
      FROM    users a, users b
      WHERE    a.user_id > b.user_id
      AND    a.email = b.email)
</cfquery>
That's pretty nifty. It seems so simple now that I look at it that I wonder why I've never seen it or thought of it before. I guess I just haven't been looking hard enough. I'm sure my readers will be happy to inform me (ha).

Comments
Very clever.
I think people (maybe just me) sometimes forget how powerful sql is. I'm sure most people (maybe just me) would take the cfquery result and do some sort of loop with lots of tests and delete statements to achieve the same result.
# Posted By antony | 2/2/06 4:47 PM
Antony, yes, you and me both. I've done my share of mucking around with a utility script written in CF. Good for routines where there are a lot of variable possibilities, but this is a nice trick if it suits your purpose.
# Posted By mkruger | 2/2/06 5:00 PM
Another approach would be to create an empty table duplicating the design of that table except to hang a unique index off the email column using the "ignore duplicate key" option. SQL Server will silently throw away the duplicates when you pump the rows from the original table to this new table.
# Posted By PaulH | 2/2/06 6:44 PM
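For SQL Server, here is a minimal sketch of the approach PaulH describes; the staging table name (users_clean), the index name, and the column list are assumptions for illustration:

   -- Copy the table's structure only (no rows) into a new staging table
   SELECT user_id, email
   INTO users_clean
   FROM users
   WHERE 1 = 0;

   -- Unique index that silently discards duplicate keys instead of raising an error
   CREATE UNIQUE INDEX ix_users_clean_email
   ON users_clean (email)
   WITH (IGNORE_DUP_KEY = ON);

   -- Pump the rows across; only one row per email survives
   INSERT INTO users_clean (user_id, email)
   SELECT user_id, email
   FROM users;

From there you could truncate the original table and copy the de-duplicated rows back, or simply rename the tables.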
Well I have a suggestion about dup records.

I mean finding dups in a field is easy, but true dup records are a little harder.
What you can do is use the SQL hash function to create a hash value for every row in the database (listing the rows by number of course).
Then put that hash value in its own table, search for dupes on the hash value, and voilà, you have your dup records.

Just a thought,

Fox
# Posted By foxjazz | 2/3/06 12:11 AM
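A rough sketch of foxjazz's idea on SQL Server, using HASHBYTES to fingerprint each row; the hash table name (user_hashes) and the non-key columns being hashed are assumptions, and the CONCAT call assumes SQL Server 2012 or later:

   -- Fingerprint every row on its non-key columns
   SELECT user_id,
          HASHBYTES('SHA1', CONCAT(first_name, '|', last_name, '|', email)) AS row_hash
   INTO   user_hashes
   FROM   users;

   -- Any hash that appears more than once marks a set of true duplicate records
   SELECT row_hash, COUNT(*) AS dup_count
   FROM   user_hashes
   GROUP BY row_hash
   HAVING COUNT(*) > 1;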
Oh my God I love you!
# Posted By Michael | 7/10/08 6:43 PM
This has been tremendously helpful. Thank YOU!
# Posted By Charles | 10/8/08 9:06 AM
This is AMAZING!!!!!
# Posted By katie A | 2/20/09 11:13 AM
This is great. Thanks very much. Made the process of deduping 90k records much quicker.
# Posted By Will Swain | 3/3/09 6:11 AM
Sorry to say, but it does not work with MySQL server (openBD, MySQL).

Type: Database
Query Error: You can't specify target table 'payoutinfotype' for update in FROM clause
Datasource: test
Native Error Code: 1093
SQL State: HY000
# Posted By Göransv | 1/3/10 4:21 PM
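MySQL throws error 1093 because it refuses to delete from a table that the delete's own subquery also reads. A common workaround, sketched here against the users/email example from the post rather than Göransv's payoutinfotype table, is to wrap the subquery in a derived table so MySQL materializes it first:

   DELETE FROM users
   WHERE user_id IN (
      SELECT user_id FROM (
         SELECT DISTINCT a.user_id
         FROM users a, users b
         WHERE a.user_id > b.user_id
         AND a.email = b.email
      ) AS dups
   );

A multi-table delete (DELETE a FROM users a JOIN users b ON a.email = b.email AND a.user_id > b.user_id) should also avoid the error.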
You mofo, I ran this script and it deleted all duplicates... instead of deleting the 2nd or 3rd repeated record it deleted everything including the 1st appearance of the dup....!@#$%^&
# Posted By Argh | 8/29/12 7:43 PM



Blog provided and hosted by CF Webtools. Blog Software by Ray Camden.