ColdFusion Muse

Removing Duplicate Records in a Database

Mark Kruger February 2, 2006 5:57 PM SQL tips Comments (10)

SQL provides many ways of grouping data. It also has many ways to select data. But if you've ever tried to remove duplicates from a database you might have needed 2 or 3 trips to the liquor store to figure it out. I've seen routines that match and compare and order and update and delete willy nilly - all just to find and fix duplicate rows in a database. I've never had to do this to any of my databases of course, because all my constraints and applications are perfect (gah!). Still, for those of you with imperfect databases I'm happy to report there is a magic bullet. Here's the big secret.

Self Joining

Unless you are Narcissus it's not something you think about often, but it is possible to join a table with itself. This trick can be used to weed out duplicates - as in the query below.

<cfquery name="getDups" datsource="#mydsn#">
   SELECT    distinct a.user_id
   FROM    users a, users b
   WHERE    a.user_id > b.user_id
   AND    a.email = b.email
   ORDER BY a.user_id
</cfquery>
This little beauty would return all the rows with the same email that had been entered after the first record. So if I had 3 records:
useridemail
100myemail@myemail.com
135myemail@myemail.com
210myemail@myemail.com
It would return the 135 and the 210 row. It would NOT return the 100 row.

You can easily use this to your advantage. For example, you could delete the duplicates from the table.

<cfquery name="getDups" datsource="#mydsn#">
   DELETE FROM users
   WHERE   user_id IN
      (SELECT    distinct a.user_id
      FROM    users a, users b
      WHERE    a.user_id > b.user_id
      AND    a.email = b.email)
</cfquery>
That's pretty nifty. It seems so simple, now that I look at it, I wonder why I've never seen it or thought of it before. I guess I just haven't been looking hard enough. I'm sure my readers will be happy to inform me (ha).

  • Share:

Related Blog Entries

10 Comments

  • antony's Gravatar
    Posted By
    antony | 2/2/06 4:47 PM
    Very clever.
    I think people (maybe just me) sometimes forget how powerful sql is. I'm sure most people (maybe just me) would take the cfquery result and do some sort of loop with lots of tests and delete statements to achieve the same result.
  • mkruger's Gravatar
    Posted By
    mkruger | 2/2/06 5:00 PM
    Antony, Yes, you and me both. I've done my share of mucking around with a utility script written in CF. Good for routines where there are a lot of variable possibilities, but this is a nice trick if it suites your purpose.
  • PaulH's Gravatar
    Posted By
    PaulH | 2/2/06 6:44 PM
    another approach would be to create an empty table duplicating the design of that table except to hang a unique index off the email column using the "ignore duplicate key" option. sql server will silently throw away the duplicates when you pump the rows from the original table to this new table.
  • foxjazz's Gravatar
    Posted By
    foxjazz | 2/3/06 12:11 AM
    Well I have a suggestion about dup records.

    I mean finding dups in a field is easy, but true dup records a little harder.
    What you can do is use the sql hash function to create a hash value for every row in the database (listing the rows by number of course).
    Then put that has value in its own table, search on dupes on the hash value, and walla, you have your dup records.

    Just a thought,

    Fox
  • Michael's Gravatar
    Posted By
    Michael | 7/10/08 6:43 PM
    Oh my God I love you!
  • Charles's Gravatar
    Posted By
    Charles | 10/8/08 9:06 AM
    This has been tremendously helpful. Thank YOU!
  • katie A's Gravatar
    Posted By
    katie A | 2/20/09 11:13 AM
    this is AMAZING!!!!!
  • Will Swain's Gravatar
    Posted By
    Will Swain | 3/3/09 6:11 AM
    This is great. Thanks very much. Made the process of deduping 90k records much quicker.
  • Göransv's Gravatar
    Posted By
    Göransv | 1/3/10 4:21 PM
    Sorry to say, but it does not work with mySQL server (openBD, mySQL).

    Type   Database
    Query Error   You can't specify target table 'payoutinfotype' for update in FROM clause
    Datasource test
    Native Error Code   1093
    SQL State   HY000
  • Argh's Gravatar
    Posted By
    Argh | 8/29/12 7:43 PM
    you mofo, i ran this script and it deleted all duplicates...instead of deleting the 2nd or 3rd repeated record it deleted everything including the 1st appearance of the dup....!@#$%^&