ColdFusion Muse

Eliminating Duplicates In a Query When Distinct is not an Option

Mark Kruger October 27, 2006 6:28 PM MS SQL Server Comments (7)

Here's a problem perhaps you have had. You want to select unique email addresses, first names and last names out of a database for a newsletter or to sell them a new mortgage or whatever. Being the nice guy that you are you don't want to send them multiple messages, so you want to eliminate duplicates, right? the problem is that SELECT DISTINCT... doesn't always work in this instance. For example, John Doe put his information in as John Doe in one case and John H. Doe in another. Selecting distinct for name and email will give you a duplicate name with the same email. Now obviously you could solve this problem in your Coldfusion code - but wouldn't it be nice to fix up the query?

This is exactly the problem we ran into recently. We had people signing up for "seminars" for a client and we would end up with duplications from folks who had signed up more than once. The fix? Join the table to itself, group by and use HAVING - like so:

<cfquery name="getEmails" datasource="#mydsn#">
    SELECT su.userid, su.seminarid as id, su.fname, su.lname, su.email
    FROM seminarusers su
    INNER JOIN seminarusers su2 ON su.uuid = su2.uuid
    WHERE su.seminarid IN (5,6)
    AND su2.seminarid IN (5,6)
    GROUP BY su.userid, su.seminarid, su.fname, su.lname, su.email
    HAVING su.userid = Max(su2.userid)
</cfquery>
In this example we joined the table to itself, grouped by the userid and then used having to get only the latest record. Very nifty! Thanks to CF Webtools developer Ryan Stille for the neat trick. One thing to note, the where clause has to apply to BOTH of the table aliases or you will get an unexpected result - so we did WHERE su.seminarid IN (5,6) AND su2.seminarid IN (5,6). You could just as easily remove duplicates this way (see my post on Updating a table using a Join).

  • Share:

Related Blog Entries

7 Comments

  • Ben Nadel's Gravatar
    Posted By
    Ben Nadel | 10/27/06 4:53 PM
    Niiiice :)
  • Sami Hoda's Gravatar
    Posted By
    Sami Hoda | 10/27/06 6:06 PM
    Pretty slick... Any performance considerations?
  • Brian Kotek's Gravatar
    Posted By
    Brian Kotek | 10/27/06 7:37 PM
    You can also do this by doing an inline view that just gets the distinct userID and email, and then join the results of the inline view to a the table to get the user's name.
  • Brian Kotek's Gravatar
    Posted By
    Brian Kotek | 10/27/06 7:39 PM
    Also, I must point out that if you made userID unique you wouldn't have to worry about duplicates. However, there are other uses for this same approach. For example, if a field you are selecting on is a text field, in SQL Server you can't include it in a query using distinct. You can get around that by selecting distinct on everything else in an inline view, then joining again to the table based on the results of the inline view to get the text field.
  • mkruger's Gravatar
    Posted By
    mkruger | 10/28/06 8:38 AM
    Brian,

    Thanks - nice tip on the inline view. I think I have a blog post on that somewhere as well. As for making the userID unique... it IS unique. The problem is legacy data where the first and lastname may NOT be unique resulting in what I guess I should call "psuedo" duplicates - where the user is actually the same person, but "some" of the data doesn't match.
  • Rick O's Gravatar
    Posted By
    Rick O | 10/29/06 12:17 AM
    Why not just cheat?

    SELECT
    MAX(su.userid) AS userid,
    MAX(su.fname) AS fname,
    MAX(su.seminarid) AS id,
    su.email
    FROM seminarusers AS su
    GROUP BY su.email

    Sure, you may get a mix-and-match of first names and last names (Adam Zellar + Betsy Ross = Betsy Zellar) ... but does it matter? You're not going to have two people with two different last names sign up with the same email address. Right?
  • Ryan's Gravatar
    Posted By
    Ryan | 10/30/06 4:13 PM
    Rick - that works, if you don't need the precision. As you indicated, you'll end up with mismatched userids and names, which could come back to bite you later, depending on what you are doing with the results of this query. In our case, in the case of duplicates we wanted to have the most recent registrant.