ColdFusion Muse: Eliminating Duplicates In a Query When Distinct is not an Option

Eliminating Duplicates In a Query When Distinct is not an Option

Mark Kruger October 27, 2006 6:28 PM MS SQL Server Comments (7)

Here's a problem perhaps you have had. You want to select unique email addresses, first names and last names out of a database for a newsletter or to sell them a new mortgage or whatever. Being the nice guy that you are you don't want to send them multiple messages, so you want to eliminate duplicates, right? the problem is that SELECT DISTINCT... doesn't always work in this instance. For example, John Doe put his information in as John Doe in one case and John H. Doe in another. Selecting distinct for name and email will give you a duplicate name with the same email. Now obviously you could solve this problem in your Coldfusion code - but wouldn't it be nice to fix up the query?

This is exactly the problem we ran into recently. We had people signing up for "seminars" for a client and we would end up with duplications from folks who had signed up more than once. The fix? Join the table to itself, group by and use HAVING - like so:

<cfquery name="getEmails" datasource="#mydsn#">
    SELECT su.userid, su.seminarid as id, su.fname, su.lname, su.email
    FROM seminarusers su
    INNER JOIN seminarusers su2 ON su.uuid = su2.uuid
    WHERE su.seminarid IN (5,6)
    AND su2.seminarid IN (5,6)
    GROUP BY su.userid, su.seminarid, su.fname, su.lname, su.email
    HAVING su.userid = Max(su2.userid)
</cfquery>

In this example we joined the table to itself, grouped by the userid and then used having to get only the latest record. Very nifty! Thanks to CF Webtools developer Ryan Stille for the neat trick. One thing to note, the where clause has to apply to BOTH of the table aliases or you will get an unexpected result - so we did WHERE su.seminarid IN (5,6) AND su2.seminarid IN (5,6). You could just as easily remove duplicates this way (see my post on Updating a table using a Join).

Related Blog Entries

Leveraging Your SQL - Update Using a Join (July 15, 2005)

7 Comments

Posted By
Ben Nadel | 10/27/06 4:53 PM

Niiiice :)
Posted By
Sami Hoda | 10/27/06 6:06 PM

Pretty slick... Any performance considerations?
Posted By
Brian Kotek | 10/27/06 7:37 PM

You can also do this by doing an inline view that just gets the distinct userID and email, and then join the results of the inline view to a the table to get the user's name.
Posted By
Brian Kotek | 10/27/06 7:39 PM

Also, I must point out that if you made userID unique you wouldn't have to worry about duplicates. However, there are other uses for this same approach. For example, if a field you are selecting on is a text field, in SQL Server you can't include it in a query using distinct. You can get around that by selecting distinct on everything else in an inline view, then joining again to the table based on the results of the inline view to get the text field.
Posted By
mkruger | 10/28/06 8:38 AM

Brian,

Thanks - nice tip on the inline view. I think I have a blog post on that somewhere as well. As for making the userID unique... it IS unique. The problem is legacy data where the first and lastname may NOT be unique resulting in what I guess I should call "psuedo" duplicates - where the user is actually the same person, but "some" of the data doesn't match.
Posted By
Rick O | 10/29/06 12:17 AM

Why not just cheat?

SELECT
MAX(su.userid) AS userid,
MAX(su.fname) AS fname,
MAX(su.seminarid) AS id,
su.email
FROM seminarusers AS su
GROUP BY su.email

Sure, you may get a mix-and-match of first names and last names (Adam Zellar + Betsy Ross = Betsy Zellar) ... but does it matter? You're not going to have two people with two different last names sign up with the same email address. Right?
Posted By
Ryan | 10/30/06 4:13 PM

Rick - that works, if you don't need the precision. As you indicated, you'll end up with mismatched userids and names, which could come back to bite you later, depending on what you are doing with the results of this query. In our case, in the case of duplicates we wanted to have the most recent registrant.

ColdFusion Muse

Eliminating Duplicates In a Query When Distinct is not an Option

Related Blog Entries

7 Comments

Posted By

Posted By

Posted By

Posted By

Posted By

Posted By

Posted By

Recent Entries

Google Search

Sites to Watch

Other Guru Blogs

CF Webtools Blogs

Fan Pages & Support Groups

Subscribe

Recent Comments

Archives By Subject

Twitter

Facebook

Aggregators

RSS

inquiries