ColdFusion Muse: Duplicates and Sub Selects

Duplicates and Sub Selects

Mark Kruger November 16, 2006 1:45 PM SQL tips Comments (8)

Here's a sticky problem. How do you build a query that gives you distinct records from one table based on multiple records from another table and order by a date found in the second table. In my real world example, I have project tracks in "Items" and comments or notes in "events". I want a distinct list of "items" that have been updated in the past 10 days. That seems easy right? Well... not as easy as it seems on the surface.

The Problem

Let's say in a project has been "commented on" ("events" table) several times during the 10 days. If I do something like:

<cfquery ....>
        SELECT DISTINCT I.itemid, i.title
        FROM items i join events e
        ON i.itemid = e.itemid
        ORDER by e.datecreated
    </cfquery>

The DB will complain that I cannot order by a column that is not in the SELECT list when using "DISTINCT". If I put the column in the select list I will get duplicates.

Real World Example

Let's a look at the problem a bit closer. Items contains the master record for a particular track. For example, Items are related to individual bugs, feature requests, work orders, or change requests. For example, the Items table might look like:

ItemId	Description

17	Wash Your Wife

22	Take out the dry cleaning

4	Pick up the dog

Events contains comments or updates mapped to the itemid. So events might look lik this:

ItemId	event				Date

17	She Didn't like it			11-15-2006

17	Are you sure about this		11-12-2006

22	My Suite didn't like desert		11-14-2006

4	I should have bought a smaller dog	11-11-2006

If we order by date we will end up with:

ItemId	event				Date

17	She Didn't like it			11-15-2006

22	My Suite didn't like desert		11-14-2006

17	Are you sure about this		11-12-2006

4	I should have bought a smaller dog	11-11-2006

...so any "select distinct" join that includes the date will end up showing

ItemId	Description			date

17	Wash Your Wife			11-15-2006

22	Take out the dry cleaning		11-14-2006

17	Wash Your Wife			11-12-2006

4	Pick up the dog			11-11-2006

What I want is to remove that third line. I already "know" that itemid 17 has been updated recently. I don't need to "know" that it was updated previously - even if it is inside my 10 day window. Sure, I can simply keep track of the itemIds in my loop and not display the ones that have already been handled. But surely there is another way? Here's the solution:

<Cfquery ...>
SELECT     v.itemid, v.title, e.datecreated, e.eventid
FROM     items v
    INNER JOIN Events e
        ON v.itemid = e.itemid
WHERE     e.eventid in (
    SELECT     top 1 e2.eventid
    FROM     events e2
    WHERE     e2.itemid = v.itemid
    ORDER BY e2.datecreated desc
    )
ORDER BY e.datecreated desc
</CFQUERY>

What's the secret? The subsect filters out all but the latest eventid - and that effects the date that is pulled in from the eventID table. This has the effect of removing that second level of items.

NOTE: Thanks to my good friend and SQL guru Mike Klostermeyer for this tip.

Related Blog Entries

Removing Duplicate Records in a Database (February 2, 2006)

8 Comments

Posted By
spiraldev | 11/16/06 1:44 PM

Why not do this ?
<code>
<Cfquery ...>
SELECT v.itemid, v.title, e.datecreated, e.eventid
FROM items v
INNER JOIN Events e
ON v.itemid = e.itemid
GROUP BY v.itemid, v.title, e.datecreated, e.eventid
ORDER BY e.datecreated desc
</CFQUERY>
</code>

should be faster.
Posted By
spiraldev | 11/16/06 1:47 PM

Oops
I meant
<Cfquery ...>
SELECT v.itemid, v.title, e.eventid
FROM items v
INNER JOIN Events e
ON v.itemid = e.itemid
GROUP BY v.itemid, v.title, e.datecreated, e.eventid
ORDER BY e.datecreated desc
</CFQUERY>
Posted By
spiraldev | 11/16/06 1:48 PM

sorry
<Cfquery ...>
SELECT v.itemid, v.title, e.eventid
FROM items v
INNER JOIN Events e
ON v.itemid = e.itemid
GROUP BY v.itemid, v.title, e.eventid
ORDER BY e.datecreated desc
</CFQUERY>
Posted By
Brian Kotek | 11/16/06 3:38 PM

Correlated subqueries should be avoided as much as possible because they are usually very slow (especially for large data sets). An alternative would be to run an inline view to get the top 1 event id and item id, and then joining this result set to your Events table.
Posted By
Brian Kotek | 11/16/06 3:48 PM

Actually, in this case it looks like you could also use a WHERE EXISTS clause, which is also almost always faster than an IN clause that uses a correlated subquery.
Posted By
Rick O | 11/17/06 11:23 AM

Why not:

SELECT i.ItemID, i.Description, MAX(e.Created) AS LastUpdate
FROM items AS i
INNER JOIN events AS e ON (i.ItemID = e.ItemID)
GROUP BY i.ItemID, i.Description
HAVING MAX(e.Created) > DATEADD(DAY,-10,GETDATE())
ORDER BY LastUpdate DESC

Of course, that's got a bit of SQL Server-specific date magic, but you get the point. If you want the EventID that corresponds to that date, things get a little trickier:

SELECT i.ItemID, i.Description, f.EventID AS EventID, f.Created AS LastUpdate
FROM items AS i
INNER JOIN events AS e ON (i.ItemID = e.ItemID)
INNER JOIN events AS f ON (f.ItemID = i.ItemID)
GROUP BY i.ItemID, i.Description, f.EventID, f.Created
HAVING MAX(e.Created) > DATEADD(DAY,-10,GETDATE()) AND (f.Created = MAX(e.Created))
ORDER BY LastUpdate DESC

I know that looks heinous, with a join via the HAVING clause, but if you run it you'll see that it works.

Think of it this way: you need the Events table twice, once aggregated (to find the max date) and once unaggregated (to find the EventID that goes with that max date). The "e" alias is the aggregated version, hence why none of its columns show up in the GROUP BY clause. The "f" alias is the unaggregated version, so you can then tie back to the "e" table. You can't join in the ON clause like you normally would, because that happens before the aggregation, so you have to join in the HAVING clause, which happens after.

If your event dates really are just days and not full (unique) timestamps, you need to wrap an aggregating function like MIN or MAX around the f.EventID in the SELECT list and take it out of the GROUP BY list. This will pick just one of the Events from that same day and you won't get duplicate rows. You would still leave the f.Created in the GROUP BY, however.

Is that twisted, or what?
Posted By
mkruger | 11/18/06 10:43 AM

Rick,

Whoa... that is twisted :) I think there is a future for you in writing the US tax code. That's an excellent and smart solution - and I bet it performs better than the sub select. Thanks for the input!

-Mark
Posted By
jM | 3/3/07 2:03 AM

Just to throw another option into the mix ..

Why not use derived tables? They're straightforward and likely outperform all of the queries above.

SELECT i.Itemid, i.Title, e.DateCreated AS LastUpdate
FROM (
      SELECT   ItemId, MAX(DateCreated) AS DateCreated
      FROM   Events
      WHERE   DateCreated > @tenDaysAgo
      GROUP BY ItemId
   )
   AS e INNER JOIN Items i ON e.ItemID = i.ItemId
ORDER BY e.DateCreated desc

SELECT i.Itemid, i.Title, e.DateCreated, MAX(e.eventId) AS EventId
FROM (
      SELECT   ItemId, MAX(DateCreated) AS DateCreated
      FROM   Events
      WHERE   DateCreated > @tenDaysAgo
      GROUP BY ItemId
   )
   AS mx
   INNER JOIN Items i ON mx.ItemID = i.ItemId
   INNER JOIN Events e ON mx.ItemID = e.ItemID
      AND mx.DateCreated = e.DateCreated
GROUP BY i.Itemid, i.Title, e.DateCreated
ORDER BY e.DateCreated desc

> If you want the EventID that corresponds to that date,
> things get a little trickier:

You might check that query. It looks like it could return more than one row per itemId if two events have the same timestamp. Unlikely, but possible.

ColdFusion Muse

Duplicates and Sub Selects

The Problem

Real World Example

Related Blog Entries

8 Comments

Posted By

Posted By

Posted By

Posted By

Posted By

Posted By

Posted By

Posted By

Recent Entries

Google Search

Sites to Watch

Other Guru Blogs

CF Webtools Blogs

Fan Pages & Support Groups

Subscribe

Recent Comments

Archives By Subject

Twitter

Facebook

Aggregators

RSS

inquiries