ColdFusion Muse

Duplicates and Sub Selects

Mark Kruger November 16, 2006 1:45 PM SQL tips Comments (8)

Here's a sticky problem. How do you build a query that gives you distinct records from one table based on multiple records from another table and order by a date found in the second table. In my real world example, I have project tracks in "Items" and comments or notes in "events". I want a distinct list of "items" that have been updated in the past 10 days. That seems easy right? Well... not as easy as it seems on the surface.

The Problem

Let's say in a project has been "commented on" ("events" table) several times during the 10 days. If I do something like:

<cfquery ....>
        SELECT DISTINCT I.itemid, i.title
        FROM items i join events e
        ON i.itemid = e.itemid            
        ORDER by e.datecreated
    </cfquery>
The DB will complain that I cannot order by a column that is not in the SELECT list when using "DISTINCT". If I put the column in the select list I will get duplicates.

Real World Example

Let's a look at the problem a bit closer. Items contains the master record for a particular track. For example, Items are related to individual bugs, feature requests, work orders, or change requests. For example, the Items table might look like:

ItemId	Description
17 Wash Your Wife
22 Take out the dry cleaning
4 Pick up the dog
Events contains comments or updates mapped to the itemid. So events might look lik this:
ItemId	event				Date
17 She Didn't like it 11-15-2006
17 Are you sure about this 11-12-2006
22 My Suite didn't like desert 11-14-2006
4 I should have bought a smaller dog 11-11-2006
If we order by date we will end up with:
ItemId	event				Date
17 She Didn't like it 11-15-2006
22 My Suite didn't like desert 11-14-2006
17 Are you sure about this 11-12-2006
4 I should have bought a smaller dog 11-11-2006
...so any "select distinct" join that includes the date will end up showing

ItemId	Description			date
17 Wash Your Wife 11-15-2006
22 Take out the dry cleaning 11-14-2006
17 Wash Your Wife 11-12-2006
4 Pick up the dog 11-11-2006

What I want is to remove that third line. I already "know" that itemid 17 has been updated recently. I don't need to "know" that it was updated previously - even if it is inside my 10 day window. Sure, I can simply keep track of the itemIds in my loop and not display the ones that have already been handled. But surely there is another way? Here's the solution:

<Cfquery ...>
SELECT     v.itemid, v.title, e.datecreated, e.eventid
FROM     items v
    INNER JOIN Events e
        ON v.itemid = e.itemid
WHERE     e.eventid in (
    SELECT     top 1 e2.eventid
    FROM     events e2
    WHERE     e2.itemid = v.itemid
    ORDER BY e2.datecreated desc
    )
ORDER BY e.datecreated desc
</CFQUERY>

What's the secret? The subsect filters out all but the latest eventid - and that effects the date that is pulled in from the eventID table. This has the effect of removing that second level of items.

NOTE: Thanks to my good friend and SQL guru Mike Klostermeyer for this tip.

  • Share:

Related Blog Entries

8 Comments

  • spiraldev's Gravatar
    Posted By
    spiraldev | 11/16/06 1:44 PM
    Why not do this ?
    <code>
    <Cfquery ...>
    SELECT v.itemid, v.title, e.datecreated, e.eventid
    FROM items v
    INNER JOIN Events e
    ON v.itemid = e.itemid
    GROUP BY v.itemid, v.title, e.datecreated, e.eventid
    ORDER BY e.datecreated desc
    </CFQUERY>
    </code>

    should be faster.
  • spiraldev's Gravatar
    Posted By
    spiraldev | 11/16/06 1:47 PM
    Oops
    I meant
    <Cfquery ...>
    SELECT v.itemid, v.title, e.eventid
    FROM items v
    INNER JOIN Events e
    ON v.itemid = e.itemid
    GROUP BY v.itemid, v.title, e.datecreated, e.eventid
    ORDER BY e.datecreated desc
    </CFQUERY>
  • spiraldev's Gravatar
    Posted By
    spiraldev | 11/16/06 1:48 PM
    sorry
    <Cfquery ...>
    SELECT v.itemid, v.title, e.eventid
    FROM items v
    INNER JOIN Events e
    ON v.itemid = e.itemid
    GROUP BY v.itemid, v.title, e.eventid
    ORDER BY e.datecreated desc
    </CFQUERY>
  • Brian Kotek's Gravatar
    Posted By
    Brian Kotek | 11/16/06 3:38 PM
    Correlated subqueries should be avoided as much as possible because they are usually very slow (especially for large data sets). An alternative would be to run an inline view to get the top 1 event id and item id, and then joining this result set to your Events table.
  • Brian Kotek's Gravatar
    Posted By
    Brian Kotek | 11/16/06 3:48 PM
    Actually, in this case it looks like you could also use a WHERE EXISTS clause, which is also almost always faster than an IN clause that uses a correlated subquery.
  • Rick O's Gravatar
    Posted By
    Rick O | 11/17/06 11:23 AM
    Why not:

    SELECT i.ItemID, i.Description, MAX(e.Created) AS LastUpdate
    FROM items AS i
    INNER JOIN events AS e ON (i.ItemID = e.ItemID)
    GROUP BY i.ItemID, i.Description
    HAVING MAX(e.Created) > DATEADD(DAY,-10,GETDATE())
    ORDER BY LastUpdate DESC

    Of course, that's got a bit of SQL Server-specific date magic, but you get the point. If you want the EventID that corresponds to that date, things get a little trickier:

    SELECT i.ItemID, i.Description, f.EventID AS EventID, f.Created AS LastUpdate
    FROM items AS i
    INNER JOIN events AS e ON (i.ItemID = e.ItemID)
    INNER JOIN events AS f ON (f.ItemID = i.ItemID)
    GROUP BY i.ItemID, i.Description, f.EventID, f.Created
    HAVING MAX(e.Created) > DATEADD(DAY,-10,GETDATE()) AND (f.Created = MAX(e.Created))
    ORDER BY LastUpdate DESC

    I know that looks heinous, with a join via the HAVING clause, but if you run it you'll see that it works.

    Think of it this way: you need the Events table twice, once aggregated (to find the max date) and once unaggregated (to find the EventID that goes with that max date). The "e" alias is the aggregated version, hence why none of its columns show up in the GROUP BY clause. The "f" alias is the unaggregated version, so you can then tie back to the "e" table. You can't join in the ON clause like you normally would, because that happens before the aggregation, so you have to join in the HAVING clause, which happens after.

    If your event dates really are just days and not full (unique) timestamps, you need to wrap an aggregating function like MIN or MAX around the f.EventID in the SELECT list and take it out of the GROUP BY list. This will pick just one of the Events from that same day and you won't get duplicate rows. You would still leave the f.Created in the GROUP BY, however.

    Is that twisted, or what?
  • mkruger's Gravatar
    Posted By
    mkruger | 11/18/06 10:43 AM
    Rick,

    Whoa... that is twisted :) I think there is a future for you in writing the US tax code. That's an excellent and smart solution - and I bet it performs better than the sub select. Thanks for the input!

    -Mark
  • jM's Gravatar
    Posted By
    jM | 3/3/07 2:03 AM
    Just to throw another option into the mix ..

    Why not use derived tables? They're straightforward and likely outperform all of the queries above.

    SELECT i.Itemid, i.Title, e.DateCreated AS LastUpdate
    FROM (    
          SELECT   ItemId, MAX(DateCreated) AS DateCreated
          FROM   Events
          WHERE   DateCreated > @tenDaysAgo
          GROUP BY ItemId
       )
       AS e INNER JOIN Items i ON e.ItemID = i.ItemId
    ORDER BY e.DateCreated desc

    SELECT i.Itemid, i.Title, e.DateCreated, MAX(e.eventId) AS EventId
    FROM (    
          SELECT   ItemId, MAX(DateCreated) AS DateCreated
          FROM   Events
          WHERE   DateCreated > @tenDaysAgo
          GROUP BY ItemId
       )
       AS mx
       INNER JOIN Items i ON mx.ItemID = i.ItemId
       INNER JOIN Events e ON mx.ItemID = e.ItemID
          AND mx.DateCreated = e.DateCreated
    GROUP BY i.Itemid, i.Title, e.DateCreated
    ORDER BY e.DateCreated desc

    > If you want the EventID that corresponds to that date,
    > things get a little trickier:

    You might check that query. It looks like it could return more than one row per itemId if two events have the same timestamp. Unlikely, but possible.