ColdFusion Muse

You Should Use an Alias for Those Sub Query Tables!

Here's a rule of thumb for you. When you are writing a query that includes a sub query always use an alias for the table and qualify the columns - or use the full column name if you are so inclined. If you don't you may very well end up with unexpected results. Consider this query:

<cfquery name="getUsers" datasource="#myDsn#">
   SELECT   Username, fullname
   FROM   users
   WHERE   username IN
         (SELECT username
          FROM   groupsUsers
          WHERE   groupId = 5)
</cfquery>

Let's just say that the table "groupsUsers" is a cross reference (or bridge) table containing 2 columns, userid and groupid. With that piece of information let's have a quick show of cyber-hands - how many of you think this query will work (and by work I mean actually return values instead of errors)? Not too many of you right? That shows how sophisticated muse readers actually are. It definitely looks like a stinker because there is no "username" column in the "groupsUsers" table - right? So it should error out with a "column not found" type of error. Ah... not so fast.

It Works!

Actually the query does work - though not as intended. You see, because the column name "username" is in the "users" table, when the DB server processes the query it says to itself... "username, username.... let's see, do I have a username column somewhere?" And because you have not specified the table name or an alias it says, "Ah ha! ...here's a username in the users table.". This has the effect of selecting the username from the users table for every row in the users table. To put it another way, the query above is actually doing this:

<cfquery name="getUsers" datasource="#myDsn#">
   SELECT   Username, fullname
   FROM   users
   WHERE   username = username
</cfquery>
The result? Every row in the users table will be returned.

The fix

The fix is to specify the table using an alias or full table qualifier. That's good practice in all but the simplest applications. If you had done that in the example above...

<cfquery name="getUsers" datasource="#myDsn#">
   SELECT   u.Username, u.fullname
   FROM   users u
   WHERE   u.username IN
         (SELECT gu.username
          FROM   groupsUsers gu
          WHERE   gu.groupId = 5)
</cfquery>
... It would have immediately thrown an error and you would have rewritten it as:
<cfquery name="getUsers" datasource="#myDsn#">
   SELECT   u.Username, u.fullname
   FROM   users u
   WHERE   u.userid IN
         (SELECT gu.userid
          FROM   groupsUsers gu
          WHERE   gu.groupId = 5)
</cfquery>
This is yet another example where following some "best practices" will help you avoid mistakes. It would be pretty easy to let the query above slip by because it does actually return records. Hopefully QA and process testing would ferret out such a mistake, but you would probably end up in a straight jacket trying to find and fix the query. Happy coding.

Comments
tony petruzzi's Gravatar And we're not using an INNER JOIN because? Seriously, 99% of sub queries can be written to an INNER JOIN or OUTER JOIN. Subqueries are slow and NOT best practice. I would also recommend using derived table to get rid of large GROUP BY / HAVING statements.
# Posted By tony petruzzi | 3/15/06 3:28 PM
Rick O's Gravatar Not to start a flame war or anything, but saying that subqueries "are slow and not best practice" is at best misleading and at worse complete FUD. It may be true that on some DB systems subqueries are slower than joins (especially early v4/v5 versions of mySQL, IIRC), I challenge you to prove it for all. Making blanket statements like this is just ... wrong. Admittedly, the examples above are not the best justification my little rant here, but I'm making a general case, not defending this specific set.

Sometimes, especially when aggregating data, subqueries are more appropriate and even faster than joins. This is especially true when trying to produce summary data from several tables at once. If you'd like an example of this, I'm sure I can scrounge one up.

And, for those who don't know what all the fuss is about, here's an example of how both myself and Tony P might have rewritten the above example:

SELECT u.Username, u.fullname
FROM users AS u
INNER JOIN groupsUsers AS gu ON (u.userid = gu.userid)
WHERE (gu.groupId = 5)

If you are on a DBMS is that is so old-school as to not like the "INNER JOIN" keyword phrase, this might suit you better:

SELECT u.Username, u.fullname
FROM users AS u, groupsUsers AS gu
WHERE (u.userid = gu.userid)
AND (gu.groupId = 5)

Personally, I find that latter one much harder to read, as I like having my primary/foreign keys right up against they tables they apply to.
# Posted By Rick O | 3/15/06 4:45 PM
Mkruger's Gravatar Tony (and Rick),

Ok... first off - don't pick on my rudimentary examples because they are well.... rudimentary.



Secondly, sub queries are a useful and accepted practice - and they represent (syntactically) an expression of function that makes them useful in certain cases. They express "give me all of the records in THIS table that match something or other in THAT table". They are pretty easy to pick up on. Certainly they can underperform and certainly joins are a better choice in many if not most cases. But I'd say it's just hubris to think that joins are "always" better and there is "never" a place for them - and to therefore conclude they have no place in best practice.


Thirdly, the point of the post was not to stir up a debate about sub queries. I was merely to point out a nuanced programming flaw that looks like a bug but is really the result of a lack of table qualifiers.

Now you boys place nice or I'm gonna delete these comments and take away your birthday :)
# Posted By Mkruger | 3/15/06 5:17 PM
Rick O's Gravatar Just to back up my earlier statement:

Presume that you have a sales system, such as a shopping cart. Your system allows you to place orders on hold so that you can perform credit checks, wait for pre-payment, double-check inventory, or whatever. You'd like a report at the beginning of each day that shows you how much money you could make if you could clear out all of the held orders, sort of as a speedometer for your known earnings potential, and how much work you'd have to do to make that money.

The schema:
Open orders are in Open_Order_Detail (one entry for each line/item)
Order holds are in Held_Orders (one entry for each hold code applied to an order)

The naive approach would be to do this:

SELECT
d.Order_Number,
COUNT(d.Line_Number) AS Items,
SUM(d.Extended_Price) AS Dollars,
COUNT(h.Hold_Code) AS Holds
FROM Open_Order_Detail AS d
LEFT OUTER JOIN Held_Orders AS h
ON (h.Order_Number = d.Order_Number)
GROUP BY d.Order_Number
ORDER BY d.Order_Number

It looks like it gives you a list of open orders, how many items are in each order, how much money the order is worth, and how many holds you need to clear to ship the order. Right?

Nope. If you consider that each order may have multiple items and multiple holds, you'll always end up with a cartesian product on your join. Look at the scenarios:

-- If you have 0 holds, you'll show the correct number of items and holds.
-- If you have 1 item and 2 holds, you'll actually show 2 items and 2 holds, as you'll have 2 lines (one for each hold) before the data is summarized. The item will be duplicated.
-- If you have 2 items and 1 hold, you'll show 2 items and 2 holds, as this time the hold code will be duplicated before summarization.
-- If you have 2 items and 2 holds, you'll get 4 items and 4 holds, as the both the items and holds will be duplicated before summarization.

Tricky, eh? You might try adding DISTINCT inside of your aggregation functions, but not all DBMSes support that, and what if you intentionally duplicate the same item or hold? I've seen plenty of systems that allow duplicate items in orders, especially when items can be customized. Duplicate holds are a bit more rare, but a case where multiple people need to release certain types of holds would not be out of the question. You could also write some tricky CASE statements to handle it, but ... why?

Instead, this query would provide you with what you are actually looking for:

SELECT d.Order_Number, d.Items, d.Dollars, COALESCE(h.Holds,0) AS Holds
FROM (
SELECT Order_Number, COUNT(*) AS Items, SUM(Extended_Price) AS Dollars
FROM Open_Order_Detail
GROUP BY Order_Number
) AS d LEFT OUTER JOIN (
SELECT Order_Number, COUNT(*) AS Holds
FROM Held_Orders
GROUP BY Order_Number
) AS h ON (h.Order_Number = d.Order_Number)
ORDER BY d.Order_Number

As you can see, I aggregate the data *before* I do the join so as to ensure that no row duplicating takes place. This means that I need a COALESCE to replace any NULLs with 0 values, but that's not horrendous.

You should also be able to see that this technique would come in even more use when you have many tables that need to be joined and summarized in this way. Maybe you also want to include a sales dollar total for the last 6 months for the customer who is doing the ordering so you can then prioritize the orders? That's another table that needs to be joined and summarized.

As for performance ... I tested these exact queries on the shopping cart system here at work. There are about 20 thousand orders and 500 hold codes in the system. The first (incorrect) query actually took several seconds longer than the second (correct) query, so not only would I be getting bad data, but I'd also be getting it more slowly. I ran each several times to ensure that all the necessary data was in memory, but the results did not change significantly. When I ran them against my Sales History table, with 2 million records, the speed difference was even more noticeably in favor of the second query.

So there you go.
# Posted By Rick O | 3/15/06 5:28 PM



Blog provided and hosted by CF Webtools. Blog Sofware by Ray Camden.