Here's a rule of thumb for you. When you are writing a query that includes a sub query always use an alias for the table and qualify the columns - or use the full column name if you are so inclined. If you don't you may very well end up with unexpected results. Consider this query:
Let's just say that the table "groupsUsers" is a cross reference (or bridge) table containing 2 columns, userid and groupid. With that piece of information let's have a quick show of cyber-hands - how many of you think this query will work (and by work I mean actually return values instead of errors)? Not too many of you right? That shows how sophisticated muse readers actually are. It definitely looks like a stinker because there is no "username" column in the "groupsUsers" table - right? So it should error out with a "column not found" type of error. Ah... not so fast.
Actually the query does work - though not as intended. You see, because the column name "username" is in the "users" table, when the DB server processes the query it says to itself... "username, username.... let's see, do I have a username column somewhere?" And because you have not specified the table name or an alias it says, "Ah ha! ...here's a username in the users table.". This has the effect of selecting the username from the users table for every row in the users table. To put it another way, the query above is actually doing this:
The fix is to specify the table using an alias or full table qualifier. That's good practice in all but the simplest applications. If you had done that in the example above...
Sometimes, especially when aggregating data, subqueries are more appropriate and even faster than joins. This is especially true when trying to produce summary data from several tables at once. If you'd like an example of this, I'm sure I can scrounge one up.
And, for those who don't know what all the fuss is about, here's an example of how both myself and Tony P might have rewritten the above example:
SELECT u.Username, u.fullname
FROM users AS u
INNER JOIN groupsUsers AS gu ON (u.userid = gu.userid)
WHERE (gu.groupId = 5)
If you are on a DBMS is that is so old-school as to not like the "INNER JOIN" keyword phrase, this might suit you better:
SELECT u.Username, u.fullname
FROM users AS u, groupsUsers AS gu
WHERE (u.userid = gu.userid)
AND (gu.groupId = 5)
Personally, I find that latter one much harder to read, as I like having my primary/foreign keys right up against they tables they apply to.
Ok... first off - don't pick on my rudimentary examples because they are well.... rudimentary.
Secondly, sub queries are a useful and accepted practice - and they represent (syntactically) an expression of function that makes them useful in certain cases. They express "give me all of the records in THIS table that match something or other in THAT table". They are pretty easy to pick up on. Certainly they can underperform and certainly joins are a better choice in many if not most cases. But I'd say it's just hubris to think that joins are "always" better and there is "never" a place for them - and to therefore conclude they have no place in best practice.
Thirdly, the point of the post was not to stir up a debate about sub queries. I was merely to point out a nuanced programming flaw that looks like a bug but is really the result of a lack of table qualifiers.
Now you boys place nice or I'm gonna delete these comments and take away your birthday :)
Presume that you have a sales system, such as a shopping cart. Your system allows you to place orders on hold so that you can perform credit checks, wait for pre-payment, double-check inventory, or whatever. You'd like a report at the beginning of each day that shows you how much money you could make if you could clear out all of the held orders, sort of as a speedometer for your known earnings potential, and how much work you'd have to do to make that money.
The schema:
Open orders are in Open_Order_Detail (one entry for each line/item)
Order holds are in Held_Orders (one entry for each hold code applied to an order)
The naive approach would be to do this:
SELECT
d.Order_Number,
COUNT(d.Line_Number) AS Items,
SUM(d.Extended_Price) AS Dollars,
COUNT(h.Hold_Code) AS Holds
FROM Open_Order_Detail AS d
LEFT OUTER JOIN Held_Orders AS h
ON (h.Order_Number = d.Order_Number)
GROUP BY d.Order_Number
ORDER BY d.Order_Number
It looks like it gives you a list of open orders, how many items are in each order, how much money the order is worth, and how many holds you need to clear to ship the order. Right?
Nope. If you consider that each order may have multiple items and multiple holds, you'll always end up with a cartesian product on your join. Look at the scenarios:
-- If you have 0 holds, you'll show the correct number of items and holds.
-- If you have 1 item and 2 holds, you'll actually show 2 items and 2 holds, as you'll have 2 lines (one for each hold) before the data is summarized. The item will be duplicated.
-- If you have 2 items and 1 hold, you'll show 2 items and 2 holds, as this time the hold code will be duplicated before summarization.
-- If you have 2 items and 2 holds, you'll get 4 items and 4 holds, as the both the items and holds will be duplicated before summarization.
Tricky, eh? You might try adding DISTINCT inside of your aggregation functions, but not all DBMSes support that, and what if you intentionally duplicate the same item or hold? I've seen plenty of systems that allow duplicate items in orders, especially when items can be customized. Duplicate holds are a bit more rare, but a case where multiple people need to release certain types of holds would not be out of the question. You could also write some tricky CASE statements to handle it, but ... why?
Instead, this query would provide you with what you are actually looking for:
SELECT d.Order_Number, d.Items, d.Dollars, COALESCE(h.Holds,0) AS Holds
FROM (
SELECT Order_Number, COUNT(*) AS Items, SUM(Extended_Price) AS Dollars
FROM Open_Order_Detail
GROUP BY Order_Number
) AS d LEFT OUTER JOIN (
SELECT Order_Number, COUNT(*) AS Holds
FROM Held_Orders
GROUP BY Order_Number
) AS h ON (h.Order_Number = d.Order_Number)
ORDER BY d.Order_Number
As you can see, I aggregate the data *before* I do the join so as to ensure that no row duplicating takes place. This means that I need a COALESCE to replace any NULLs with 0 values, but that's not horrendous.
You should also be able to see that this technique would come in even more use when you have many tables that need to be joined and summarized in this way. Maybe you also want to include a sales dollar total for the last 6 months for the customer who is doing the ordering so you can then prioritize the orders? That's another table that needs to be joined and summarized.
As for performance ... I tested these exact queries on the shopping cart system here at work. There are about 20 thousand orders and 500 hold codes in the system. The first (incorrect) query actually took several seconds longer than the second (correct) query, so not only would I be getting bad data, but I'd also be getting it more slowly. I ran each several times to ensure that all the necessary data was in memory, but the results did not change significantly. When I ran them against my Sales History table, with 2 million records, the speed difference was even more noticeably in favor of the second query.
So there you go.