ColdFusion Muse

Data Migration and Coldfusion

If you've been working with Coldfusion very long, chances are you've written a data import script. There are many tools that allow you to migrate data from one database platform or schema to another, and I'm well aware that "guru dogma" states that Coldfusion is not the best tool for things like long running tasks that can be performed by the database. I'm also a big advocate for letting the database do its job. So it may surprise you to learn that I believe Coldfusion is actually a pretty good choice in many cases - especially if you have to do anything tricky with the data. Take looping for example:

In T-SQL, if you want to loop through a record set and apply logic to the data you will need a cursor. Cursor code isn't hard to write, but it will definitely take you longer and require a bit more skill - especially when debugging. Take this example, adapted from SQL Server Books Online:

-- Declare the local variables the fetch will populate.
DECLARE @au_lname varchar(40), @au_fname varchar(20)

DECLARE authors_cursor CURSOR FOR
SELECT au_lname, au_fname FROM authors
WHERE au_lname LIKE 'B%'
ORDER BY au_lname, au_fname

OPEN authors_cursor

-- Perform the first fetch and store the values in variables.
-- Note: The variables are in the same order as the columns
-- in the SELECT statement.

FETCH NEXT FROM authors_cursor
INTO @au_lname, @au_fname

-- Check @@FETCH_STATUS to see if there are any more rows to fetch.
WHILE @@FETCH_STATUS = 0
BEGIN

-- Concatenate and display the current values in the variables.
PRINT 'Author: ' + @au_fname + ' ' + @au_lname

-- This is executed as long as the previous fetch succeeds.
FETCH NEXT FROM authors_cursor
INTO @au_lname, @au_fname
END

CLOSE authors_cursor
DEALLOCATE authors_cursor

This code works by declaring a query as a cursor, then looping through each row, loading the local variables @au_lname and @au_fname with values from the current row on each iteration. It takes a good bit of cryptic code to do what Coldfusion can do like this:

<cfquery name="authors" datasource="pubs">
   SELECT au_lname, au_fname FROM authors
   WHERE au_lname LIKE 'B%'
   ORDER BY au_lname, au_fname
</cfquery>

<cfoutput query="authors">
   Author: #au_fname# #au_lname# <br/>
</cfoutput>
Even a novice could see what is happening here.

Sometimes data import tasks are very challenging and require a good deal of data manipulation. Yes, you can write very complex SQL to do such things, but you can do it faster in Coldfusion. If you read my blog you know I'm an advocate for letting the Database do what it does best. But I'm also an advocate for being cost effective. If you are doing a data migration as part of a deployment, then there will be times when Coldfusion will allow you to quickly import records that may have taken a much greater effort in SQL. For example, you may be tasked with merging 2 databases and eliminating the duplicate data between them. You might write code like this.

<!--- get current data --->
<cfquery name="qryAll" datasource="#myDsn#">
   SELECT username
   FROM   oldTable
</cfquery>
<!--- loop through the data --->
<cfloop query="qryAll">
   <!--- check to see if it exists --->
   <cfquery name="check" datasource="#myDsn#">
      SELECT   username
      FROM     newTable
      WHERE    username = '#username#'
   </cfquery>
   <!--- if not, put it in the db --->
   <cfif NOT check.recordcount>
      <cfquery name="put" datasource="#myDsn#">
         INSERT INTO newTable
            (username)
         VALUES
            ('#username#')
      </cfquery>
   </cfif>
</cfloop>

In this simple example we are checking to see if a username from "oldtable" exists in "newtable" and if it does not, we are inserting it. Obviously, typical migration code is vastly more complicated. In this case we could eliminate the "check" query using SQL like this (which I prefer):

<cfquery name="put" datasource="#myDsn#">
   IF NOT EXISTS
      (SELECT username
       FROM newTable
       WHERE username = '#username#')
   BEGIN
      INSERT INTO newTable
         (username)
      VALUES
         ('#username#')
   END
</cfquery>
That would save us a step and a hit on the db.
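As an aside - the literal '#username#' interpolation works, but cfqueryparam handles the quoting for you and guards against SQL injection, which is a good habit even in a one-off migration script. A sketch, assuming username is a varchar column:

```cfml
<cfquery name="put" datasource="#myDsn#">
   IF NOT EXISTS
      (SELECT username
       FROM newTable
       WHERE username = <cfqueryparam value="#username#" cfsqltype="cf_sql_varchar">)
   BEGIN
      INSERT INTO newTable (username)
      VALUES (<cfqueryparam value="#username#" cfsqltype="cf_sql_varchar">)
   END
</cfquery>
```

As a bonus, the bind variable lets the db reuse the execution plan on every pass through the loop, which can speed up a big import.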

When you do this sort of thing there is an issue you may not have thought about: in Coldfusion, each cfquery call to the db is its own implicit commit. If you were doing the same work in a single T-SQL batch wrapped in a transaction, an error would roll the whole thing back - as if the process had never begun. That is not the case with the Coldfusion code above, which leaves the possibility that, if an error occurs partway through, your data would be half imported into "newtable".

Using Cftransaction

This is where cftransaction comes to the rescue. You can treat the loop code as if it were a block of SQL - even though it may contain things other than cfquery. An exception will cause a roll-back of all the DB calls that have been made so far. Using cftransaction is simple.

<cftransaction action="BEGIN">

<cfloop query="qryAll">
   <cfquery name="put" datasource="#myDsn#">
      IF NOT EXISTS
         (SELECT username
          FROM newTable
          WHERE username = '#username#')
      BEGIN
         INSERT INTO newTable
            (username)
         VALUES
            ('#username#')
      END
   </cfquery>
</cfloop>
</cftransaction>

Note that I'm not specifying an isolation level and I'm not explicitly telling the db to commit. The isolation level is probably not an issue if you are doing a pre-deployment data migration script, and the commit is implied by the closing transaction tag. You can get quite granular in your control of specific commits and rollbacks if you like, but in this example we are saying simply, "if the process errors out before it completes then kill the whole thing". It's all or nothing.
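If you do want that granular control - say, committing in batches so one bad record near the end doesn't throw away the whole run - you can nest commit and rollback actions inside the transaction block and trap errors with cftry. A sketch (the batch size of 500 is an arbitrary choice):

```cfml
<cftransaction action="BEGIN">
   <cftry>
      <cfloop query="qryAll">
         <cfquery name="put" datasource="#myDsn#">
            IF NOT EXISTS
               (SELECT username FROM newTable
                WHERE username = '#username#')
            BEGIN
               INSERT INTO newTable (username)
               VALUES ('#username#')
            END
         </cfquery>
         <!--- commit every 500 rows --->
         <cfif NOT qryAll.currentRow MOD 500>
            <cftransaction action="COMMIT"/>
         </cfif>
      </cfloop>
      <cfcatch type="any">
         <!--- undo everything since the last commit, then let the error bubble up --->
         <cftransaction action="ROLLBACK"/>
         <cfrethrow>
      </cfcatch>
   </cftry>
</cftransaction>
```

Everything committed in an earlier batch stays committed; only the work since the last commit is rolled back, so you trade "all or nothing" for a known restart point.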

Now you may be wondering about the accuracy of your duplicate record check. If I insert a username "BOB" inside of my cftransaction, then 10 rows later I try to insert "BOB" again, will the database "see" the first Bob? After all, it's not committed yet - right? The answer is yes. The database will see Bob number 1 and not insert Bob number 2. Even though the transaction is not yet committed, the db can read the new rows because the read happens "inside" the same transaction. Your checks will work as you expect.

6 Comments

  • Posted By Steve Gustafson | 11/18/05 10:43 AM
    I think there is a significant caveat to the advice in this post. This type of solution will work fine with a trivial dataset, but would be brutal on an import of a few hundred thousand records ( I cringe to think of millions! ).

    Additionally I felt compelled to point out that the cursor example you post is a silly use of a cursor. This is no more difficult in sql than cfml:
    SELECT au_id,au_fname + ' ' + au_lname AS au_FULLNAME
    FROM authors
    WHERE au_lname LIKE 'B%'

    (note: the order by au_lname is not likely to be needed in a data migration)

    It's not "guru dogma" to use the tools that are best designed to do specific jobs. It's just good programming.

    Gus
  • Posted By mkruger | 11/18/05 11:17 AM
    Gus - I agree with your sentiments about the cursor code. I could find no cursor code that I was at liberty to reveal without extensive modification - so I borrowed from "SQL books online". While books online is an excellent help resource - its examples (taken from Pubs or northwind) often fit the bill of a "square peg in a round hole" :)

    Your advice about hundreds of thousands and even millions of rows is also well taken. In my experience that advice would only apply to a small set of high-power applications. In those cases the use of Coldfusion could indeed be laborious - not in writing the code (where CF might save time and money) but in actually running the script - where it would no doubt bog down and become a real dog.

    Having said that, I've seen data sets of over 100k records migrated quite ably using CF. It is particularly useful where the schema is going to change dramatically from one database to the next - or when you are moving from a flat file or proprietary db to an RDBMS, or when merging several identical dbs (identical as to schema). In many cases, a combination of stored procedures, cf queries and cf logic can save time and money.

    You may have missed it, or perhaps you are unfamiliar with my writing style, but I always include caveats and I'm never absolute about advice. For example, in this post I warned that I am an advocate for using the DB to do what it does best, and I even said that this advice might be a little surprising. I wrote "...you may be surprised to learn that I believe Coldfusion is actually a pretty good choice in many cases." (with "pretty good choice" in italics). I said "pretty good", not "excellent" and I said "many cases" not "most cases". Given that as my opening statement you can hardly say I was overly zealous in pushing Coldfusion as a choice :)
  • Posted By Steve Bryant | 11/18/05 11:43 AM
    Mark,

    Very good post.

    For your particular example, wouldn't the following code work:

    <cfquery datasource="#myDsn#">
    INSERT INTO newtable (username)
    SELECT   username
    FROM oldTable
    WHERE   NOT EXISTS (
             SELECT   username
             FROM   newtable
             WHERE   username = oldTable.username
          )
    </cfquery>

    (of course, correlated subqueries fail badly for very large data sets)

    That being said, I think that the approach you mention is a very good one, especially when dealing with data from different databases.

    Thanks for the good information.
  • Posted By Mkruger | 11/18/05 11:47 AM
    Yeah - the example is too trivial to gain traction I guess. It's difficult to come up with examples that fit the topic but don't consume the entire post (ha).
  • Posted By Will Spurgeon | 12/30/05 10:44 PM
    I'm building a replication engine that typically will only handle a few hundred/few thousand rows, but occasionally may need to handle a couple of million :) Two factors make the pure use of stored procs and linked servers NOT my preferred option (though I'm well capable of this solution). A)The settings for each data pump are data driven, such that you could add a new pump simply by adding a row to a table with some dsn and tablename settings. B) I want proper error handling & logging features (MS SQL 2000 error handling is completely useless if you ever get a fatal error, which half of them are). I am determined to find a way to push millions of rows of MS SQL data via Coldfusion in an efficient manner (efficient being in no more than double the time it would take just using a "select into" across linked servers).

    Assume Server 1 = Source and Server 2 = Destination. Currently I'm looking at creating a global temp table on Server 2 that can be inner and left joined to the target table to determine inserts/updates. The question is how to get the data out of Server 1 and into Server 2's temp table efficiently. I do not especially want to create the temp table in Server 2 by looping thru some Server 1 cfquery result, and I'm wondering if there is any technique for "dumping" a query result into the database quickly without looping, something akin to "insert X select ...". I'm even open to suggestions about not using cfquery at all, but perhaps using osql and bcp calls. Any tips or starting points? :) Gracias.
  • Posted By Gus | 1/2/06 11:09 AM
    Will,

    From what you are describing I would definitely use DTS with MS SQL 2000.

    In regard to the two factors you mention:
    A) You can dynamically set the parameters for the data pump with DTS.

    B) DTS can gracefully handle all kinds of error situations. The key is to trap would be fatal errors and handle them appropriately.

    You can certainly also do this with BCP, but writing a text file of a couple million records will still be far more efficient with DTS or a stored procedure than via ColdFusion.

    Gus