ColdFusion Muse

CSV Files and Coldfusion - a New Approach

Mark Kruger February 5, 2007 12:45 PM Coldfusion Tips and Techniques Comments (32)

Many times we have been tasked with working out a way for an ordinary user to "upload" data into a system. This task is fraught with peril because users are not well versed in the idea of "normalized" data. We've tried the Excel file approach. Excel gives the user so many neat options they can't resist adding formatting, charts, graphs and fiddling with the headers. Even with protected files, clever users find ways to make the data incomplete or unusable. We've also tried access. This is workable but it takes more setup and it requires extra expense on the part of the user (namely, they must have excel). Enter the famed low-tech "csv" or "comma separated value" file.

This file is not rocket science by any means. It is a way to represent tabular data in a "plain text" fashion using commas and quotation marks. The commas separate the fields and the quotation marks identify the values (where needed). When are the quotation marks needed? Well, they are particularly needed when a comma might creep into a value. For example:

"Joe","Smith","112 Test Street"
"Sam","Jones, Jr.",114 test street"
Notice that Mr. Jones is a "Jones, Jr". Without the quotations marks as text qualifier that row of data would have 4 pieces of data instead of 3, breaking normalization.

So CSV is a common "lowest denominator" format we can use to get data into our system. In fact, we work with data feeds for financial and analyst data quite often and many of them provide their data in this format by default. So learning how to work with CSV is a good choice for upgrading your skills.

The problem

Coldfusion does not provide some easy way to read this data into an array or query. Working with it as a set of lists is ok, but how do you get around the "text qualifier" issue? Consider this post by Ben Nadel (who is no CF_slouch). His function reads a CSV string into an array. It's a good function, but long and involved. Another approach (one which I've used a few times) is to place the csv file at a URL and use CFHTTP to read it into a query object. There is nothing wrong with either of these approaches, but if you are on a windows box I've found an easier way.

Use the MS Text Driver

You can use the MS text driver to create a proxy for your file. Here are the steps.

  1. Setup a Data source - Set up a data source pointing to the MS Text driver. In CFMX this is done by creating an ODBC alias in the ODBC control panel and then using the "ODBC Socket" driver to connect to it. On a CF 5 box it will be in the drop down list.
  2. Use CF Query To Connect To Your File - The text driver serves as a container for any csv file to which you wish to connect. The code is easy.
    <Cfset FilePath = "C:\userfiles\upload.csv">

    <cfquery name="CSVData" datasource="GenericCSV">
        SELECT *
        FROM #filePath#
    </cfquery>
    This call will return a query object with your normalized data.
There are a couple of things to note. The first row of your CSV file will contain the column names. I'm sure there are ways around that and researching the text driver a little will unearth them. Also remember that the cleverest technique can be foiled by the least clever user. This technique works when the CSV file conforms to the proper standard.

Finally, you might note that this entry is similar to a popular technique I wrote about in this post on Connection DSN in CFMX. In the case of CSV files I suspect it will make life a lot easier.

  • Share:

Related Blog Entries

32 Comments

  • Steve Nelson's Gravatar
    Posted By
    Steve Nelson | 2/5/07 10:55 AM
    Man, i like that solution! Have you tried this on REALLY big csv files? I was just doing some data importing 2 weeks ago with a csv file that had 1/2 million records. I ended up doing a sql server bulk insert into a temporary table, then doing an insert/select statement into the normalized tables. It ended up being a very fast approach. Plus the bulk insert could handle the "qualified" list (i think that's what they call it).
  • JAlpino's Gravatar
    Posted By
    JAlpino | 2/5/07 12:21 PM
    Interesting technique, I wonder if this means you can perform aggregating functions on the initial import (ie SUM(), AVG(), ect...) or left to a QoQ after the fact
  • Mark Kruger's Gravatar
    Posted By
    Mark Kruger | 2/5/07 12:33 PM
    Steve, I haven't tried it on a really big file, but I suspect that it will be a better choice that the alternatives with CF (all of which involve reading the file into memory). I think the driver will do a better job of chunking the data or block factoring or whatever. At least that is what I suspect. If I have the chance I'll test it and repost.

    Jalpino - I haven't had the chance to dig up the docs for it... I'm sure there is at least a subset of stuff you can do, but given the propensity of CSV files to be full of lousy data I would hesitate :).
  • Steve Nelson's Gravatar
    Posted By
    Steve Nelson | 2/5/07 12:42 PM
    Mark i have a feeling it'll be very fast with large CSVs. I bet M$ uses the same engine in SQL Server's bulk insert to read it in. When i switched to bulk insert there was about 100 fold speed increase.
  • William Haun's Gravatar
    Posted By
    William Haun | 2/5/07 9:15 PM
    One word - "genius"
  • Dave Quested's Gravatar
    Posted By
    Dave Quested | 2/6/07 2:57 PM
    Great post and very handy. Just a note, I tried this in the root c:\ and it reported being unable to find the file. It was definitely there. Moved to a sub directory, and bingo it worked. Not sure what that is about, some permissions thing I guess. Thanks for the post.
  • Dave Quested's Gravatar
    Posted By
    Dave Quested | 2/6/07 3:00 PM
    ... not that you'd ever put there of course :)
  • mark kruger's Gravatar
    Posted By
    mark kruger | 2/6/07 3:20 PM
    No no... of course not. Only to prove that you should NOT put it there :)
  • grego's Gravatar
    Posted By
    grego | 2/20/07 8:40 AM
    Dude, you saved my life, or at least my afternoon and evening.
  • gg's Gravatar
    Posted By
    gg | 3/13/07 10:57 PM
    i'm using this query to page the data
    "select top 1000 * from [" + filetable + "] WHERE F1 NOT EXIST(SELECT TOP ((10-1)*1000) F1 FROM [" + filetable + "] ORDER BY F1 DESC)"

    and getting this error
    ERROR [42000] [Microsoft][ODBC Text Driver] Syntax error in query expression 'F1 NOT EXIST(SELECT TOP ((10-1)*1000) F1 FROM [myfile.csv] ORDER BY F1 DESC)'.
    any help
    thanks in advance
    gg
  • MIDOSH's Gravatar
    Posted By
    MIDOSH | 3/14/07 10:04 AM
    can i do the same procedure for reading excel files
  • midosh's Gravatar
    Posted By
    midosh | 3/14/07 10:06 AM
    i need to read excel files into a query, how do i do that.
  • tof's Gravatar
    Posted By
    tof | 3/18/07 10:23 PM
    I don't remember the reasons, but at work we started using the merant text-driver because of some limitation on the ms one.
  • Ryan Stille's Gravatar
    Posted By
    Ryan Stille | 4/12/07 2:00 PM
    A few things I've learned:

    - Both the ColdFusion ODBC Server and Agent services must be running.
    - The first line *must* contain column names. If not, and your first line ends up with data like: Ryan,201 B st,"omaha, NE",68154 - ColdFusion will choke on the comma inside the quotes.
    - When referencing the filename in your CFQUERY tag, do NOT use quotes around it (single or double). This will cause the "The Microsoft Jet database engine could not find the object" error I've seen mentioned on cf-talk.
  • mark kruger's Gravatar
    Posted By
    mark kruger | 4/12/07 2:14 PM
    Ryan - just to clarify - your comments refer to using this on CFMX. On productions servers - unless for some God awful reason we need access, we usually stop these 2 services. Now it appears we have a reason to enable them again (ha)>
  • Davide's Gravatar
    Posted By
    Davide | 7/25/07 9:18 AM
    Does this solution work with CF7?
    I created the Data Source in User DSN and SystemDSN and pointed to the directory and file where the csv is located but, I keep getting "Unable to retrieve error message from backend ODBC Driver." when trying to set the DSN in CF admin.
  • David Marc's Gravatar
    Posted By
    David Marc | 1/9/08 2:32 PM
    Some help for those lost:

    It's worth noting that the ODBC controll panel is actually in your windows platform. Go to cpanel / admin tools/ data sources (ODBC) / Click on the System DSN tab, and then click add. Use a microsoft text reader driver, and name the datasoruce.

    Then after that go into the coldfusion datasource choose odbc socket, and the one you made via windows should be in a dropdown in the ODBC socket setup screen.

    Works fine for me now.

    Peace
  • swamy's Gravatar
    Posted By
    swamy | 1/9/08 11:13 PM
    Hi , I am using merant text driver. while referring the columns which contains a space in it i.e IP Address , is giving error [MERANT][ODBC Text driver]Field not found: IP . please any one suggest how to refer the such columns of csv file in the CF query.
  • david marc's Gravatar
    Posted By
    david marc | 1/12/08 12:28 PM
    Also, if you find that when you try to start the coldfusion odbc service and it gives you an error message similar to "cannot find file, etc." then its beacuse you have coldfusion 8, and you didnt install documentation when you installed. It's a bug... re-install with full docs.
  • Rafi's Gravatar
    Posted By
    Rafi | 8/4/08 7:28 AM
    Hi , I wanted to read an excel file using coldfuison and update the oracle table that i have created with just few columns read from the excel file. Not all columns has to be updated. how tdo i map the specific columns in excel file to the table columns. Can anyone help me out with a best solution. I would be greatfull to all who help me.
    Thnaks.
  • Erik's Gravatar
    Posted By
    Erik | 8/26/08 3:22 PM
    If the resposibility for the content of the CSV file is the enduser, is there a way to check for or ignore commas in the data. I understand that quotes will work but is it still the responsibility of the end user to make sure the quotes are there?
  • ben's Gravatar
    Posted By
    ben | 1/28/09 7:03 PM
    You can change the delimiters (eg. read a tab delimited file) or specify column names and types by putting a schema.ini file in the directory where the text file is.
    http://msdn.microsoft.com/en-us/library/ms709353.a...

    does anyone know if you can specify this stuff using a connection string or something in the query?
  • BYJ's Gravatar
    Posted By
    BYJ | 8/5/09 4:28 PM
    Can this work in unix? anyone has example please?
  • mark kruger's Gravatar
    Posted By
    mark kruger | 8/5/09 5:07 PM
    To make it work in Linux or unix would require a text driver that works for linux/unix and probably a different syntax.
  • Brendan's Gravatar
    Posted By
    Brendan | 10/9/09 12:49 AM
    Any idea how to handle file paths that have blank spaces in them?
  • robert macri's Gravatar
    Posted By
    robert macri | 10/31/09 11:59 AM
    Great post. Just notice some strange things with what I assume to be the MS text driver. If the CSV has a field with an IP address, it converts it to a decimal number field , instead of leaving it a s a text field. An ip like 192.168.1.1 get read as192.1681.
    Any one have an idea how to get around this problem?
  • mark kruger's Gravatar
    Posted By
    mark kruger | 10/31/09 12:05 PM
    @Robert,

    If you are using a text qualifier in your CSV file that might do it. If you are exporting the CSV from Excel try adding a single quote (') in front of the IP address in the Excel file (that' ids it as a text field) before you export. I've not had this specific problem though - so no guarantees :)
  • robert macri's Gravatar
    Posted By
    robert macri | 12/13/09 10:50 AM
    Hi,

    No database coming from a Linux application. Can not change anything on CSV side. It there a way to set-up a data source definition file?
  • LuKeNuKuM's Gravatar
    Posted By
    LuKeNuKuM | 9/29/10 9:13 AM
    great post, thanks so much - saved much work with looping over lines and fields in a CSV. perfect.
  • Paul's Gravatar
    Posted By
    Paul | 11/9/11 8:21 PM
    Any idea if it's possible to run several different queries consecutively and, for each query, APPEND data to the SAME CSV file?
  • Phil Rodopoulos's Gravatar
    Posted By
    Phil Rodopoulos | 1/16/12 3:37 PM
    Great post Mark! Definitely faster than CF's methods.

    Question, when I attempt to append a where clause to the query it returns 0 results. Is that because it's querying a CSV file or because I'm doing something wrong?

    Ex.

    SELCT *
    FROM #filename#
    WHERE fieldname = 'something'
  • Greg Spradling's Gravatar
    Posted By
    Greg Spradling | 1/1/13 7:32 PM
    Brilliant!