Friday, August 29, 2008

SQL Server Bulk Insert Using a Format File to Read Data that also Includes a Comma in the String

There are a lot of resources out there on the net about using SQL Server Bulk Insert, and there are a lot of technical discussions about the inner details of Format Files. The problem is that there aren't many concrete examples of a very common need: importing a CSV (comma-separated) file where a quoted string also includes a comma. Here's an example of what some data might look like:
ID,Name,Phone 
100983,"Jones, Frank",555-1212
118928,"Smith, Joe",555-1313
115454,"Franklin, Alibaster",555-1414
Using a bulk insert statment without a format file would give us data that would look like this:
IDNamePhone
100983"JonesFrank"555-1212
118928"SmithJoe"555-1313
115454"FranklinAlibaster"555-1414
When in reality, we want the data to look like this:
100983Jones, Frank555-1212
118928Smith, Joe555-1313
115454Franklin, Alibaster555-1414
The format file is your friend! Basically, you'll create a format file that looks like this:
8.0
3
1 SQLCHAR 0 6 ",\"" 1 empid SQL_Latin1_General_Cp437_BIN 2 SQLCHAR 0 50 "\","
2 name SQL_Latin1_General_Cp437_BIN 3 SQLCHAR 0 10 "\n"
3 phone SQL_Latin1_General_Cp437_BIN
Here's how the above breaks down: 8.0 is the version. That's constant. For SQL Server 2005, anyway. :) 3 tells the Bulk Insert statement how many fields there are in each row. Then we have the three fields listed: 1, 2, and 3 are the field numbers. 6, 50, and 10 are the field sizes. You'll need these to match the data types that are defined in your table. Now we get to our field delimiters. In a format file, delimiters are specified in double quotes. If a double quote is a part of a delimiter, then it needs to be escaped with a backslash:
  • ",\"" (note the double double quotes) means, then, that there will be a comma followed by a double quote at the end of the first field. This is nice in that it does two things for us: first, it delimits the field, but it also will strip the first quote from the data as it's inserted.
  • "\"," means that there will be a quote followed by a comma at the end of the second field. In this way, we've skipped the comma that is in the data, and we've stripped the second quote from the data.
  • "\n" means that the last field will be delimited by an end-of-line character.
SQL_Latin1_General_Cp437_BIN has to do with the database collating setting, and many people simply leave that out of their format file without apparent ill effect. Now you can use a bulk insert statment like this:
bulk INSERT temptable FROM 'csvfilename' WITH (FIRSTROW = 2, FORMATFILE='formatfilename')
This will insert the CSV file into table TEMPTABLE, starting at the second row (we don't want to import the headers), using your new format file.

23 comments:

  1. How to do bulk insert for this formated file:

    "name, www.site.com, 23"

    ReplyDelete
  2. Hi, Vinod.
    It sounds like you're wanting to know what the format file would look like. In this case, it'd most likely look like the following. The first column simply is to get rid of the first quote. The second and third columns are delimited by a comma, and the last is delimited by a quote and an end-of-line character.
    Let me know how that works for you!


    8.0
    3
    1 SQLCHAR 0 255 "\"" 0 DummyField SQL_Latin1_General_Cp437_BIN
    1 SQLCHAR 0 50 "," 1 name SQL_Latin1_General_Cp437_BIN
    2 SQLCHAR 0 50 "\"," 2 url SQL_Latin1_General_Cp437_BIN
    3 SQLCHAR 0 10 "\n" 3 id SQL_Latin1_General_Cp437_BIN

    ReplyDelete
  3. Lane,

    This is exactly what I needed.

    But when using the format file, the bulk insert will skip the first row under header, unless you change the column header Name to "Name" in the .csv file.

    Any idea how to resolve this issue?


    Thanks.

    ReplyDelete
  4. Hi, Yuanfang,

    Are you referring to Vinod's comment above, or to the original post? If the original post, you ought to be able to "fix" this by modifying the bulk insert statement (instead of the format file) like this:

    bulk INSERT temptable FROM 'csvfilename' WITH (FIRSTROW = 1, FORMATFILE='formatfilename')

    In the example above, the first row was 2. If we don't have a header, we'll change that to 1, and it'll start at row 1.

    I sense that there may be more to your question, though; let me know if I'm not quite getting it!

    -Lane

    ReplyDelete
  5. Great post and thx for the example of the format file above.

    I've run into some other issues. If my users create a .csv file in Excel and one field has a comma in the value, only that field is enclosed by dbl quotes. No other fields in the column nor row are. Is there any way to handle that variability in the field separator?

    Here is an example of the file that Excel created:
    FirstName,LastName,Birthdate
    Brad,Benson,3/3/1970
    Sam,"Watson, Jr.",7/20/1971

    ReplyDelete
  6. Oof. That's a tough one. I don't think you can create a format file that'll handle that situation. The problem is that bulk insert treats all the rows and fields the same, so if you have variable field delimiters, it just won't work.

    So: it looks like the focus needs to be on the process used to create your csv file, instead. Is excel being used to create these files? If so, you could use another delimiter, like a tab character. Then you could set up your format file with '\t' as the field delimiter.

    I know that some processes for creating files like this are complex, and it's very difficult to change on the client side. If that's your situation, I'm afraid you'll probably have to look to SSIS (integration services), which ought to do what you need.
    But it's a bear to learn.

    I hope that helps; let us know what you come up with! -Lane

    ReplyDelete
  7. I just saw that you said quite clearly that they're creating these in Excel. That'll teach me to read more carefully.

    You might also consider setting up a macro for your users to generate a standardized file. Here's one I ran across some time ago:
    http://www.smokeylake.com/excel/text_write_program.htm
    It might get you going with minimal fuss, and if it is possible for it to simplify the users' lives, even better.

    ReplyDelete
  8. Thx for the replies, Lane.

    Unfortunately, it's a very open-ended process. I have little control over how the users supply the data. Currently I'm doing a line-by-line parse and inserting the data. It works for 95% of the cases when they don't upload an 8 MB file for parsing.

    For those advanced users taxing the system, I added an option that let the user tell me their file is formatted consistently so I can use the fast bulk insert approach.

    I guess if the .csv format itself was more rigidly defined and followed we wouldn't run into such issues. Msft likes to be flexible to make things easy on the users which can be hard on coders, but given the install base, we have no choice but to try to accommodate them.

    ReplyDelete
  9. Sigpop, I know what you mean: it'd be easier to be able to scrap a lot of processes and start all over again, sometimes, wouldn't it?
    For what it's worth, I think that SSIS would be a good option for you; it's quite powerful, but with a steep learning curve. I'm going to get around to posting an article on this soon (fingers crossed!); hopefully it'll help with your situation.

    ReplyDelete
  10. Sigpop, if you're still around, I've posted a guide on doing just this thing using SSIS. I'd love to know if it would fit your needs; we use it a lot for this kind of process. The post is here: http://lanestechblog.blogspot.com/2009/09/introduction-to-ssis.html

    ReplyDelete
  11. Hi Lane, is it possible to do a bulk insert with the quotation mark? If not how can I get rid of it automatic when doing bulk insert? I have a lot of csv file that need to import into a database and need a way to automatic remove the quotation mark.

    ReplyDelete
  12. Hung, if the quotes are consistent (that is: if the quotes are present in each row), you can just use them in your formatfile as the column delimiter. to piggyback on Sigpop's example, let's say this is the data we're wanting to import:

    FirstName,LastName,Birthdate
    "Brad","Benson",3/3/1970
    "Sam","Johnson",7/20/1971

    The formatfile, then, would look like this:

    8.0
    3
    1 SQLCHAR 0 255 "\"" 0 DummyField SQL_Latin1_General_Cp437_BIN
    1 SQLCHAR 0 50 "\",\"" 1 FirstName SQL_Latin1_General_Cp437_BIN
    2 SQLCHAR 0 50 "\"," 2 LastName SQL_Latin1_General_Cp437_BIN
    3 SQLCHAR 0 10 "\n" 3 Birthdate SQL_Latin1_General_Cp437_BIN

    ReplyDelete
  13. Basically what we're doing above is getting rid of the first quote as a dummy field, setting the delimiter between the first and second field as the string "," (including the quotes), which gets rid of the intervening quotes, setting the next delimiter as ", (again, including the quotes), which gets rid of the second string, and the final delimiter, of course, is an end of line character.

    If you can't count on the quotes being there, though, I believe you've got to use SSIS, which isn't terribly difficult, once you get the hang of it. I've got detailed instructions on this post: http://lanestechblog.blogspot.com/2009/09/introduction-to-ssis.html

    ReplyDelete
  14. Eeek. I just saw that, in copying an pasting, I copied an error from an earlier comment I made. Note that the Line numbers need to start with 1 and increment, not repeat #1, such that it ought to be:

    8.0
    3
    1 SQLCHAR 0 255 "\"" 0 DummyField SQL_Latin1_General_Cp437_BIN
    2 SQLCHAR 0 50 "\",\"" 1 FirstName SQL_Latin1_General_Cp437_BIN
    3 SQLCHAR 0 50 "\"," 2 LastName SQL_Latin1_General_Cp437_BIN
    4 SQLCHAR 0 10 "\n" 3 Birthdate SQL_Latin1_General_Cp437_BIN

    ReplyDelete
  15. Hi Lane.

    Thank you for the excellent example/guide that you wrote.

    I use bulk insert for importing the Cisco Call Manager CDR files using this guide, but I encountered strange problem where it skip my 3rd row and start importing 4th row onwards. This happen to all my files.
    I really wish if you can help me - if ok, can I send you sample to try it out?
    From
    Elton.

    ReplyDelete
  16. Elton,
    I'm happy to take a look, if you'd like. It sounds like the problem is that it's skipping not just the third row, but the first and second rows, as well; is that right?
    Or are the first two rows something besides data, like header information?

    In any case, if you'd like to post some of the data here in a comment, I'm sure it'd be helpful to everyone to see. I understand if that's not workable, though. I've set my blogger profile to display my email address. My profile is at www.blogger.com/profile/05548461086853951131.

    ReplyDelete
  17. Hi Lane.

    I really cannot believe that you replied and also responsed so fast.

    Thank you very much. I will send the sample script and cdr data to you via email. I really hope if you can solve my problem.

    From,
    Elton.

    ReplyDelete
  18. Hi Lane.

    I just sent a mail with script and sample data to you gmail account.

    regards,
    Elton

    ReplyDelete
  19. Dear fellows,

    Be aware that, at least when using FORMATFILE, when counting lines for using FIRSTROW, BULK INSERT does only take in acount the lines that comply to the described format.

    So, if you have the following 4 line file, you should use FIRSTROW=2 and not FIRSTROW=3.

    people description file
    firstname,lastname,age
    me,person,20
    you,person,20

    If you have the following 6 line file, you should use FIRSTROW=3.

    people description file
    counting 1, 2, 3 - testing
    too much people
    firstname,lastname,age
    me,person,20
    you,person,20

    good luck.
    Enrique

    ReplyDelete
  20. Hi Lane,

    I'm actually using the same guide that Elton used and am running in to the same problem. It's skipping the first record after the first two "header" rows and starts importing at the fourth row. Did you ever find a solution to his issue? I believe my format file is correct.

    Thanks.

    ReplyDelete
  21. Hi, I suspect this is a case of your headers not having field delimiters between them is that right? The thing is, BCP and BULK INSERT don't actually skip rows, so much as they skip *records*. See this page for a summary:
    http://msdn.microsoft.com/en-us/library/ms188365.aspx
    in particular:
    "The FIRSTROW attribute is not intended to skip column headers. Skipping headers is not supported by the BULK INSERT statement. When skipping rows, the SQL Server Database Engine looks only at the field terminators, and does not validate the data in the fields of skipped rows."

    I can't get to the Experts Exchange post, so I'm not sure of the details there, but I'd bet this is the case. If you're working on a Windows stack only, I suspect your best bet is to use a Powershell script (in a SQL Agent job, is how I'd do it) to generate a new file without the headers.

    Something along the lines of what is laid out here:
    http://stackoverflow.com/questions/2074271/remove-top-line-of-text-file-with-powershell

    Specifically,
    (Get-Content $file | Select-Object -Skip 1) | Set-Content $file

    Let us know if that fits the bill, or if I'm way off base, and we're looking at another problem, entirely.

    ReplyDelete
  22. Thanks Lane. I ended up re-writing it in SSIS... probably better in the long run anyway. Thanks again for the help!

    ReplyDelete
  23. Hi Lane,

    Wonder if the bulk insert statement could create the table automatically as we are importing csv with frequent change on the number of columns.

    Thanks.

    ReplyDelete

Thanks for leaving a comment!