Friday, August 29, 2008

SQL Server Bulk Insert Using a Format File to Read Data that also Includes a Comma in the String

There are a lot of resources out there on the net about using SQL Server Bulk Insert, and there are a lot of technical discussions about the inner details of Format Files. The problem is that there aren't many concrete examples of a very common need: importing a CSV (comma-separated) file where a quoted string also includes a comma. Here's an example of what some data might look like:
100983,"Jones, Frank",555-1212
118928,"Smith, Joe",555-1313
115454,"Franklin, Alibaster",555-1414
Using a bulk insert statment without a format file would give us data that would look like this:
When in reality, we want the data to look like this:
100983Jones, Frank555-1212
118928Smith, Joe555-1313
115454Franklin, Alibaster555-1414
The format file is your friend! Basically, you'll create a format file that looks like this:
1 SQLCHAR 0 6 ",\"" 1 empid SQL_Latin1_General_Cp437_BIN 2 SQLCHAR 0 50 "\","
2 name SQL_Latin1_General_Cp437_BIN 3 SQLCHAR 0 10 "\n"
3 phone SQL_Latin1_General_Cp437_BIN
Here's how the above breaks down: 8.0 is the version. That's constant. For SQL Server 2005, anyway. :) 3 tells the Bulk Insert statement how many fields there are in each row. Then we have the three fields listed: 1, 2, and 3 are the field numbers. 6, 50, and 10 are the field sizes. You'll need these to match the data types that are defined in your table. Now we get to our field delimiters. In a format file, delimiters are specified in double quotes. If a double quote is a part of a delimiter, then it needs to be escaped with a backslash:
  • ",\"" (note the double double quotes) means, then, that there will be a comma followed by a double quote at the end of the first field. This is nice in that it does two things for us: first, it delimits the field, but it also will strip the first quote from the data as it's inserted.
  • "\"," means that there will be a quote followed by a comma at the end of the second field. In this way, we've skipped the comma that is in the data, and we've stripped the second quote from the data.
  • "\n" means that the last field will be delimited by an end-of-line character.
SQL_Latin1_General_Cp437_BIN has to do with the database collating setting, and many people simply leave that out of their format file without apparent ill effect. Now you can use a bulk insert statment like this:
bulk INSERT temptable FROM 'csvfilename' WITH (FIRSTROW = 2, FORMATFILE='formatfilename')
This will insert the CSV file into table TEMPTABLE, starting at the second row (we don't want to import the headers), using your new format file.


  1. How to do bulk insert for this formated file:

    "name,, 23"

  2. Hi, Vinod.
    It sounds like you're wanting to know what the format file would look like. In this case, it'd most likely look like the following. The first column simply is to get rid of the first quote. The second and third columns are delimited by a comma, and the last is delimited by a quote and an end-of-line character.
    Let me know how that works for you!

    1 SQLCHAR 0 255 "\"" 0 DummyField SQL_Latin1_General_Cp437_BIN
    1 SQLCHAR 0 50 "," 1 name SQL_Latin1_General_Cp437_BIN
    2 SQLCHAR 0 50 "\"," 2 url SQL_Latin1_General_Cp437_BIN
    3 SQLCHAR 0 10 "\n" 3 id SQL_Latin1_General_Cp437_BIN

  3. Lane,

    This is exactly what I needed.

    But when using the format file, the bulk insert will skip the first row under header, unless you change the column header Name to "Name" in the .csv file.

    Any idea how to resolve this issue?


  4. Hi, Yuanfang,

    Are you referring to Vinod's comment above, or to the original post? If the original post, you ought to be able to "fix" this by modifying the bulk insert statement (instead of the format file) like this:

    bulk INSERT temptable FROM 'csvfilename' WITH (FIRSTROW = 1, FORMATFILE='formatfilename')

    In the example above, the first row was 2. If we don't have a header, we'll change that to 1, and it'll start at row 1.

    I sense that there may be more to your question, though; let me know if I'm not quite getting it!


  5. Great post and thx for the example of the format file above.

    I've run into some other issues. If my users create a .csv file in Excel and one field has a comma in the value, only that field is enclosed by dbl quotes. No other fields in the column nor row are. Is there any way to handle that variability in the field separator?

    Here is an example of the file that Excel created:
    Sam,"Watson, Jr.",7/20/1971

  6. Oof. That's a tough one. I don't think you can create a format file that'll handle that situation. The problem is that bulk insert treats all the rows and fields the same, so if you have variable field delimiters, it just won't work.

    So: it looks like the focus needs to be on the process used to create your csv file, instead. Is excel being used to create these files? If so, you could use another delimiter, like a tab character. Then you could set up your format file with '\t' as the field delimiter.

    I know that some processes for creating files like this are complex, and it's very difficult to change on the client side. If that's your situation, I'm afraid you'll probably have to look to SSIS (integration services), which ought to do what you need.
    But it's a bear to learn.

    I hope that helps; let us know what you come up with! -Lane

  7. I just saw that you said quite clearly that they're creating these in Excel. That'll teach me to read more carefully.

    You might also consider setting up a macro for your users to generate a standardized file. Here's one I ran across some time ago:
    It might get you going with minimal fuss, and if it is possible for it to simplify the users' lives, even better.

  8. Thx for the replies, Lane.

    Unfortunately, it's a very open-ended process. I have little control over how the users supply the data. Currently I'm doing a line-by-line parse and inserting the data. It works for 95% of the cases when they don't upload an 8 MB file for parsing.

    For those advanced users taxing the system, I added an option that let the user tell me their file is formatted consistently so I can use the fast bulk insert approach.

    I guess if the .csv format itself was more rigidly defined and followed we wouldn't run into such issues. Msft likes to be flexible to make things easy on the users which can be hard on coders, but given the install base, we have no choice but to try to accommodate them.

  9. Sigpop, I know what you mean: it'd be easier to be able to scrap a lot of processes and start all over again, sometimes, wouldn't it?
    For what it's worth, I think that SSIS would be a good option for you; it's quite powerful, but with a steep learning curve. I'm going to get around to posting an article on this soon (fingers crossed!); hopefully it'll help with your situation.

  10. Sigpop, if you're still around, I've posted a guide on doing just this thing using SSIS. I'd love to know if it would fit your needs; we use it a lot for this kind of process. The post is here:

  11. Hi Lane, is it possible to do a bulk insert with the quotation mark? If not how can I get rid of it automatic when doing bulk insert? I have a lot of csv file that need to import into a database and need a way to automatic remove the quotation mark.

  12. Hung, if the quotes are consistent (that is: if the quotes are present in each row), you can just use them in your formatfile as the column delimiter. to piggyback on Sigpop's example, let's say this is the data we're wanting to import:


    The formatfile, then, would look like this:

    1 SQLCHAR 0 255 "\"" 0 DummyField SQL_Latin1_General_Cp437_BIN
    1 SQLCHAR 0 50 "\",\"" 1 FirstName SQL_Latin1_General_Cp437_BIN
    2 SQLCHAR 0 50 "\"," 2 LastName SQL_Latin1_General_Cp437_BIN
    3 SQLCHAR 0 10 "\n" 3 Birthdate SQL_Latin1_General_Cp437_BIN

  13. Basically what we're doing above is getting rid of the first quote as a dummy field, setting the delimiter between the first and second field as the string "," (including the quotes), which gets rid of the intervening quotes, setting the next delimiter as ", (again, including the quotes), which gets rid of the second string, and the final delimiter, of course, is an end of line character.

    If you can't count on the quotes being there, though, I believe you've got to use SSIS, which isn't terribly difficult, once you get the hang of it. I've got detailed instructions on this post:

  14. Eeek. I just saw that, in copying an pasting, I copied an error from an earlier comment I made. Note that the Line numbers need to start with 1 and increment, not repeat #1, such that it ought to be:

    1 SQLCHAR 0 255 "\"" 0 DummyField SQL_Latin1_General_Cp437_BIN
    2 SQLCHAR 0 50 "\",\"" 1 FirstName SQL_Latin1_General_Cp437_BIN
    3 SQLCHAR 0 50 "\"," 2 LastName SQL_Latin1_General_Cp437_BIN
    4 SQLCHAR 0 10 "\n" 3 Birthdate SQL_Latin1_General_Cp437_BIN

  15. Hi Lane.

    Thank you for the excellent example/guide that you wrote.

    I use bulk insert for importing the Cisco Call Manager CDR files using this guide, but I encountered strange problem where it skip my 3rd row and start importing 4th row onwards. This happen to all my files.
    I really wish if you can help me - if ok, can I send you sample to try it out?

  16. Elton,
    I'm happy to take a look, if you'd like. It sounds like the problem is that it's skipping not just the third row, but the first and second rows, as well; is that right?
    Or are the first two rows something besides data, like header information?

    In any case, if you'd like to post some of the data here in a comment, I'm sure it'd be helpful to everyone to see. I understand if that's not workable, though. I've set my blogger profile to display my email address. My profile is at

  17. Hi Lane.

    I really cannot believe that you replied and also responsed so fast.

    Thank you very much. I will send the sample script and cdr data to you via email. I really hope if you can solve my problem.


  18. Hi Lane.

    I just sent a mail with script and sample data to you gmail account.


  19. Dear fellows,

    Be aware that, at least when using FORMATFILE, when counting lines for using FIRSTROW, BULK INSERT does only take in acount the lines that comply to the described format.

    So, if you have the following 4 line file, you should use FIRSTROW=2 and not FIRSTROW=3.

    people description file

    If you have the following 6 line file, you should use FIRSTROW=3.

    people description file
    counting 1, 2, 3 - testing
    too much people

    good luck.


Thanks for leaving a comment!