Lane's Tech Blog: SQL Server Bulk Insert Using a Format File to Read Data that also Includes a Comma in the String

Friday, August 29, 2008

SQL Server Bulk Insert Using a Format File to Read Data that also Includes a Comma in the String

There are a lot of resources out there on the net about using SQL Server Bulk Insert, and there are a lot of technical discussions about the inner details of Format Files. The problem is that there aren't many concrete examples of a very common need: importing a CSV (comma-separated) file where a quoted string also includes a comma. Here's an example of what some data might look like:

ID,Name,Phone 

100983,"Jones, Frank",555-1212

118928,"Smith, Joe",555-1313

115454,"Franklin, Alibaster",555-1414

Using a bulk insert statment without a format file would give us data that would look like this:

ID	Name	Phone
100983	"Jones	Frank"	555-1212
118928	"Smith	Joe"	555-1313
115454	"Franklin	Alibaster"	555-1414

When in reality, we want the data to look like this:

100983	Jones, Frank	555-1212
118928	Smith, Joe	555-1313
115454	Franklin, Alibaster	555-1414

The format file is your friend! Basically, you'll create a format file that looks like this:

0

3

SQLCHAR 0 6 ",\"" 1 empid SQL_Latin1_General_Cp437_BIN
SQLCHAR 0 50 "\","

name SQL_Latin1_General_Cp437_BIN
SQLCHAR 0 10 "\n"

phone SQL_Latin1_General_Cp437_BIN

Here's how the above breaks down: 8.0 is the version. That's constant. For SQL Server 2005, anyway. :) 3 tells the Bulk Insert statement how many fields there are in each row. Then we have the three fields listed: 1, 2, and 3 are the field numbers. 6, 50, and 10 are the field sizes. You'll need these to match the data types that are defined in your table. Now we get to our field delimiters. In a format file, delimiters are specified in double quotes. If a double quote is a part of a delimiter, then it needs to be escaped with a backslash:

",\"" (note the double double quotes) means, then, that there will be a comma followed by a double quote at the end of the first field. This is nice in that it does two things for us: first, it delimits the field, but it also will strip the first quote from the data as it's inserted.
"\"," means that there will be a quote followed by a comma at the end of the second field. In this way, we've skipped the comma that is in the data, and we've stripped the second quote from the data.
"\n" means that the last field will be delimited by an end-of-line character.

SQL_Latin1_General_Cp437_BIN has to do with the database collating setting, and many people simply leave that out of their format file without apparent ill effect. Now you can use a bulk insert statment like this:

bulk INSERT temptable FROM 'csvfilename' WITH (FIRSTROW = 2, FORMATFILE='formatfilename')

This will insert the CSV file into table TEMPTABLE, starting at the second row (we don't want to import the headers), using your new format file.

28 comments:

UnknownNovember 14, 2008 at 5:13 PM
How to do bulk insert for this formated file:

"name, www.site.com, 23"
ReplyDelete
Replies
LaneNovember 29, 2008 at 7:38 PM
Hi, Vinod.
It sounds like you're wanting to know what the format file would look like. In this case, it'd most likely look like the following. The first column simply is to get rid of the first quote. The second and third columns are delimited by a comma, and the last is delimited by a quote and an end-of-line character.
Let me know how that works for you!

8.0
3
1 SQLCHAR 0 255 "\"" 0 DummyField SQL_Latin1_General_Cp437_BIN
1 SQLCHAR 0 50 "," 1 name SQL_Latin1_General_Cp437_BIN
2 SQLCHAR 0 50 "\"," 2 url SQL_Latin1_General_Cp437_BIN
3 SQLCHAR 0 10 "\n" 3 id SQL_Latin1_General_Cp437_BIN
ReplyDelete
Replies
UnknownFebruary 4, 2009 at 10:28 PM
Lane,

This is exactly what I needed.

But when using the format file, the bulk insert will skip the first row under header, unless you change the column header Name to "Name" in the .csv file.

Any idea how to resolve this issue?

Thanks.
ReplyDelete
Replies
LaneFebruary 5, 2009 at 1:52 PM
Hi, Yuanfang,

Are you referring to Vinod's comment above, or to the original post? If the original post, you ought to be able to "fix" this by modifying the bulk insert statement (instead of the format file) like this:

bulk INSERT temptable FROM 'csvfilename' WITH (FIRSTROW = 1, FORMATFILE='formatfilename')

In the example above, the first row was 2. If we don't have a header, we'll change that to 1, and it'll start at row 1.

I sense that there may be more to your question, though; let me know if I'm not quite getting it!

-Lane
ReplyDelete
Replies
UnknownJune 16, 2009 at 2:55 PM
Great post and thx for the example of the format file above.

I've run into some other issues. If my users create a .csv file in Excel and one field has a comma in the value, only that field is enclosed by dbl quotes. No other fields in the column nor row are. Is there any way to handle that variability in the field separator?

Here is an example of the file that Excel created:
FirstName,LastName,Birthdate
Brad,Benson,3/3/1970
Sam,"Watson, Jr.",7/20/1971
ReplyDelete
Replies
LaneJune 19, 2009 at 8:58 PM
Oof. That's a tough one. I don't think you can create a format file that'll handle that situation. The problem is that bulk insert treats all the rows and fields the same, so if you have variable field delimiters, it just won't work.

So: it looks like the focus needs to be on the process used to create your csv file, instead. Is excel being used to create these files? If so, you could use another delimiter, like a tab character. Then you could set up your format file with '\t' as the field delimiter.

I know that some processes for creating files like this are complex, and it's very difficult to change on the client side. If that's your situation, I'm afraid you'll probably have to look to SSIS (integration services), which ought to do what you need.
But it's a bear to learn.

I hope that helps; let us know what you come up with! -Lane
ReplyDelete
Replies
LaneJune 19, 2009 at 9:06 PM
I just saw that you said quite clearly that they're creating these in Excel. That'll teach me to read more carefully.

You might also consider setting up a macro for your users to generate a standardized file. Here's one I ran across some time ago:
http://www.smokeylake.com/excel/text_write_program.htm
It might get you going with minimal fuss, and if it is possible for it to simplify the users' lives, even better.
ReplyDelete
Replies
UnknownJune 22, 2009 at 9:36 AM
Thx for the replies, Lane.

Unfortunately, it's a very open-ended process. I have little control over how the users supply the data. Currently I'm doing a line-by-line parse and inserting the data. It works for 95% of the cases when they don't upload an 8 MB file for parsing.

For those advanced users taxing the system, I added an option that let the user tell me their file is formatted consistently so I can use the fast bulk insert approach.

I guess if the .csv format itself was more rigidly defined and followed we wouldn't run into such issues. Msft likes to be flexible to make things easy on the users which can be hard on coders, but given the install base, we have no choice but to try to accommodate them.
ReplyDelete
Replies
LaneJuly 8, 2009 at 10:06 PM
Sigpop, I know what you mean: it'd be easier to be able to scrap a lot of processes and start all over again, sometimes, wouldn't it?
For what it's worth, I think that SSIS would be a good option for you; it's quite powerful, but with a steep learning curve. I'm going to get around to posting an article on this soon (fingers crossed!); hopefully it'll help with your situation.
ReplyDelete
Replies
LaneSeptember 8, 2009 at 9:11 PM
Sigpop, if you're still around, I've posted a guide on doing just this thing using SSIS. I'd love to know if it would fit your needs; we use it a lot for this kind of process. The post is here: http://lanestechblog.blogspot.com/2009/09/introduction-to-ssis.html
ReplyDelete
Replies
UnknownSeptember 30, 2009 at 2:16 PM
Hi Lane, is it possible to do a bulk insert with the quotation mark? If not how can I get rid of it automatic when doing bulk insert? I have a lot of csv file that need to import into a database and need a way to automatic remove the quotation mark.
ReplyDelete
Replies
LaneSeptember 30, 2009 at 3:11 PM
Hung, if the quotes are consistent (that is: if the quotes are present in each row), you can just use them in your formatfile as the column delimiter. to piggyback on Sigpop's example, let's say this is the data we're wanting to import:

FirstName,LastName,Birthdate
"Brad","Benson",3/3/1970
"Sam","Johnson",7/20/1971

The formatfile, then, would look like this:

8.0
3
1 SQLCHAR 0 255 "\"" 0 DummyField SQL_Latin1_General_Cp437_BIN
1 SQLCHAR 0 50 "\",\"" 1 FirstName SQL_Latin1_General_Cp437_BIN
2 SQLCHAR 0 50 "\"," 2 LastName SQL_Latin1_General_Cp437_BIN
3 SQLCHAR 0 10 "\n" 3 Birthdate SQL_Latin1_General_Cp437_BIN
ReplyDelete
Replies
LaneSeptember 30, 2009 at 3:16 PM
Basically what we're doing above is getting rid of the first quote as a dummy field, setting the delimiter between the first and second field as the string "," (including the quotes), which gets rid of the intervening quotes, setting the next delimiter as ", (again, including the quotes), which gets rid of the second string, and the final delimiter, of course, is an end of line character.

If you can't count on the quotes being there, though, I believe you've got to use SSIS, which isn't terribly difficult, once you get the hang of it. I've got detailed instructions on this post: http://lanestechblog.blogspot.com/2009/09/introduction-to-ssis.html
ReplyDelete
Replies
LaneSeptember 30, 2009 at 3:27 PM
Eeek. I just saw that, in copying an pasting, I copied an error from an earlier comment I made. Note that the Line numbers need to start with 1 and increment, not repeat #1, such that it ought to be:

8.0
3
1 SQLCHAR 0 255 "\"" 0 DummyField SQL_Latin1_General_Cp437_BIN
2 SQLCHAR 0 50 "\",\"" 1 FirstName SQL_Latin1_General_Cp437_BIN
3 SQLCHAR 0 50 "\"," 2 LastName SQL_Latin1_General_Cp437_BIN
4 SQLCHAR 0 10 "\n" 3 Birthdate SQL_Latin1_General_Cp437_BIN
ReplyDelete
Replies
AnonymousFebruary 18, 2010 at 7:41 PM
Hi Lane.

Thank you for the excellent example/guide that you wrote.

I use bulk insert for importing the Cisco Call Manager CDR files using this guide, but I encountered strange problem where it skip my 3rd row and start importing 4th row onwards. This happen to all my files.
I really wish if you can help me - if ok, can I send you sample to try it out?
From
Elton.
ReplyDelete
Replies
LaneFebruary 18, 2010 at 9:50 PM
Elton,
I'm happy to take a look, if you'd like. It sounds like the problem is that it's skipping not just the third row, but the first and second rows, as well; is that right?
Or are the first two rows something besides data, like header information?

In any case, if you'd like to post some of the data here in a comment, I'm sure it'd be helpful to everyone to see. I understand if that's not workable, though. I've set my blogger profile to display my email address. My profile is at www.blogger.com/profile/05548461086853951131.
ReplyDelete
Replies
AnonymousFebruary 19, 2010 at 3:06 AM
Hi Lane.

I really cannot believe that you replied and also responsed so fast.

Thank you very much. I will send the sample script and cdr data to you via email. I really hope if you can solve my problem.

From,
Elton.
ReplyDelete
Replies
AnonymousFebruary 23, 2010 at 3:20 AM
Hi Lane.

I just sent a mail with script and sample data to you gmail account.

regards,
Elton
ReplyDelete
Replies
Enrique PessoaMarch 24, 2010 at 3:16 PM
Dear fellows,

Be aware that, at least when using FORMATFILE, when counting lines for using FIRSTROW, BULK INSERT does only take in acount the lines that comply to the described format.

So, if you have the following 4 line file, you should use FIRSTROW=2 and not FIRSTROW=3.

people description file
firstname,lastname,age
me,person,20
you,person,20

If you have the following 6 line file, you should use FIRSTROW=3.

people description file
counting 1, 2, 3 - testing
too much people
firstname,lastname,age
me,person,20
you,person,20

good luck.
Enrique
ReplyDelete
Replies
AnonymousApril 30, 2014 at 6:37 PM
Hi Lane,

I'm actually using the same guide that Elton used and am running in to the same problem. It's skipping the first record after the first two "header" rows and starts importing at the fourth row. Did you ever find a solution to his issue? I believe my format file is correct.

Thanks.
ReplyDelete
Replies
LaneMay 1, 2014 at 9:45 AM
Hi, I suspect this is a case of your headers not having field delimiters between them is that right? The thing is, BCP and BULK INSERT don't actually skip rows, so much as they skip *records*. See this page for a summary:
http://msdn.microsoft.com/en-us/library/ms188365.aspx
in particular:
"The FIRSTROW attribute is not intended to skip column headers. Skipping headers is not supported by the BULK INSERT statement. When skipping rows, the SQL Server Database Engine looks only at the field terminators, and does not validate the data in the fields of skipped rows."

I can't get to the Experts Exchange post, so I'm not sure of the details there, but I'd bet this is the case. If you're working on a Windows stack only, I suspect your best bet is to use a Powershell script (in a SQL Agent job, is how I'd do it) to generate a new file without the headers.

Something along the lines of what is laid out here:
http://stackoverflow.com/questions/2074271/remove-top-line-of-text-file-with-powershell

Specifically,
(Get-Content $file | Select-Object -Skip 1) | Set-Content $file

Let us know if that fits the bill, or if I'm way off base, and we're looking at another problem, entirely.
ReplyDelete
Replies
AnonymousMay 3, 2014 at 2:15 PM
Thanks Lane. I ended up re-writing it in SSIS... probably better in the long run anyway. Thanks again for the help!
ReplyDelete
Replies
AnonymousJune 5, 2014 at 12:35 AM
Hi Lane,

Wonder if the bulk insert statement could create the table automatically as we are importing csv with frequent change on the number of columns.

Thanks.
ReplyDelete
Replies
sriramAugust 11, 2018 at 12:17 AM

Well Said, you have furnished the right information that will be useful to anyone at all time. Thanks for sharing your Ideas.
MS office course in chennai
ReplyDelete
Replies
UnknownNovember 8, 2018 at 4:06 AM
Students might also benefit from being able to take additional publications on line all through the school year or in the course of the summer, either for catchup or for enrichment. in regards to enrichment, smaller and poorer excessive schools have a tendency to have fewer superior placement (ap) services, optionally available guides professional assignment writing service melbourne, and foreign language publications compared to large colleges with better assets (barker 1985)
ReplyDelete
Replies
Jennifer AnistonFebruary 12, 2019 at 6:39 AM
Excellent information on your blog, thank you for taking the time to share with us. Amazing insight you have on this, it’s nice to find a website the details so much information about different artists.

hell fest denim jacket
ReplyDelete
Replies
UnknownApril 25, 2019 at 6:46 AM
Attractive information on your blog, thank you for taking the time and share with us. Amazing insight you have on this, it’s nice to find a website the details so much information about different artists.batman girl jacket

ReplyDelete
Replies
AnonumousMay 9, 2025 at 1:22 AM
Searching for a dependable laundry service near me? Dry Cleaning Junction offers exceptional wash and ironing services, ensuring your clothes receive the best care. Trust us for fast, professional laundry solutions that come to you.
ReplyDelete
Replies

Add comment

Thanks for leaving a comment!