Showing posts with label flat. Show all posts

Monday, February 27, 2012

Handling flat files that do not exist

Hello,

I have a package that contains 22 data flow tasks, one for each flat file that I need to process and import. I decided against making each import a separate package because I am loading the package in an external application and calling it from there.

Now, everything works beautifully when all my text files are exported from a data source beyond my control. I have an application that processes a series of files encoded using EBCDIC, and I am not always guaranteed that all the flat files will be exported. (There may not have been any data for the day.)

I am looking for suggestions on how to handle files that do not exist. I have tried making a package-level error handler (Script Task) that checks the error code ("System::ErrorCode"), and if it tells me that the file cannot be found, I return Dts.TaskResult = Dts.Results.Success, but that is not working for me; the package still fails. I have also thought about programmatically disabling the tasks that do not have a corresponding flat file, but that seems like overkill.

So I guess my question is this: if the file does not exist, how can I either a) skip the task in the package, or b) quietly handle the error and move on without failing the package?

Thanks!

Lee.

I recall that I have a couple of data flows in my package, each within a sequence container, connected with completion (blue) arrows, and even if the first one failed the second one continued on...

|||

That is one solution that I could use. It doesn't seem like it is the "elegant" way to handle it, but it will work.

Thanks for the suggestion.

|||

Well, would dragging the red (failure) arrow to some alert/Send Mail task not serve the purpose?

|||

Scoutn wrote:

Hello,

I have a package that contains 22 data flow tasks, one for each flat file that I need to process and import. I decided against making each import a separate package because I am loading the package in an external application and calling it from there.

Now, everything works beautifully when all my text files are exported from a data source beyond my control. I have an application that processes a series of files encoded using EBCDIC, and I am not always guaranteed that all the flat files will be exported. (There may not have been any data for the day.)

I am looking for suggestions on how to handle files that do not exist. I have tried making a package-level error handler (Script Task) that checks the error code ("System::ErrorCode"), and if it tells me that the file cannot be found, I return Dts.TaskResult = Dts.Results.Success, but that is not working for me; the package still fails. I have also thought about programmatically disabling the tasks that do not have a corresponding flat file, but that seems like overkill.

So I guess my question is this: if the file does not exist, how can I either a) skip the task in the package, or b) quietly handle the error and move on without failing the package?

Thanks!

Lee.

You can use a script task to check whether or not a file exists. This code should help:

File.Exists Method
(http://msdn2.microsoft.com/en-us/library/system.io.file.exists.aspx)
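
Something along these lines should give you the idea (a rough VB.NET sketch for an SSIS Script Task; the variable names User::FilePath and User::FileExists are just examples, not anything from your package):

' Script Task sketch: record whether the flat file exists in a Boolean package variable.
' Add User::FilePath as a ReadOnlyVariable and User::FileExists as a ReadWriteVariable on the task.
Imports System
Imports System.IO
Imports Microsoft.SqlServer.Dts.Runtime

Public Class ScriptMain

    Public Sub Main()
        Dim filePath As String = CStr(Dts.Variables("User::FilePath").Value)

        ' Store the result so downstream logic (e.g. a precedence constraint) can use it.
        Dts.Variables("User::FileExists").Value = File.Exists(filePath)

        ' The check itself always succeeds, whether or not the file is there.
        Dts.TaskResult = Dts.Results.Success
    End Sub

End Class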

-Jamie

|||

Jamie,

First allow me to say that your blog has been an invaluable resource. I have learned a lot from your articles, thank you!

Utsav, I am not looking to be notified of the error; I just want the task to merrily carry on like nothing took place. I do understand what you meant though, thank you.

Sometimes the flat file will be there and sometimes it won't be. If it isn't, I would still like the task to move on to the next one with a "Success" result, not just completion. That is why I was thinking of programmatically disabling the tasks that do not have corresponding flat files when I load the package from my application.

I guess I should be asking: when I set Dts.TaskResult = Dts.Results.Success, why does it not actually return success?

(I know there are a couple of ways I can get around this issue; I'm just looking for the easiest without setting the task to continue on "Completion" ;)

Thanks again,
Lee.

|||

Check if this thread has something that can help you.

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=859625&SiteID=1

Instead of disabling the task, try using an expression on the precedence constraint in the control flow.
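
For example, if a Script Task (like the sketch earlier in this thread) sets a Boolean variable such as User::FileExists (the name is just an example), you can set the constraint's evaluation operation to "Expression and Constraint" and use an expression along the lines of @[User::FileExists], so the data flow only runs when its file is present.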

Rafael Salas

|||

Thank you very much. I did search the forum; I guess I was just using the wrong search term.

Perfect!

|||

The thread (and the subsequent blog articles) solved my problem. I needed to put each task in a sequence container with its own script task checking for the file; if I used one script task to look for all the files, the next task would not run because at least one file would return false.

Thanks again for everyone's help!

Friday, February 24, 2012

Handling Empty Strings In DTS

Hello,
I have a transformation in which the column of data at the flat file source is nine characters long, and typically contains a string of six or seven zeros with a non-zero number in the last two or three characters. If none of the records in that column were an empty string, I think I could get away with this:

DTSDestination("TTLCrd") = CInt(DTSSource("Col004"))

The destination is a SQL Server 2000 table, and the column is of type Integer. What do I do when Col004 is an empty string? I've tried a couple of different IF statements, but they have not worked. Empty strings need to become zero values.
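
In other words, the logic I am after in the ActiveX transformation is roughly this (just a sketch using the column names above; the exact empty/NULL handling is the part I have not gotten right):

Function Main()
    ' Treat an empty (or NULL) Col004 as zero; otherwise convert it as before.
    If Len(Trim(DTSSource("Col004") & "")) = 0 Then
        DTSDestination("TTLCrd") = 0
    Else
        DTSDestination("TTLCrd") = CInt(DTSSource("Col004"))
    End If
    Main = DTSTransformStat_OK
End Function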

Thank you for your help.

cdun2

|||

Avoid using DTS to insert directly into production tables.
Avoid putting logic or code that manipulates data in your DTS package.
Use DTS for moving data from point A to point B. All the other features of DTS (or the new SSIS) are crap, and lead inevitably to bad application design.

Best practice is to use DTS to pipe your data to a staging table and then kick off a stored procedure to process the data in the staging table, verifying and cleansing the data before pushing it into the production tables. The stored procedure will hold all of the data logic, including the COALESCE() function, which will easily convert your NULL values to zeros.

|||

Thank you for your response!

cdun2

|||

Best practice is to use DTS to pipe your data to a staging table and then kick off a stored procedure to process the data in the staging table, verifying and cleansing the data before pushing it into the production tables. The stored procedure will hold all of the data logic, including the COALESCE() function, which will easily convert your NULL values to zeros.

So what you say is that you should have staging tables (potentially with all nvarchar fields if you, for instance, import from text files) and generate a whole lot of disk activity before you kick off stored procs to do all the work for you? I cannot see why one would want to do it that way. I'm pretty satisfied with the way SSIS works, and would very much like to know why you discourage use of the SSIS features.

|||

I'm totally with blindman on this. My question in turn would be: why would you want to incorporate data/business logic into discrete, proprietary DTS/SSIS packages? Virtually everything else I do in T-SQL unless there is no alternative. Even if I used DTS or SSIS at all, it would be to get stuff from outside the database to inside it with as little fuss as possible - nothing more. I see nothing gained by throwing these ETL tools at the problem when the standard SQL Server language is perfectly capable of everything I have come across so far.

You probably know though that I go further even than blindman and do not use DTS or SSIS at all.

As far as the disk usage is concerned, from my perspective I work with large batch systems with well-specced servers. We are not under big time or resource pressure when we load, so for me it is not a consideration.

|||

KISS.
By not creating code in my DTS package, I keep all of my logic in one place (the database) and in one language (SQL). That makes debugging much easier.
It also compartmentalizes my process, meaning I could use DTS, or BCP, or SSIS, or ASP, or a friggin' Access macro to load data, and my sproc will process it. I can even have multiple data flows going into the staging table.
My staging tables contain columns that record the source of each record, the time it entered the database, and a column for recording any processing errors. As my sproc cleanses, verifies, and loads the staging data, any discrepancies are noted within the ErrorStatus column. At the end of the process finding any records that failed and the reasons for their failure is a snap, and all I need to do is fix the existing staging records and reset the ErrorStatus column, and then I can rerun the sproc.
ETL has become bread and butter to me now. I can almost write these sprocs and packages with my eyes closed.
ROAC, how easily did your DTS packages upgrade to SSIS?

|||

As far as the disk usage is concerned, from my perspective I work with large batch systems with well-specced servers. We are not under big time or resource pressure when we load, so for me it is not a consideration.

Well, I see. In that case I would do the same. However, there are many, MANY companies around, especially in smaller countries, that cannot afford this kind of system. The gap between, for instance, an EVA4000 and an EVA6000 is huge for many companies, and they have to take disk performance into consideration.

|||

KISS.

ROAC, how easily did your DTS packages upgrade to SSIS?

Agreed. Keep things simple. If you have a lot of data sources, I find it easier to work with SSIS than to have tons of procs.

When it comes to upgrading, my DTS packages upgraded pretty well, but I know quite a few that did not, which of course is an issue. However, with almost the same arguments you come up with, you could say the work should be done in ASP.NET or Java instead, and then have issues when the .NET Framework or the Java language is upgraded. Or, for that matter, when SQL syntax changes.

As I said in my last post, please keep in mind that there are smaller countries and companies in the world. What's best for an enterprise is not necessarily best for a small or medium-sized business in Scandinavia or the Baltics. Thus, I think your advice perhaps should be something like "If you can afford ..., you should ...". Do you get my idea?

I have no problem seeing your points; I just cannot see it as the only solution.

|||

Well, I see. In that case I would do the same. However, there are many, MANY companies around, especially in smaller countries, that cannot afford this kind of system. The gap between, for instance, an EVA4000 and an EVA6000 is huge for many companies, and they have to take disk performance into consideration.

I might add that the organisation I worked for prior to this was certainly not large, nor specialist like my present company, but I stuck to the same philosophy then. Actually - I think I just did :)

The most expensive resource of all is the bum in the seat supporting the hands keying in the code. This bum is likely to know T-SQL if he/she is working with SQL Server. Why is it cost effective to introduce a GUI-based ETL tool into the mix when bog-standard T-SQL is perfectly capable? The worst short-term ROI an employer gets from me is when I am wrestling with a new language/GUI/tool etc. I think the time aspect of my point might be a pressure for using SSIS/DTS, but not the resources - I'm afraid I have no idea what the difference is between an EVA4000 and an EVA6000 :)

|||

ROAC, why do you think my method of storing the data in staging tables and then running a sproc against them is going to entail more disk activity/server resources than an SSIS package?

Presumably the amount of data being imported is some fraction of the data that already exists in production, so if the server is beefy enough to handle day-to-day processing it should not choke on running sprocs against staging data. Especially since these are normally run as batch processes during maintenance hours.

|||

As the data is written to disk twice instead of once, yes it WILL consume more disk resources. If you are lucky enough to do the whole job at night, well then you are definitely luckier than I am. As I said previously, if you can take the extra disk load, your approach is the best. I'm not questioning that. I just want to make it clear, for you and other readers, that your scenario is not the only one around. Other people may have other needs, which will lead to other solutions. Doubling the disk activity required to import data is not always an option. You are lucky enough to have that option, but is it so hard to believe that other people do not necessarily have the same situation as you?

I think you perhaps should open your eyes a bit and look around, because there are solutions out there behaving quite differently from those you are working with.

|||

Disk activity, hmmm...
Well, as a consultant who has created production and data warehouse ETL solutions for dozens of companies in many industries I can't say I've run into a situation where that was the deciding factor in the application architecture. But I'll grant such a situation may be possible.

Obviously I don't think my method is the only way to go. There are certainly many implementations using DTS, SSIS, or 3rd party tools (I recently had to suffer through a project where the client used a tool called DataStage).

The problem is, too many inexperienced people jump into creating DTS/SSIS solutions simply because THEY assume that THAT is the only method. When in reality (and I'm not backing down on this), these GUI tools are SELDOM the best method and lead to fragile designs that are difficult to debug or modify.

|||

BTW - ROAC - totally agreed that everything deserves evaluation in the context of the entire circumstance. There are no absolutes - that is why we are having this fun discussion where we are sharing our opinions :)

|||

It's all good. </TouchGloves>

|||

Lol - ROAC - I've just remembered my first post addressing you was when I thought you had made a somewhat absolute statement too. What goes around comes around, eh? On that one you were in agreement with blindman.

|||

I sit on the fence in these issues...

There are many things that I've written in DTS/SSIS where there is simple conversion of incoming data, and as long as there is a well defined way to deal with "bad" data this works very well. The thing that most DTS packages seemed to (incorrectly) assume was that all of the incoming data would be processed correctly on the first try.

When there are enough resources and time, the staging table approach works very well. You need to keep in mind that time is often my biggest problem, and that large amounts of data usually take large amounts of time to process. This can make it functionally impossible to stage some kinds of data because the processing window isn't large enough to support physically handling the data multiple times.

I don't have any inherent problem with either approach. Both work, and with the proper discipline for the ETL approach and sufficient resources (disk, time, etc.) for the staging approach, they can both produce the same results. Unfortunately, both discipline and resources are often scarce, so you need to find the best solution for the problem at hand.

-PatP

Handling double quotes

Hi
I am importing data from a table to a flat file (CSV). I have two problems:
1. If a column has commas (,), they should not create a new column; i.e., a column in the CSV file can contain commas.
2. If a column has double quotes, then the CSV file column should also have the double quotes. Please help me. I am using a Derived Column transformation, but I don't know how to search for double quotes in a string.

If my table has 2 columns:

col1 col2

a abc,"scfddf"ghisk

b bc,de

c de

my CSV file should look like this:

a abc,"scfddf" ghisk

b bc,de

c de

thanks

I don't understand... You define two columns in the file connection manager and just hook up the data flow columns to it.

In the examples you've given above, I don't see a need for the derived column transformation.

What are you seeing in your results?

|||

I agree; if you hook up an OLE DB Source to a Flat File destination in your data flow task, this should already get handled automatically.

Handling bad data

I have an SSIS package that takes in a flat file and pushes the data into a table. However, once in a while the file will have some bad data in it (for example, this particular time I have too many delimiters on one line). I want the package to redirect the row to an error table and keep going with the processing for the rest of the data. To do this, I have hooked up an error output from the flat file source to an OLE DB destination and assigned all the rows to Redirect Row on Error in the Error Output section of the flat file source. Unfortunately, it does not work! The flat file receives an error and stops. Is this because something different has to happen when it is a problem with the whole row? Any help would be appreciated, thanks.

|||

Make sure you visit the 'error output' page of the flat file source and change the value of the ERROR column to redirect row...

|||

Unfortunately, I've done this and it still doesn't work. I double checked and all of the columns in the Error Output section are set to Redirect Row.

|||

Perhaps you could have a first data flow that treats each line in the file as a single column; then with a script component you parse each row to count the number of delimiters and exclude those that exceed the expected number. Just a thought...

|||

That sounds like it should work! I haven't had a chance to work on it since I've been sidetracked into another project. Thanks for the help, I'm sorry that this took SO long for me to respond.

|||

Here's an example. Hopefully it helps:

http://agilebi.com/cs/blogs/jwelch/archive/2007/05/08/handling-flat-files-with-varying-numbers-of-columns.aspx
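
For reference, a rough Script Component sketch of the delimiter-counting idea suggested above (VB.NET; the input column name Line, the output column DelimiterCount, and the comma delimiter are assumptions for illustration - a downstream Conditional Split would keep rows whose count matches the expected number and send the rest to the error table):

' Script Component (transformation) sketch: count delimiters in each raw line.
Imports System
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Dim delimiterCount As Integer = 0

        If Not Row.Line_IsNull Then
            ' Count the commas in the line that was read as a single column.
            For Each c As Char In Row.Line
                If c = ","c Then delimiterCount += 1
            Next
        End If

        ' Expose the count so a Conditional Split can filter on it.
        Row.DelimiterCount = delimiterCount
    End Sub

End Class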
