Data Flows
Hi Experts-
I'm new to DWH projects, but I have good experience implementing mainframe application development projects. My lead asked me to provide data flows for the scenarios below.
1 Error Handling
So far I have the flow below:
Source table/File ==> Data Checks ==> Data Transformations
Records that fail the data checks will be sent to an ERROR folder on the server, in the same format as the source table.
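For illustration, a rough Python sketch of that routing step (the column name and check rule are invented; real checks would come from your requirements):

```python
def run_checks(row):
    """Example data check: the (hypothetical) 'amount' column must be numeric."""
    try:
        float(row["amount"])
        return True
    except (ValueError, KeyError):
        return False

def split_rows(rows):
    """Route rows that fail the checks to an error list, keeping the source format."""
    good, errors = [], []
    for row in rows:
        (good if run_checks(row) else errors).append(row)
    return good, errors

# Good rows continue to transformations; error rows get written to the ERROR folder.
good, errors = split_rows([{"amount": "10.5"}, {"amount": "abc"}])
```

The error rows keep the exact shape of the source records, so they can be re-submitted once corrected.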
2 Source Table/File Not available/Validation/Archival
I need a flow for when the required source table/file is not available in the specified folder.
I need a flow to validate a source table/file against our requirements, e.g. the table/file should have the required columns. If it does not have the required columns, it should be errored out.
I need a flow to archive the source tables/files after a specified scheduled time slot.
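A minimal sketch of the availability and column-validation checks, assuming file-based sources (the required column names are made up):

```python
import os

REQUIRED_COLUMNS = {"customer_id", "amount", "txn_date"}  # hypothetical requirement

def missing_columns(header):
    """Columns required by the spec that are absent from the file header."""
    return sorted(REQUIRED_COLUMNS - set(header))

def validate_source(path, header):
    """Return a list of problems: file missing, or required columns missing."""
    if not os.path.exists(path):
        return [f"source file not found: {path}"]
    missing = missing_columns(header)
    return [f"missing required columns: {missing}"] if missing else []
```

If the returned list is non-empty, the flow would stop and raise the error instead of proceeding to transformations.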
Please share your suggestions for implementing or drawing the above flows with good practices.
Thanks in advance,
-Balas
Re: Data Flows
Hi Balas,
Here are my suggestions:
1 Error Handling
I load data from files into SQL staging tables as the first step, with every column defined as text, e.g. nvarchar(4000). Error handling can then be managed with additional columns on the staging tables, rather than by creating more files. Data management in an RDBMS is always going to be easier and more robust, and you open up powerful querying capabilities to understand the input data.
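To make that concrete, here is a sketch of the staging pattern using SQLite from Python (table and column names are invented; in SQL Server the columns would be nvarchar(4000) rather than TEXT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stg_orders (
        order_id    TEXT,   -- every source column lands as text, so no load-time type failures
        amount      TEXT,
        order_date  TEXT,
        load_errors TEXT    -- extra column: per-row error notes instead of an error file
    )
""")
conn.execute("INSERT INTO stg_orders VALUES (?, ?, ?, ?)",
             ("1000", "25.00", "2012-04-25", None))
conn.execute("INSERT INTO stg_orders VALUES (?, ?, ?, ?)",
             ("1001", "not-a-number", "2012-04-26", "amount is not numeric"))

# Because staging is a table, problem rows can simply be queried:
bad = conn.execute(
    "SELECT order_id FROM stg_orders WHERE load_errors IS NOT NULL ORDER BY order_id"
).fetchall()
```

Everything loads; the error notes travel with the rows and are available to any SQL query.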
I use the "aggressive load" strategy, so there typically isn't any condition that will make me "error" an entire record. Invalid data is substituted on a column-by-column basis with 'Unspecified' / NULL etc. as appropriate, but the remainder of the data in that record will still be loaded.
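A small sketch of that per-column substitution (the column names and fallback values are examples, not a fixed rule):

```python
def clean_row(row):
    """Aggressive load: never reject the record; substitute bad values column by column."""
    out = dict(row)
    try:
        out["amount"] = float(row.get("amount", ""))
    except ValueError:
        out["amount"] = None            # invalid number -> NULL
    if not row.get("country"):
        out["country"] = "Unspecified"  # missing text -> 'Unspecified'
    return out
```

The record always makes it into the warehouse; only the individual unusable values are replaced.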
2 Source Table/File Not available/Validation
This is typically rare, so I don't build anything specific for it. Default error handling is usually fine: the package fails and reports an error, e.g. "column X not found in table Y". I write processing logs to a SQL table for the best management and querying functionality.
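The log-to-a-table idea might look like this (sketched with SQLite; the table layout is one plausible shape, not a standard):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_log (
        run_ts  TEXT,   -- UTC timestamp of the event
        package TEXT,   -- which load package raised it
        status  TEXT,   -- e.g. OK / WARN / ERROR
        message TEXT
    )
""")

def log_event(package, status, message):
    """Append one processing event; later runs can be queried with plain SQL."""
    conn.execute(
        "INSERT INTO etl_log VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), package, status, message),
    )

log_event("load_orders", "ERROR", "column X not found in table Y")
```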
Archival
Per my first point above: if data is loaded into staging tables you can add extra columns to manage when to archive, and use standard ETL tools to move archived data around as required.
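For instance, an extra load-date column plus an archive flag lets a scheduled job mark old batches with one UPDATE (sketched with SQLite; the cutoff rule is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stg (id TEXT, loaded_at TEXT, archived INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO stg (id, loaded_at) VALUES (?, ?)",
    [("1", "2012-01-01"), ("2", "2012-04-01")],
)
# Scheduled archival: flag everything loaded before the cutoff date.
conn.execute("UPDATE stg SET archived = 1 WHERE loaded_at < ?", ("2012-03-01",))
```

Flagged rows can then be moved or purged by the regular ETL tooling.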
Good luck!
Mike
Re: Data Flows
I agree with Mike on the 'aggressive load' approach. In fact, I have never built an ETL system that rejected rows for any reason. The only rejects would be entire batches, due to corruption in the source file.
I've always wondered why so much emphasis is given to the pre-process. I suspect it's a trickle-down from Basel and Sarbanes-Oxley. Unless you have a really poor data source, the DW should accurately reflect the source, for better or worse, and you should not spend an inordinate amount of effort 'validating' what should already be valid. You don't conduct business in the data warehouse; rather, you track what business was conducted. If the systems by which you are conducting business are not correct, why should the DW be?
Data auditing and validation in the DW should be a post-load process that provides feedback to correct the data in the source system. If you reject incorrect data from the source, you cannot use the reporting facilities in the DW to report on and monitor it.
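A post-load audit can be an ordinary query over the loaded facts (sketched with SQLite; the table, columns, and audit rule are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (order_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)",
    [("1", 10.0), ("2", -5.0), ("3", None)],
)
# Post-load audit: report suspect rows instead of rejecting them at load time,
# so the findings can be fed back to correct the source system.
suspect = conn.execute(
    "SELECT order_id FROM fact_sales WHERE amount IS NULL OR amount < 0 "
    "ORDER BY order_id"
).fetchall()
```

Because the bad rows were loaded rather than rejected, the DW's own reporting can track them until the source is fixed.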