Data Flows
Hi Experts-
I'm new to DWH projects, but I have good experience implementing mainframe application development projects. My lead asked me to provide data flows for the scenarios below.
1 Error Handling
So far I have the flow below:
Source table/File ==> Data Checks ==> Data Transformations
Records that fail the data checks will be sent to an ERROR folder on the server, in the same format as the source table.
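For illustration, a rough Python sketch of that routing step (the column name and check rule are invented; real checks would come from your requirements):

```python
def run_checks(row):
    """Example data check: the (hypothetical) 'amount' column must be numeric."""
    try:
        float(row["amount"])
        return True
    except (ValueError, KeyError):
        return False

def split_rows(rows):
    """Route rows that fail the checks to an error list, keeping the source format."""
    good, errors = [], []
    for row in rows:
        (good if run_checks(row) else errors).append(row)
    return good, errors

# Good rows continue to transformations; error rows get written to the ERROR folder.
good, errors = split_rows([{"amount": "10.5"}, {"amount": "abc"}])
```

The error rows keep the exact shape of the source records, so they can be re-submitted once corrected.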
2 Source Table/File Not available/Validation/Archival
I need a flow for when the required source table/file is not available in the specified folder.
I need a flow to validate a source table/file against our requirements, e.g. the table/file should have the required columns. If it does not have the required columns, it should be errored out.
I need a flow to archive the source tables/files after a specified scheduled time slot.
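A minimal sketch of the availability and column-validation checks, assuming file-based sources (the required column names are made up):

```python
import os

REQUIRED_COLUMNS = {"customer_id", "amount", "txn_date"}  # hypothetical requirement

def missing_columns(header):
    """Columns required by the spec that are absent from the file header."""
    return sorted(REQUIRED_COLUMNS - set(header))

def validate_source(path, header):
    """Return a list of problems: file missing, or required columns missing."""
    if not os.path.exists(path):
        return [f"source file not found: {path}"]
    missing = missing_columns(header)
    return [f"missing required columns: {missing}"] if missing else []
```

If the returned list is non-empty, the flow would stop and raise the error instead of proceeding to transformations.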
Please share your suggestions for implementing or drawing the above flows with good practices.
Thanks in advance,
-Balas
Re: Data Flows
Hi Balas,
Here are my suggestions:
1 Error Handling
I load data from files into SQL staging tables as the first step, with every column defined as text, e.g. nvarchar(4000). Error handling can then be managed with additional columns on the staging tables, rather than by creating more files. Data management in an RDBMS is always going to be easier and more robust, and you open up powerful querying capabilities to understand the input data.
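To make that concrete, here is a sketch of the staging pattern using SQLite from Python (table and column names are invented; in SQL Server the columns would be nvarchar(4000) rather than TEXT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stg_orders (
        order_id    TEXT,   -- every source column lands as text, so no load-time type failures
        amount      TEXT,
        order_date  TEXT,
        load_errors TEXT    -- extra column: per-row error notes instead of an error file
    )
""")
conn.execute("INSERT INTO stg_orders VALUES (?, ?, ?, ?)",
             ("1000", "25.00", "2012-04-25", None))
conn.execute("INSERT INTO stg_orders VALUES (?, ?, ?, ?)",
             ("1001", "not-a-number", "2012-04-26", "amount is not numeric"))

# Because staging is a table, problem rows can simply be queried:
bad = conn.execute(
    "SELECT order_id FROM stg_orders WHERE load_errors IS NOT NULL ORDER BY order_id"
).fetchall()
```

Everything loads; the error notes travel with the rows and are available to any SQL query.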
I use the "aggressive load" strategy, so there typically isn't any condition that will make me "error" an entire record. Invalid data is substituted on a column-by-column basis with 'Unspecified' / NULL etc. as appropriate, but the remainder of the data in that record will still be loaded.
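A small sketch of that per-column substitution (the column names and fallback values are examples, not a fixed rule):

```python
def clean_row(row):
    """Aggressive load: never reject the record; substitute bad values column by column."""
    out = dict(row)
    try:
        out["amount"] = float(row.get("amount", ""))
    except ValueError:
        out["amount"] = None            # invalid number -> NULL
    if not row.get("country"):
        out["country"] = "Unspecified"  # missing text -> 'Unspecified'
    return out
```

The record always makes it into the warehouse; only the individual unusable values are replaced.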
2 Source Table/File Not available/Validation
This is typically rare, so I don't build anything specific for it. Default error handling is usually fine: the package fails and reports an error, e.g. "column X not found in table Y". I write processing logs to a SQL table for the best management and querying functionality.
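The log-to-a-table idea might look like this (sketched with SQLite; the table layout is one plausible shape, not a standard):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_log (
        run_ts  TEXT,   -- UTC timestamp of the event
        package TEXT,   -- which load package raised it
        status  TEXT,   -- e.g. OK / WARN / ERROR
        message TEXT
    )
""")

def log_event(package, status, message):
    """Append one processing event; later runs can be queried with plain SQL."""
    conn.execute(
        "INSERT INTO etl_log VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), package, status, message),
    )

log_event("load_orders", "ERROR", "column X not found in table Y")
```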
Archival
Per my first point above: if data is loaded into staging tables you can add extra columns to manage when to archive, and use standard ETL tools to move archived data around as required.
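For instance, an extra load-date column plus an archive flag lets a scheduled job mark old batches with one UPDATE (sketched with SQLite; the cutoff rule is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stg (id TEXT, loaded_at TEXT, archived INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO stg (id, loaded_at) VALUES (?, ?)",
    [("1", "2012-01-01"), ("2", "2012-04-01")],
)
# Scheduled archival: flag everything loaded before the cutoff date.
conn.execute("UPDATE stg SET archived = 1 WHERE loaded_at < ?", ("2012-03-01",))
```

Flagged rows can then be moved or purged by the regular ETL tooling.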
Good luck!
Mike
Re: Data Flows
I agree with Mike on the 'aggressive load' approach. In fact, I have never built an ETL system that rejected rows for any reason. The only rejects would be entire batches, due to corruption in the source file.
I've always wondered why so much emphasis is given to the pre-process. I suspect it's a trickle-down from Basel and Sarbanes-Oxley. Unless you have a really poor data source, the DW should accurately reflect the source, for better or worse, and you should not spend an inordinate amount of effort 'validating' what should already be valid. You don't conduct business in the data warehouse; rather, you track what business was conducted. If the systems by which you are conducting business are not correct, why should the DW be?
Data auditing and validation in the DW should be a post-load process that provides feedback to correct the data in the source system. If you reject incorrect data from the source, you cannot use the reporting facilities in the DW to report on and monitor it.
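A post-load audit can be an ordinary query over the loaded facts (sketched with SQLite; the table, columns, and audit rule are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (order_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)",
    [("1", 10.0), ("2", -5.0), ("3", None)],
)
# Post-load audit: report suspect rows instead of rejecting them at load time,
# so the findings can be fed back to correct the source system.
suspect = conn.execute(
    "SELECT order_id FROM fact_sales WHERE amount IS NULL OR amount < 0 "
    "ORDER BY order_id"
).fetchall()
```

Because the bad rows were loaded rather than rejected, the DW's own reporting can track them until the source is fixed.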