Data Cleaning - replacing nulls and addresses

View previous topic View next topic Go down

Data Cleaning - replacing nulls and addresses

Post  jryan on Wed Oct 20, 2010 4:03 pm

Hi,

I need to do some data cleaning as part of a data warehouse project that I'm about to start.

Some of the tasks will include some basic data cleaning (e.g. handling of nulls & replacing with a default value). Then there will be more complex data cleaning required for some address data.

I'll have a staging db which is doing an extract from the source SQL system, plus the main warehouse SQL database. I'll be using SSIS as the ETL tool. Could someone please tell me if the above data cleaning would be best handled - would it be in the staging database, or in the warehouse database?

I'm tempted to say it wont make too much difference, but I'm keen to follow best practice

Thanks!

jryan

Posts : 33
Join date : 2010-09-27

View user profile

Back to top Go down

Re: Data Cleaning - replacing nulls and addresses

Post  ngalemmo on Wed Oct 20, 2010 4:28 pm

I am not a big fan of cleansing data going into a data warehouse. It's a big waste of time and effort because it does nothing to correct the root cause of the problem... bad data in the source.

A data warehouse is a great tool for moving an organization towards better management of the data because it exposes data issues that are otherwise unknown. Proper data cleansing should be the result of a feedback loop from the data warehouse back to the source system. Use the data warehouse to identify and expose data issues, work out a resolution, then apply that resolution to the source system. Once the source system is updated, the corrected data flows back into the data warehouse as part of the normal ETL process.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Data Cleaning - replacing nulls and addresses

Post  jryan on Wed Oct 20, 2010 4:42 pm

Thanks for the quick reply,

I do agree - and I'm planning to implement Kimball style "screens" as per the ETL book, that will audit the data coming in for problems, then write to the error event fact, which can then be presented to the appropriate teams to correct in the source systems.

However, I have a host of DQ issues that I dont want to present to the user. E.g. a date (perhaps customer dob) that is out of range, or a textual attribute that is blank. For these items I want to replace them with default values. Do you think that activity woudl be better done in the Staging db or in the warehouse db?

jryan

Posts : 33
Join date : 2010-09-27

View user profile

Back to top Go down

Re: Data Cleaning - replacing nulls and addresses

Post  BoxesAndLines on Thu Oct 21, 2010 9:48 am

Staging database or ODS? I view a staging database as temporary work area for ETL processes. There is no persistent data stored here. The ODS should not be cleansed as it should reflect the source system data in order to support NRT application reporting.

In an ideal world ngalemmo is correct. Unfortunately, this is not an ideal world. Application teams routinely ignore data quality problems because it is not important to their domain. Show up with a list of invalid addresses and they'll stick the request at the bottom of the stack. I agree with your approach on tracking what was cleansed to provide feedback to the business on DQ problems.

If your invalid data is dimension data, you should associate the fact to a special dimension row to identify these errors. If the data is a null fact, be very cautious to replace with a zero. Zero is a valid value for many facts. Zero is also treated differently than null by database operations.
avatar
BoxesAndLines

Posts : 1212
Join date : 2009-02-03
Location : USA

View user profile

Back to top Go down

Re: Data Cleaning - replacing nulls and addresses

Post  sac587118 on Fri Dec 10, 2010 5:25 am

As per my knowledge Staging area can be used for data cleaning process. Its better to go with staging db.
If you want to present cleaned data to custormer before loading into warehouse then can go for ODS.
Customer can validate ODS data and if updates required on data , have proper update and then load into warehouse.
avatar
sac587118

Posts : 2
Join date : 2010-12-10

View user profile

Back to top Go down

Re: Data Cleaning - replacing nulls and addresses

Post  Sponsored content


Sponsored content


Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum