ETL Optimization Scenario

Post  Mohsin on Thu Apr 02, 2009 5:36 am

Hi All,
Our fact table currently holds millions of rows and will continue to grow; pretty soon our ETL processes will fail, so we need to optimize them now before things get out of hand.

So we need to create a scenario that will allow us to test our ETL processes at larger data volumes.

Can anyone tell me how I can cook up such a scenario?

Mohsin

Posts : 4
Join date : 2009-03-03

Re: ETL Optimization Scenario

Post  BoxesAndLines on Thu Apr 02, 2009 1:59 pm

I'm not even sure where to begin on this. My recommendation is to hire a senior ETL architect. The depth of your problems exceeds the level of detail that can be addressed in an online forum.
BoxesAndLines

Posts : 1212
Join date : 2009-02-03
Location : USA

Try replicating data with shifted dates

Post  tim_huck on Thu Apr 09, 2009 2:10 pm

I agree that if you've implemented an ETL process that won't scale up as needed, you have a bigger problem than just testing.

But testing with a data set larger than what the source application currently has is not too hard. There are usually one (or at most a few) key dates that can be manipulated to send the same set of fact data through the ETL process multiple times. If your source has, for example, 3 years of data, you can double the data size for testing via a view that does a straight select of the source table UNION ALL with a select that adds 37 months to each important date column and simply selects all the other columns.
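Here's a minimal sketch of that date-shifting view, using SQLite via Python so it runs anywhere (the fact table, column names, and the 37-month offset are all made-up examples, not anything from a real system):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (
    txn_id     INTEGER,
    txn_date   TEXT,     -- ISO date string
    product_id INTEGER,
    amount     REAL
);
INSERT INTO sales_fact VALUES
    (1, '2006-05-10', 100, 19.99),
    (2, '2007-11-02', 101, 45.00),
    (3, '2009-01-15', 100, 12.50);

-- Double the data set: the original rows, plus the same rows
-- with the important date shifted forward 37 months.
CREATE VIEW sales_fact_2x AS
SELECT txn_id, txn_date, product_id, amount
FROM sales_fact
UNION ALL
SELECT txn_id,
       date(txn_date, '+37 months'),
       product_id,
       amount
FROM sales_fact;
""")

rows = conn.execute("SELECT COUNT(*) FROM sales_fact_2x").fetchone()[0]
print(rows)  # 6: twice the source row count
```

Point the ETL extract at the view instead of the base table and nothing else has to change; the shifted copies flow through like new transactions.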

If the older data in the source system is not typical of what new data will look like (not unusual), you can produce pretty good test data via a view that is a UNION ALL of multiple selections of the most recent data, adding a different number of months in each query.
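A sketch of that multi-copy variant, again with made-up table and column names, building one UNION ALL branch per copy with a different month offset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (txn_id INTEGER, txn_date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                 [(1, '2008-06-01', 10.0), (2, '2008-09-15', 20.0)])

# Replicate only the most recent data (here, everything from 2008 on),
# shifting the dates by a different number of months in each copy.
copies = 4
recent = "FROM sales_fact WHERE txn_date >= '2008-01-01'"
selects = [f"SELECT txn_id, txn_date, amount {recent}"]
for i in range(1, copies):
    selects.append(f"SELECT txn_id, date(txn_date, '+{i} months'), amount {recent}")
conn.execute("CREATE VIEW sales_fact_big AS " + "\nUNION ALL\n".join(selects))

total = conn.execute("SELECT COUNT(*) FROM sales_fact_big").fetchone()[0]
print(total)  # 8: 2 recent rows x 4 copies
```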

Depending on the ETL tools and techniques available, you may be able to do the same thing within the ETL process rather than using a view.

Potential gotchas:

1) Unique transaction ID -- there may be an identifier in the data that is checked for uniqueness in the ETL and/or has a unique constraint in the DW. For testing, change the uniqueness definition to include the date (or one of the dates) being shifted to manufacture the test data.

2) Special dates -- the main date used for this is a transaction or effective date that always holds a valid recent value. But if you shift other dates too, check for NULL or special values that need to be preserved rather than added to. Some applications use NULL when the date is missing, unknown, or does not apply; others use special date values that should not be altered (1900/01/01 and 9999/12/31 are popular). Most databases will throw an error if you add anything to 9999/12/31.
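Both gotchas can be handled in the same view: a copy number keeps the manufactured rows unique, and a CASE expression shifts only real dates while leaving NULL and sentinel values untouched. The table, columns, and sentinel value below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (
    txn_id    INTEGER,
    txn_date  TEXT,      -- always a valid recent date
    ship_date TEXT       -- may be NULL or a sentinel value
);
INSERT INTO src VALUES
    (1, '2008-03-01', '2008-03-05'),
    (2, '2008-04-01', NULL),            -- not yet shipped
    (3, '2008-05-01', '9999-12-31');    -- open-ended sentinel

-- Gotcha 1: pair txn_id with a copy number so manufactured
--   rows stay unique.
-- Gotcha 2: shift only real dates; preserve NULL and sentinels.
CREATE VIEW src_2x AS
SELECT txn_id, 0 AS copy_no, txn_date, ship_date FROM src
UNION ALL
SELECT txn_id,
       1 AS copy_no,
       date(txn_date, '+37 months'),
       CASE
           WHEN ship_date IS NULL        THEN NULL
           WHEN ship_date = '9999-12-31' THEN ship_date
           ELSE date(ship_date, '+37 months')
       END
FROM src;
""")

shifted = list(conn.execute(
    "SELECT * FROM src_2x WHERE copy_no = 1 ORDER BY txn_id"))
for row in shifted:
    print(row)
```

Note that the NULL and the 9999/12/31 rows come through the shifted copy unchanged, while the real ship date moves forward with its transaction date.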

This technique enables scale-up testing without needing to generate and store any source data in addition to what already exists. You can test the ETL "pipe" without having extra storage in the DW by taking the data through the process up to but not including the actual final load. That way, when you do add storage for load testing, you'll know that you will actually get to use it.

Regards,
Tim

tim_huck

Posts : 8
Join date : 2009-04-09
Location : Evanston, Illinois
