Incremental Accumulating Snapshot ETL


Post  mellardc on Tue Apr 10, 2012 6:16 pm

Hi all,

This is my first post here. I've searched but can't find an answer to the specific problem I'm dealing with. (I have a partial solution, but it's pretty inelegant, so I wondered if there was an established pattern for this problem already.)

I'm building an accumulating snapshot fact table as I need to track an order throughout its lifecycle. A further constraint is that this has to be near real time, which mandates incremental loads rather than full refreshes (this constraint gives rise to my real problem, as you'll see in a minute). The fact table sources its data from multiple transactional tables that form a 1:M relationship, but for simplicity's sake let's say there are two source tables - OrderHead and OrderLines. Let's also assume both tables contain meaningful facts. The requirement is to report on the facts as they appear - and an order header fact can precede an order line fact.

So the ETL has to perform a left outer join across these two tables so that we pick up OrderHead facts before any OrderLines are posted. Because these facts are related 1:M and share 90% of the same dimensions, I'm attempting to load them into the same fact table. The problem arises when you throw in incremental loading - i.e. how does the ETL realise it has already loaded a row in this left outer join situation? The natural key of the fact table is obviously split across both source tables (for argument's sake, OrderHeaderNumber + OrderLineNumber). Basically, how does the ETL distinguish between an order header that is never going to have order lines associated with it, and an order header that just doesn't have any order lines yet at the time the extract was run? If left to its own devices, checking against the natural key of OrderHeaderNumber + OrderLineNumber, you initially end up with OrderHeaderNumber = 12345, OrderLineNumber = NULL/-1 (or whatever you want to use to signify absence); then at a later date you end up with OrderHeaderNumber = 12345, OrderLineNumber = 1 and OrderHeaderNumber = 12345, OrderLineNumber = 2, etc. So you have three fact records when you should really only have two.
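To make the duplicate concrete, here's a tiny sketch of the naive incremental load, with plain Python dicts standing in for the fact table and the SQL left outer join (the table and column names are illustrative, not from any real system):

```python
fact = {}  # keyed on (OrderHeaderNumber, OrderLineNumber); -1 = "no line yet"

def incremental_load(header_rows, line_rows):
    """Naive left-outer-join load: header rows with no matching
    lines get a placeholder line number of -1."""
    lines_by_header = {}
    for h, l, amt in line_rows:
        lines_by_header.setdefault(h, []).append((l, amt))
    for h, header_amt in header_rows:
        if h in lines_by_header:
            for l, amt in lines_by_header[h]:
                fact[(h, l)] = {"header_amt": header_amt, "line_amt": amt}
        else:
            fact[(h, -1)] = {"header_amt": header_amt, "line_amt": None}

# Extract 1: the header exists but no lines have been posted yet.
incremental_load([(12345, 100.0)], [])
# Extract 2: two lines arrive - but nothing removes the placeholder row.
incremental_load([(12345, 100.0)], [(12345, 1, 60.0), (12345, 2, 40.0)])

print(sorted(fact))  # three rows where only two are wanted
```

The placeholder row (12345, -1) survives alongside the two real line rows, which is exactly the three-records-instead-of-two problem above.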

Solutions I've proposed to myself thus far are,

1) Separate these out into their own accumulating fact tables. The downside is I end up with one fact table per source transactional table (around 6 tables in this particular problem) - maintenance is a pain, and multi-fact-table queries are going to slow down my BI reporting. This solution removes the left outer join problem by splitting the OrderHead and OrderLine natural keys into two smaller fact table buckets: one fact table naturally keyed on OrderHeaderNumber, the other keyed on OrderHeaderNumber + OrderLineNumber.

2) Perform a cleanse of the incremental data in the single fact table prior to loading. This is based on the observation that a full fact table refresh does not suffer the same problem - there, we don't care about identifying what has already loaded versus what hasn't. Rather than do a full refresh, we simply remove all rows in the fact table that we are attempting to incrementally update. This would have to be keyed on the OrderHead natural key (i.e. the first fact metrics to appear in the fact table, and the base table from which we left outer join). It obviously means we're removing a lot of records that we may put back exactly as they were - but it stops us loading duplicates caused by the left outer join. I could make this pre-removal of fact table records reasonably efficient, but it means I don't have a really good audit of what has happened to that row (from an ETL point of view). [I also wondered whether this is even an accumulating snapshot, as the record is essentially being ripped out of the DW and replaced with potentially several others - that's physical, however; logically we're still tracking an order throughout its lifecycle.]

3) Given this is near-real-time reporting - determine how far back records are really required (hopefully a matter of days), and then hope the count of real-time rows is low enough that I can perform a full refresh of the fact table. The analytical side of orders can then reside in a different fact table that undergoes a full refresh every night and contains all orders for all time.
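Option 2 - delete-then-reload keyed on the header - could be sketched roughly as follows, again with Python dicts standing in for the fact table and the SQL; all names are illustrative:

```python
def cleanse_then_load(fact, header_rows, line_rows):
    """Remove every existing fact row for the headers in this batch,
    then re-apply the left outer join, so a placeholder row (line = -1)
    can never survive alongside real line rows."""
    batch = {h for h, _ in header_rows}
    for key in [k for k in fact if k[0] in batch]:
        del fact[key]
    lines_by_header = {}
    for h, l, amt in line_rows:
        lines_by_header.setdefault(h, []).append((l, amt))
    for h, header_amt in header_rows:
        # Headers with no lines yet get the -1 placeholder, as before.
        for l, amt in lines_by_header.get(h, [(-1, None)]):
            fact[(h, l)] = {"header_amt": header_amt, "line_amt": amt}

fact = {}
# Extract 1: header only - a placeholder row is written.
cleanse_then_load(fact, [(12345, 100.0)], [])
# Extract 2: the header reappears with two lines; its old rows
# (including the placeholder) are removed before reloading.
cleanse_then_load(fact, [(12345, 100.0)],
                  [(12345, 1, 60.0), (12345, 2, 40.0)])
print(sorted(fact))  # [(12345, 1), (12345, 2)] - only the two wanted rows
```

In a real warehouse the delete-then-insert would typically be a single DELETE keyed on the batch's header numbers followed by the insert, or a vendor MERGE/upsert; the dict version just shows why keying the cleanse on the header makes the placeholder problem disappear.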

I'm sure the same problem arises when you have one source table mapping to one fact table, but with potentially NULLable elements in the natural key that are updated as the object 'matures' - bad OLTP design, I'm sure, but theoretically it could happen.

I'm sure there's something obvious I've missed but I can't see the wood for the trees at the minute - help!

Chris


Re: Incremental Accumulating Snapshot ETL

Post  ngalemmo on Tue Apr 10, 2012 6:49 pm

First off, it is never a good idea to mix grains in a fact table. If there are truly unique measures in a header that cannot be stated at the line level, then you are dealing with two fact tables. This should also simplify your problem as you can deal with header and line changes independently.

Even if you do put everything in one table and use a bogus line number (such as -1) for the header measures (not recommended), updates of header measures and line measures are still independent issues and should be addressed separately.
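A minimal sketch of the two-fact-table approach being described: each table's natural key lives entirely within its own source, so each incremental load becomes a plain upsert with no left outer join to second-guess (Python dicts again stand in for the fact tables; names are illustrative):

```python
header_fact = {}  # natural key: OrderHeaderNumber
line_fact = {}    # natural key: (OrderHeaderNumber, OrderLineNumber)

def load_headers(rows):
    """Upsert header facts; no knowledge of lines required."""
    for h, amt in rows:
        header_fact[h] = {"header_amt": amt}  # insert-or-update

def load_lines(rows):
    """Upsert line facts; no knowledge of headers required."""
    for h, l, amt in rows:
        line_fact[(h, l)] = {"line_amt": amt}  # insert-or-update

# Extract 1 delivers a header; extract 2 delivers its lines.
# The two loads never need to coordinate, and no placeholder
# rows are ever created.
load_headers([(12345, 100.0)])
load_lines([(12345, 1, 60.0), (12345, 2, 40.0)])
print(len(header_fact), len(line_fact))  # 1 2
```

In SQL terms each load is just a MERGE (or delete/insert) against its own table's natural key, which is what makes the header-only-versus-header-with-lines ambiguity vanish.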

Re: Incremental Accumulating Snapshot ETL

Post  mellardc on Tue Apr 10, 2012 7:00 pm

Told you I couldn't see the wood for the trees - I'd never even thought of this as mixing the grain. I went straight into the ETL and only did a very quick dimensional model in my head... bad, bad, bad, I know. Pretty obvious when you stand back and look at it!

I suspect it might not be as bad as one fact table per source table - the real 'fact' they're trying to capture, for the most part, is the existence of a record and its associated dimensions. If I can persuade them that there is no significant information they would want to glean from pure header records, I can probably halve the number of fact tables here.

Thanks for the reply - my eyes are now open


