Part of fact information arives later

View previous topic View next topic Go down

Part of fact information arives later

Post  hennie7863 on Mon Jan 11, 2010 3:55 pm

For a customer i have the following challenge: When fact data arrives it will be loaded into the fact table. But it's possible that some data of a specific fact is not yet available but it will arrive later (late arriving (PART OF) fact). I was thinking how to ETL this. May be someone could help me out. I can think of two options: dirty update of datawarehouse tables(hmmm) or if i know that certain recordinformation comes later, keep track of them. When the information is arrived load the complete data in the datawarehouse(hmmm). I not very convinced of one of both possible solutions...

Greetz,
Hennie

hennie7863

Posts : 31
Join date : 2009-10-19

View user profile

Back to top Go down

Late-arriving fact elements

Post  tim_huck on Tue Jan 12, 2010 5:20 pm

The best way to handle this depends on the volume of late-arriving data and its time span; the details for implementation depend on whether your users need to report from a "when the system knew it" vs. "when it really happened" perspective.

If the volume is relatively small and the time span is broad (i.e. more data about a small per cent of fact rows arrives late, from days to weeks or months or even years after the event), then I would consider updating the fact rows. Make sure your DBA is OK with this, as some DW databases are set up assuming no updates.

If the time span is short, such that facts that arrive more than a month or two after the fact are not important to users, one technique I've used successfully is to refresh the last month or last n months of data every day; this works in cases where users are used to the way data dribbles in over time and expect the recent numbers to be volatile. Some DBMS have features that make it practical to refresh part of the data for even very large fact tables (Oracle table partitioning for example). For this to work, you need to get sign-off from users that fact data that arrives more than some predetermined time ofter the fact won't be put into the fact table (at least not daily -- if the table is not huge you could set up monthly or weekly rebuilds to pick up facts that were too old for the daily process).

In some cases, you can take advantage of the SUM( ) . . . GROUP BY nature of reporting tools and add new rows that will enhance or correct the facts already there. These "artificial transactions" can mathematically reverse the fact numbers in the original transaction and sum to corrected values. The key thing here is ensure that the reversing row has exactly the same dimension keys as the original, so that it will be included in the SUM with the original in any possible GROUP BY; also required is that no-one is accessing the fact table directly (no end users with SQL query tools).

With any of these, I advise adding an "updated" date dimension, even if report users say they only want the most current data and don't care when it got added to or corrected. Debugging ETL problems without it will not be fun, and users can change the requirements and decide they do want to see "when we knew it" in addition to "when it happened".

With the "artificial transaction" technique, you get the ability to see both "what we knew then" as well as "what we know now".

An example of the "artificial transaction" technique: We have a customer shipments fact table that tracks customer, product, quantity shipped, quantity returned, date shipped, and estimated delivery date. For most shipments, we don't know the returns when we first pull their data. We can set return quantity to zero (true as far was we know it now). Our return policy allows the customer 30 days; most of them we know about within 3 to 4 business days of the delivery.

In a real fact table would be unit of measure data, plus probably shipper data and some other things such as a "return reason" dimension key; I'm only including enough to demo the technique.

here are the rows for a shipment of 2 products to a customer that returns one of them, where we learn about the returns 3 day later (date dimension keys in the form yyyymmdd so you can see what they mean). In real life, there would be some sort of "record type" dimension so that you can identify original, reversing, and corrected ones at least for ETL support; whether the end users can see them depends on the reporting requirements.

Customer_key product_key ship_dt_key est_delivery_dt_key qty_shipped qty_returned updated_dt_key
200103 489213 20100105 20100107 1 0 20100106
200103 123789 20100105 20100107 10 0 20100106
200103 489213 20100105 20100107 -1 0 20100110
200103 489213 20100105 20100107 1 1 20100110

--the 3rd row reverses the reverses the original for product 489213, the 4th row is the new corrected data as of Jan 10.

SUM . .. GROUP BY reports that don't reference "updated date" will show 1 shipped and 1 returned, whether they group by product, by customer, or both. If someone wants to know why the sales performance being reported today is not as good as we thought from reports run on Jan 8, filtering on updated date < Jan 8 shows that we were reporting 0 returns then. Including the updated date in query results will show the gory details of updates over time.

Hope this helps.

tim_huck

Posts : 8
Join date : 2009-04-09
Location : Evanston, Illinois

View user profile

Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum