Periodic snapshots - what is better?

View previous topic View next topic Go down

Periodic snapshots - what is better?

Post  kmorasoabi on Thu Mar 19, 2009 12:13 pm

Apologize in advance for a lengthy but general query I hope there is an expert who has done this many times that can help me.

We need to build a periodic snapshot table in our data warehouse. This is an example of a bank that needs to store account balances for every account for each of its millions of customers. On a daily basis, only about 10% of the accounts have any activity (transactions) on them, for the rest, the balance remains unchanged. In the data feed from the source system, we will only get updated balances for accounts that have had any activity on them. The challenge is to get the account balance for every account for any given day. Data volumes are very large - about 800 million records for the initial feed, 80 million records for the daily feeds of updated balances. Once a month, we will need to truncate and reload the fact table with a fresh "initial" feed.

Tables:
FACT TABLE:
Code:
FACT_ACCOUNT_BALANCE (DATE_KEY, ACCT_KEY, CUST_KEY, ACCOUNT_BALANCE)
DIMENSION TABLES:
Code:
DIM_TIME (DATE_KEY, )
DIM_ACCOUNT (ACCT_KEY, ACCOUNT_NUM,)
DIM_CUSTOMER (CUST_KEY, CUSTOMER_NAME, )

Question: What is the best way to store the data in the fact table?

1) Should we record rows for an account for days in which there are no transactions?

Data for FACT_ACCOUNT_BALANCE:

Code:
CUS_KEY   ACCT_KEY   DATE_KEY   ACCOUNT_BALANCE
777      123      20090301   2000.00
777      123      20090302   1980.00
777      123      20090303   1980.00 -- no transactions
777      123      20090304   1980.00 -- no transactions
777      123      20090305   1927.48
777      123      20090306   1927.48 -- no transactions
777      123      20090308   2000.41
... ... so on
What is the best way to implement this logic where a balance is carried forward if there are no changes? Is this very costly in terms of data storage?

2) Should we record rows only for changed balances?

Data for FACT_ACCOUNT_BALANCE:

Code:
CUS_KEY   ACCT_KEY   DATE_KEY   ACCOUNT_BALANCE
777      123      20090301   2000.00
777      123      20090302   1980.00
777      123      20090305   1927.48
777      123      20090308   2000.41
... ... so on.

What is the best way to query for a balance on a particular day e.g. What was the balance on 2009-03-04 for this account?

kmorasoabi

Posts : 3
Join date : 2009-03-19

View user profile

Back to top Go down

Re: Periodic snapshots - what is better?

Post  BoxesAndLines on Thu Mar 19, 2009 7:30 pm

1. Yes. It's a snapshot. It should contain a row for each and every account. As far as storage, only carry a weeks worth of daily snapshots. After that, carry the end of the month to limit the amount of data required.

2. Don't do that. There's no good answer for your question.

800 million is quite large. What is your database? What is your partitioning strategy? What's your ETL update strategy? Are you planning on updating a 800M row table?
avatar
BoxesAndLines

Posts : 1212
Join date : 2009-02-03
Location : USA

View user profile

Back to top Go down

Re: Periodic snapshots - what is better?

Post  kmorasoabi on Thu Mar 19, 2009 8:38 pm

BoxesAndLines wrote:
2. Don't do that. There's no good answer for your question.

800 million is quite large. What is your database? What is your partitioning strategy? What's your ETL update strategy? Are you planning on updating a 800M row table?

Oracle database, daily partitions, no updates to the fact table - insert new balances only. The reason this option is in consideration is the large amount of data and the occurrence of a large number of zero balance accounts. Does it make sense to recreate a zero balance row every time and take it forward in time (the account is not closed)?

kmorasoabi

Posts : 3
Join date : 2009-03-19

View user profile

Back to top Go down

Re: Periodic snapshots - what is better?

Post  BoxesAndLines on Thu Mar 19, 2009 11:24 pm

kmorasoabi wrote:
BoxesAndLines wrote:
2. Don't do that. There's no good answer for your question.

800 million is quite large. What is your database? What is your partitioning strategy? What's your ETL update strategy? Are you planning on updating a 800M row table?

Oracle database, daily partitions, no updates to the fact table - insert new balances only. The reason this option is in consideration is the large amount of data and the occurrence of a large number of zero balance accounts. Does it make sense to recreate a zero balance row every time and take it forward in time (the account is not closed)?

That, my friend, is a question for the business. But, if you want your metrics additive, you'll need to include all accounts. You are correctly concerned though, 800M rows on Oracle is way too much. I'm not sure it will work. Forget about joining to any dimension of significant size. I think I'm all out of magic model elixir today.
avatar
BoxesAndLines

Posts : 1212
Join date : 2009-02-03
Location : USA

View user profile

Back to top Go down

Re: Periodic snapshots - what is better?

Post  Michael_K on Mon Mar 23, 2009 8:37 am

I need to threadjack here because this question is very similar to mine (about healthcare) in making a mental leap to fully understanding snapshots. It fits in with exactly why you're asking about your first scenario, so I'm not jacking it too much...

In kmorasoabi's first situation he's storing a line for the current balance for every day. So, 3/3/09 and 3/4/09 have the same balance.

My question is how is this then additive? I think that kmorasoabi understands that, but then boxesandlines mentions that it's additive and I'm fuzzy on how. Each day is a snapshot of the account and represents the story of point in time. If the customer had more than one account, then each should be represented on each day so you can get a total for the customer, but it is not additive over the time dimension, right?

Or am I totally missing something?

Michael_K

Posts : 7
Join date : 2009-03-14

View user profile

Back to top Go down

Re: Periodic snapshots - what is better?

Post  BoxesAndLines on Mon Mar 23, 2009 8:52 am

It's additive within the snapshot. It also could be a factless fact table in that I only include active accounts. That's how I count "billable lines in service". I only include active accounts in the fact table. When you start comparing across snapshots, you are doing trend analysis. If the OP wanted to throw out zero balance accounts, then his fact table would be active accounts with a balance > 0. He would then be forced to get total accounts from the account dimension.
avatar
BoxesAndLines

Posts : 1212
Join date : 2009-02-03
Location : USA

View user profile

Back to top Go down

Re: Periodic snapshots - what is better?

Post  Michael_K on Mon Mar 23, 2009 3:23 pm

Excellent. That puts me on the right track. Thank you!

Michael_K

Posts : 7
Join date : 2009-03-14

View user profile

Back to top Go down

Re: Periodic snapshots - what is better?

Post  VHF on Wed Apr 29, 2009 11:55 am

I appologize for the late response, and I'm certainly not an expert, but I did recently attend the Kimball University "Dimensional Modeling in Depth" course. Ralph Kimball presesented a pattern called "Instantaneous Balances" that seems applicable to the OP:

Instead of a periodic snapshot, one can maintain a transactional fact table and -- with the addition of a pair of date fields -- be able to retrive account balances on any given day. You could think of it as a "Type 2 Fact Table". A transaction fact record (with an account balance) represents the state of an account until another transaction occurs for that account. The trick is being able to retrieve the account balance for any given day, including those where they may have been no transaction. Two additional date fields are required, a begin date (when the transaction occurs, or becomes effective) and an end date (which is set when the next transaction occurs for that account.) These are actual database datetime fields, not surrogate keys into the date dimension.

Here's the OP's fact table with the two additional fields:

FACT_ACCOUNT_BALANCE (DATE_KEY, ACCT_KEY, CUST_KEY, ACCOUNT_BALANCE, BEGIN_EFF_DATETIME, END_EFF_DATETIME)

When the first transaction record for an account is inserted into the fact table, the BEGIN_EFF_DATETIME is set to the date of the transaction (can include time component if there can be more than one transaction per day for an account.) The END_EFF_DATETIME is set to a "far distant future" high value (for example, 12/31/2999). It is import to use a future date, not a null or a date in the past.

When it is time to insert a subsequent transaction for that account into the fact table, the previous transaction fact record is updated to set its END_EFF_DATETIME to the effective date of the new transaction. The new transaction record is then inserted with its BEGIN_EFF_DATETIME value set to its effective date and its END_EFF_DATETIME set to the high value. The ending effective timestamp of one transaction must always match the beginnnig effective timestamp of the next transaction for that account--no gaps, no overlap.

What this pair of effective dates gets us is the ability to query balances as of any date and time (time can be omitted if you don't care about time of day):

SELECT SUM(ACCOUNT_BALANCE)
FROM FACT_ACCOUNT_BALANCE
WHERE #April 15, 2009 12:31:00# >= BEGIN_EFF_DATETIME
AND #April 15, 2009 12:31:00# < END_EFF_DATETIME


Note that the first condition is always >= and the second is <. (We cannot use the BETWEEN keyword for this query.)

In cases where where a large percentage of accounts have no activity on a given day, the "Instantaneous Balances" approach offers a reduction in storage compared to a periodic snapshot. On the other hand, if the majority of accounts frequently have many transactions throughout the day the storage requipment would be greater.


Last edited by VHF on Wed Apr 29, 2009 12:01 pm; edited 2 times in total (Reason for editing : improve grammer)

VHF

Posts : 236
Join date : 2009-04-28
Location : Wisconsin, US

View user profile

Back to top Go down

Re: Periodic snapshots - what is better?

Post  Sponsored content


Sponsored content


Back to top Go down

View previous topic View next topic Back to top


 
Permissions in this forum:
You cannot reply to topics in this forum