Design Tip #149

View previous topic View next topic Go down

Design Tip #149

Post  RichardAtkins on Wed Oct 03, 2012 8:59 am

Hello,

I am a bit confused by the solution described in design tip 149. I hope someone can point me to how this would work.

The design tip is regarding a rebuild of a dimension and how to synch the associated fact(s) with the rebuilt dimension. The tip uses an example introducing additional history to the dimension therefore creating more records - assuming the data could be sourced to achive that. In doing this the dimension PK (customer key in this example) is redefined.

From just over half-way down the text I quote the following;
"write a query that joins the old fact table to the new dimension table on the transaction system account number and transaction date, picking up the new surrogate key."

The part that confuses me is the join, the concept of loading the fact into a new table makes perfect sense, however I assumed the problem would be how to know which dimension record to use for the fact... since none of the fact attributes exist in the dimension and the PK in the dim no longer exists.

I would have thought the solution would be to load the new fact table by joining the old fact to old dimension using the ordinary relationship and then join from the old Dim to the new Dim on the columns that are unique (except for those that were altered and the PK itself) - hoping for a business key plus date for example. The solution given in the design tip joins the old fact to the new dim on account number and transaction date - surely the transaction date would not exist in the customer Dim - why would it?

RichardAtkins

Posts : 2
Join date : 2012-10-03

View user profile

Back to top Go down

Re: Design Tip #149

Post  Joy on Wed Oct 03, 2012 1:07 pm

Excellent question, Richard. Our design tips are designed to be short, so I couldn't devote a lot of space to this part of the issue. This forum is a great place to do that.

Your question, as I understand it, is "How do I find the new surrogate key for each of my old fact rows?"

Let's be more explicit about what our tables look like:

MyFactTable
transaction_date_key
product_key
customer_key
account_number (discussion point #1 below)
facts...

DimCustomer (old)
customer_key (pk)
account_number (business or transaction system key)
credit_score
many other attributes...
row_start_date
row_end_date
row_current_indicator


DimCustomerNew
new_customer_key (pk)
account_number (business or transaction system key)
original_credit_score
credit_score_band (bad, medium, good, great)
many other attributes...
row_start_date_key (discussion point #2 below)
row_end_date_key
row_current_indicator


Clearly, if we did have account_number in our fact table, this would be logically easy to formulate:

SELECT f.date_key
, f.product_key
, n.new_customer_key AS customer_key
, f.account_number
, f.facts...
FROM MyFactTable f
INNER JOIN DimCustomerNew n
ON (f.account_number = n.account_number AND f.transaction_date_key BETWEEN n.row_start_date_key AND n.row_end_date_key)

Now, the BETWEEN logic is going to create a performance hit for sure. But remember this is a one-time thing while we're restating the fact table, not something that every analytic query needs to do.

If I'm really worried about that query performance, I may build a table in my staging area that looks like:

CustKeyStaging
date_key (pk)
account_number (pk)
new_customer_key
... and NO OTHER COLUMNS

This puppy would have many more rows in it than the dimension table: one row for each date for each account. It would exist only during the conversion project, to return the correct new_customer_key via a simple join without having to use the BETWEEN syntax. I'd think about doing this if I had a ton of fact tables and/or really large data volumes. I'd test it in each situation to ensure it really would make the process go faster; clearly it depends on how much history you have.

Discussion point #1: What if I don't have the account_number in the fact table?
The classic design for a fact table has it consist ONLY of the foreign keys to dimension tables + degenerate dimensions such as order number + fact columns (usually numeric). Classically we don't include the dimension table business keys in the fact table. I put it in there primarily for the purpose of starting with a simpler problem.

If we're creating a new design, I'm inclined to put the business key (or an unchanging DW-managed surrogate) into the fact table in addition to the regular dimension surrogate key. I actually do this fairly often for my core Type-2 dimensions. You have to admit that carrying around an additional integer field certainly would pay off for future restatements of the fact table keys, at a very modest cost. I'd probablay hide it from the business users, by the way.

But of course what if you don't have it? Well, you could always get it by joining to the old dimension table on the old surrogate key, pick up the account number, and use that to join to the new dimension table using the same BETWEEN logic described previously. So the query is a 3-way join rather than a 2-way join. The performance of this query should still be fine.

Discussion point #2: row_start_date_key and row_end_date_key in the dimension tables.
I usually don't make the row start date and row end date in the dimension tables be surrogate keys. Instead I usually leave them as date datatypes. (Why? because I'm old and I've always done it that way.) If you do the same, you can solve the problem in one of several ways:
- Convert the fact table transaction_date_key to a date, by joining to the date dimension (or decoding a smart date_key)
- Convert the dimension start and end dates to keys, by joining 2x to the date dimension (or decoding a smart date_key)
If I were really worried about performance, I'd probably choose the second option and instantiate it in the staging area.


Last edited by Joy on Wed Oct 03, 2012 1:09 pm; edited 1 time in total (Reason for editing : clarification)
avatar
Joy

Posts : 20
Join date : 2009-02-03
Location : Kimball Group

View user profile http://www.kimballgroup.com

Back to top Go down

Thank you for the explanation

Post  RichardAtkins on Thu Oct 04, 2012 4:30 am

Joy - all of that makes sense. I agree with the need to address performance, however as you say this is a one-time activity. I myself prefer the method of joining to the old dim and then to the new one for safety - especially if there is a risk the fact attributes to be used as a join may not have been kept in synch with a type-1 dim attribute.

RichardAtkins

Posts : 2
Join date : 2012-10-03

View user profile

Back to top Go down

Re: Design Tip #149

Post  Sponsored content


Sponsored content


Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum