SCD 2 scenario

View previous topic View next topic Go down

SCD 2 scenario

Post  neo.helios on Tue Aug 09, 2011 11:13 am

This could be a typical scenario in DWH, but I would like to know what is the best way to handle that.

I have a dimension with SCD Type 2 - Using Effective and Expiry Dates. This dimension key is referred in a fact table.

When there is any change in the SCD attributes of the dimension in source, a new record will be inserted in the dimension with a new dimension key generated. Now the data that comes henceforth in the fact will have the new dimension key as the reference.

When I try to join like

Select x,y,z from fact, dimension
where fact.dimension_key = dimension.dimension_key
and dimension.expiry_date is null

then I'll get only the data associated with the new dimension key. And if I do not use the condition "dimension.expiry_date is null" then I'll get duplicate data. What is the best approach to overcome this

Thanks in Advance..........

neo.helios

Posts : 11
Join date : 2010-11-02

View user profile

Back to top Go down

Re: SCD 2 scenario

Post  ngalemmo on Tue Aug 09, 2011 11:58 am

How would you get duplicate data? Are you reinserting fact rows with the new dimension key?

Using NULL as an expiration date is a bad idea. It makes queries more complicated than they need to be. Common practice is to use a distant future date, such as 1/1/3000. This way you can use BETWEEN to locate a particular row based on date.

When you join a fact to a type 2 dimension, you join on the key only. If you need a particular version of the dimension row based on date, you perform a self join on the dimension using the natural key and a predicate on the date.

Joining from a fact to a dimension on the key should never give you duplicate data unless you are doing something wrong in the load.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: SCD 2 scenario

Post  neo.helios on Tue Aug 09, 2011 12:25 pm

Thanks for the reply. I understand the way to join the fact and dimension to some extent. Would be great if you can detail it out.

I'll try to ask my question with a scenario :
I have a record in the customer dimension for a customer by name "John Werner". The structure of the dimension would be
Customer_Key Customer_Name Customer_City Effective_DateExpiry_Date
10 John Werner NJ 01-Jan-2011 30-Jun-2011
20 John Werner CA 30-Jun-2011 Null
The new dimension key 20 is inserted when the customer city changed

I have fact table with the structure
Customer Key Date Product Amount
10 15-Jan-2011 XYZ 100
10 12-Mar-2011 ABC500
20 11-Jul-2011 XYZ 300
20 01-Aug-2011 ABC200


If I have to get the sum of amount by customer then If I write a query

Select dim.customer_name, fact.Product, sum(fact.amount)
from dimension dim, fact fact
where dim.customer_key = fact. customer_key
and dim.expiry_date is null

then for John Werner I'll get the data like

John Werner XYZ 100
John Werner XYZ 300
John Werner ABC 500
John Werner ABC 200

This is what I was referring to when I said duplicate data. I might be missing something here... Can you please explain

Regards

neo.helios

Posts : 11
Join date : 2010-11-02

View user profile

Back to top Go down

Re: SCD 2 scenario

Post  ngalemmo on Tue Aug 09, 2011 12:50 pm

If your query is

Select dim.customer_name, fact.Product, sum(fact.amount)
from dimension dim, fact fact
where dim.customer_key = fact. customer_key
and dim.expiry_date is null

you will only get back

John Werner XYZ 300
John Werner ABC 200


If you query

Select dim.customer_name, fact.Product, sum(fact.amount)
from dimension dim, fact fact
where dim.customer_key = fact. customer_key

John Werner XYZ 100
John Werner XYZ 300
John Werner ABC 500
John Werner ABC 200

But what is duplicated??? Other than the fact that your query is missing a GROUP BY clause (I don't see how you could actually run the query as is, every DB I know would give you a syntax error), the measures are correct.

With a group by the result should be:

John Werner XYZ 400
John Werner ABC 700
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: SCD 2 scenario

Post  hang on Tue Aug 09, 2011 6:37 pm

I guess I see where your problem is and you may want to see this:

John Werner 1100

The key is, you always should group by the natural key, Customer_ID if you have one, or whatever determines the unique customer in the source system, similar to this:

Select dim.customer_name, sum(fact.amount)
from dimension dim, fact fact
where dim.customer_key = fact. customer_key
Group by customer_name -- suppose Customer_name is the NK.

However, if you want to see sales by customer and product, you then need to put back product in the select and group by clauses, so you would end up with only two rows as ngalemmo suggested.


Last edited by hang on Tue Aug 09, 2011 8:52 pm; edited 1 time in total

hang

Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia

View user profile

Back to top Go down

Re: SCD 2 scenario

Post  LAndrews on Tue Aug 09, 2011 7:00 pm

I'm pretty sure the where clause "and dim.expiry_date is null" will result in

John Werner 500

As ngallemo mentioned, the effective dates on the dimension are primarily used for ETL purposes (choosing the correct SK), not when generating queries.


LAndrews

Posts : 132
Join date : 2010-05-13
Location : British Columbia, Canada

View user profile

Back to top Go down

Re: SCD 2 scenario

Post  Sponsored content


Sponsored content


Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum