De-normalizing Customer Information to create a Customer Dimension

View previous topic View next topic Go down

De-normalizing Customer Information to create a Customer Dimension

Post  mea0730 on Sat May 15, 2010 4:11 am

I have a question regarding de-normalizing customer data to create a single customer dimension!

Let's say our OLTP customer information is made up of 2 tables.

1) Customer Table (customer name, customer type, etc...)
2) Address Table (address1, address2, city state, zip, etc...) with multiple addresses per customer.

Obviously I would have repeating customer data for each address.

But how would the natural key work? Wouldn't I need to store 2 natural keys in my customer dimension (CUST_ID, ADDR_ID) to uniquely identify the customer record in order to "pull" the surrogate key at ETL time? (or if I'm using a lookup table in this case it would need 3 keys - "surrogate_key, cust_id, addr_id")

When multiple tables are de-normalized into a single dimension, you would have to include the primary keys from each table as natural keys in the de-normalized dimension, right?

Thanks,
Mike

mea0730

Posts : 21
Join date : 2010-05-15
Location : San Jose, CA

View user profile

Back to top Go down

Re: De-normalizing Customer Information to create a Customer Dimension

Post  BoxesAndLines on Sat May 15, 2010 10:35 am

I actually prefer to keep an address dimension separate from the entities that use address information. If you combine the two, you actually have a many to many relationship. A customer can have many addresses and An address can be used by many customers. You have to work out in your single customer dimension how many addresses you will carry. If you go the single dimension route, you don't necessarily need the addr_id on the customer dimension unless your tracking history on address columns. The addr_id in this case will easily identify address changes as opposed to building a checksum on the address columns to identify changes. If you go two dimensions then you will need the addr_id. Personally, in either case I would keep it.
avatar
BoxesAndLines

Posts : 1212
Join date : 2009-02-03
Location : USA

View user profile

Back to top Go down

Re: De-normalizing Customer Information to create a Customer Dimension

Post  mea0730 on Mon May 17, 2010 1:01 am

That's a good point! So if I keep it as two dimensions, the fact table would have 2 FKs. CUST_KEY and ADDR_KEY? You're not proposing any kind of snowflake design, right?

Maybe there really would not be many situations where you would really need to denormalize multiple tables into one! Might be best to always have seperate dimensions!

Thanks for the response!!

mea0730

Posts : 21
Join date : 2010-05-15
Location : San Jose, CA

View user profile

Back to top Go down

Re: De-normalizing Customer Information to create a Customer Dimension

Post  hang on Mon May 17, 2010 9:31 am

I would denormalise them into one customer table. Very likely the multiple addresses are the fact that customers have moved to different places in the history, than you need to rebuild the historical changes using SCD2 in customer dimension. If one customer could have more than one record in the address table at the same point of time, you should have 3 tables in OLTP system to have a workable many to many relationship.

I think treating the address as a set of attributes in the customer dimension, just like other customer attributes, makes the model much simpler. Keeping address in a standard format is not easy task, and hence making it a dimension may end up with slightly different text for the same address being treated as different addresses.

hang

Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia

View user profile

Back to top Go down

Re: De-normalizing Customer Information to create a Customer Dimension

Post  ngalemmo on Mon May 17, 2010 12:08 pm

Either approach will work, but I am a bit confused over you comment about a natural key.

If your source is an ERP system, they usually maintain addresses in a separate, normalized structure. Such structures have a primary key made up of the customer ID and an address ID. If you combine customer and address in the same dimension, it follows your natural key would be the customer and address ID from the source. This, of course, does raise an issue if you get transactions that only reference a customer rather than a customer and an address.

I tend to break address out, particularly if the company is a wholesaler/manufacturer or a distributor. Address is very important in the event of a recall. If a customer's address changes, it is very important to retain the true address of the shipment at the time it was made so that product can be traced properly. This type of requirement tends to favor a separate address dimension.

If your feed only provides an address and no other source system key, use a hash of the address text (such as a 32 bit CRC) as a non-unique natural key and create an index on the hash, rather than an index on the address itself. When you do a surrogate key lookup, compare on the hash as well as the address text to locate the proper row. This is a good trade-off between a very large unique index (based on the address text) and lookup efficiency. A table with 10's of millions of addresses will only have, at most, 4-5 duplicate hash values (and most of the time only 1 or 2). So the hash narrows down the search to a handful of rows using a very small index.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: De-normalizing Customer Information to create a Customer Dimension

Post  mea0730 on Tue May 18, 2010 1:13 am

Thanks for all the great responses! I'm leaning toward keeping them seperate dimensions for easier type 2 SCD maintenance!

By the way, for dimensions with less than 1 million rows, is it still recommended to use a surrogate lookup table? Wouldn't it be much simpler to lookup the natural key (inner join on natural keys between source staging table and dimension table) in the dimension table to get the surrogate key for loading into the fact table? Is the use of surrogate lookup tables common?

Sorry for all the "newbie" questions!

Thanks,
Mike

mea0730

Posts : 21
Join date : 2010-05-15
Location : San Jose, CA

View user profile

Back to top Go down

Re: De-normalizing Customer Information to create a Customer Dimension

Post  ngalemmo on Tue May 18, 2010 12:46 pm

mikea730 wrote:Thanks for all the great responses! I'm leaning toward keeping them seperate dimensions for easier type 2 SCD maintenance!

By the way, for dimensions with less than 1 million rows, is it still recommended to use a surrogate lookup table? Wouldn't it be much simpler to lookup the natural key (inner join on natural keys between source staging table and dimension table) in the dimension table to get the surrogate key for loading into the fact table? Is the use of surrogate lookup tables common?

Sorry for all the "newbie" questions!

Thanks,
Mike

I've not seen a separate surrogate key lookup table, so I would say they are not common. You usually just do a lookup on the natural key in the dimension table itself. Now, in the case of ETL tools, you can, and usually do, define a lookup object which queries the dimension table to get the natural key/surrogate key pairs and cache them in memory in order to do surrogate key assignments. But these are not separate structures in the database.

In the case of SQL based ETL (i.e. PL/SQL based code), you could create an index that includes the natural key and surrogate primary key, in that order, to essentially create a lookup structure (the index) that contains the information you need to efficiently assign surrogate keys. But it doesn't require a separate table.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: De-normalizing Customer Information to create a Customer Dimension

Post  cridal on Fri May 28, 2010 11:58 am

"use a hash of the address text (such as a 32 bit CRC) as a non-unique natural key and create an index on the hash, rather than an index on the address itself. When you do a surrogate key lookup, compare on the hash as well as the address text to locate the proper row. This is a good trade-off between a very large unique index (based on the address text) and lookup efficiency. A table with 10's of millions of addresses will only have, at most, 4-5 duplicate hash values (and most of the time only 1 or 2). So the hash narrows down the search to a handful of rows using a very small index."

I must say this is an excellent suggestion that combines pragmatism and elegance...

cridal

Posts : 9
Join date : 2009-03-27

View user profile

Back to top Go down

How could and Address DIM be an SCD?

Post  Skipjacker on Mon Apr 30, 2012 9:47 pm

Mike, you said you wanted to keep them separate to make easier type 2 SCD maintenance. I'm not sure that address, by itself, could be SCD. Building don't usually move and if they do get relabeled by the Post Office, you probably won't care about the old address since it doesn't exist anymore. I know the NC-SC boarder just got clarified and some SC addresses are now in NC... but how often does that happen? ;-)
avatar
Skipjacker

Posts : 16
Join date : 2010-11-10
Age : 49
Location : Baltimore, MD

View user profile

Back to top Go down

Re: De-normalizing Customer Information to create a Customer Dimension

Post  Sponsored content


Sponsored content


Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum