A fact table surrounded by a lot of references tables

revdpoel Fri Jun 11, 2010 9:20 am

I am reviewing a design which is about to be implemented after build.

There is a fact table with 17(!!) foreign keys. 14 of these keys are two-letter codes pointing to reference-tables (as we call them). A reference table consist of codes with descriptions of a certain area.

To diminish the foreign-keys I want to introduce some kind of a dimension (let's call it dimA), which contains 14 times a code + description and a surrogate key. This surrogate key will replace the 14 foreign keys in the fact table.

Question 1: Is this a good solution or are there other ways?

I will populate dimA while running. Meaning if a combination of the 14 codes isn't in dimA, I will add the combination and generate a new surrogate key, which will be placed in the fact table.

But what if the source system is adding a new area with new codes, which would mean a new reference table in the old situation. The old facts have to point to the dimA-surrogate key with 14 codes, the new facts to the dimA-surrogate key with 15 codes

Question 2: How would you handle this?

Thanks Ron

BrianJarrett Fri Jun 11, 2010 10:16 am

If you feel that you're really only going to have 14 codes for the foreseeable future and your query usage analysis reveals that only a few codes are used together in a typical query than I'd just add 14 surrogate keys to the fact table and have each of them pointing to distinct code dimensions. You could put each code in its own dimension physically or you could push all 14 codes into the same table and then alias it 14 times against the fact table (what Kimball calls "role playing"). Aliasing is a good solution if you have savvy users or a semantic layer in your reporting tool that can do the aliasing for them.

The reason I'd push them into a dimension is because they have descriptions associated with each code. If they didn't, then the codes are degenerate dimensions and should go right into the fact table.

I wouldn't get too worked up over the 17 surrogate keys; we have a fact table in our warehouse with over 50 surrogate keys. True it's a centipede fact, but our usage patterns on this table result in queries that typically only return one or two codes at a time. Although theoretically a user could join 50 tables in one query they never do, so the exorbitant number of surrogate keys never poses an issue for us. Your use, obviously, could vary.

Pushing these into a single dimension with 14 columns (plus description columns) effectively creates a collection of codes. This can help reduce the number of surrogate keys in your fact table but would have to be altered for each new code, as you mentioned. Personally I'd shy away from this if I could since it's not very dynamic. In my opinion it's easier to alter the fact table and alias a new code dimension to join it to.

There is a way using this method that you could string these codes together into a single field and get around the inefficient code parsing issue but I won't confuse things by going into it unless you want me to. Simple is almost always better when it comes to dimensional modeling, at least in my opinion.

revdpoel Fri Jun 11, 2010 11:29 am

Hai Brian

in all queries the descriptions of all codes are used
role-playing isn't a good idea in my opinion, because the first code mght have a code of 4 positions alphanumeric, the second 2 numeric etc. Very different, also in meaning.

ngalemmo Fri Jun 11, 2010 11:49 am

revdpoel wrote:Hai Brian

in all queries the descriptions of all codes are used
role-playing isn't a good idea in my opinion, because the first code mght have a code of 4 positions alphanumeric, the second 2 numeric etc. Very different, also in meaning.

First off, you should be using surrogate keys, which makes the actual form and value of the codes irrelevent.

You could combine groups of codes together into junk dimensions to reduce the number of foreign keys if you need to. Group codes that have a strong correlation with each other to keep the size of the dimension tables manageable. Do not junk conforming dimensions.

BrianJarrett Fri Jun 11, 2010 12:47 pm

ngalemmo wrote:
revdpoel wrote:Hai Brian

in all queries the descriptions of all codes are used
role-playing isn't a good idea in my opinion, because the first code mght have a code of 4 positions alphanumeric, the second 2 numeric etc. Very different, also in meaning.

First off, you should be using surrogate keys, which makes the actual form and value of the codes irrelevent.

Absolutely. I'd incorrectly assumed this was already being done. With the possible exception of a date dimension all dimensions should be using meaningless surrogate keys.

ngalemmo wrote:You could combine groups of codes together into junk dimensions to reduce the number of foreign keys if you need to. Group codes that have a strong correlation with each other to keep the size of the dimension tables manageable. Do not junk conforming dimensions.

Great idea.

revdpoel Fri Jun 11, 2010 1:31 pm

Junk dimension is what I thougth, they have a strong correlation

But what if the source system is adding a new area with new codes, which would mean a new reference table in the old situation. The old facts have to point to the dimA-surrogate key with 14 codes, the new facts to the dimA-surrogate key with 15 codes

Question 2: How would you handle this?

ngalemmo Fri Jun 11, 2010 2:13 pm

That's the beauty of surrogate keys. You can always extend a dimension without affecting earlier facts. In the case of junk, you obviously would not have a value for the new code with old facts. It will also cause new dimension rows to be created since none of the old rows would have the new code.

You need to take this in consideration when designing these junk dimensions. I would suggest you do some analysis before you throw all 14 codes into one table.

BrianJarrett Fri Jun 11, 2010 3:10 pm

As long as the profile of the codes is the same you could still put all these codes into a single table, they'd just need to be distinct rows. Your table would look something like this:

CODE_DIM_KEY	CODE	CODE_DESCR
-1	none	no descr
1	abc	some code desc
2	aae	some code desc
3	sss	some code desc

Then alias the dimension table as many times as you need for each code in your fact row. Provided the amount of codes won't grow substantially in the future this is simple and easy. Get a new code? Alter your fact table to add the new surrogate key and then set the value for existing records to -1 for that new code. New facts will get the new code dimension surrogate key going forward.

True, you're aliasing a lot but your aliased dimension tables are small so it really shouldn't negatively affect performance, even with a bunch of joins. This is the role-playing I was talking about.

ngalemmo Fri Jun 11, 2010 3:23 pm

Brian,

My take is he has 14 different sets of codes, not 14 codes.

revdpoel Fri Jun 11, 2010 4:14 pm

Yes, you are right
14 different sets of codes

and 3 dimensions, meaning 17 foreign keys
I find that a lot, that's why I want to diminish it by using a junk dimension (somewhere i read the term theme-dimension, which is beautiful cause the codes are correlated)

The only problem is then what happens when in the future an extra set of codes is added. The theme-dimensions gets two new columns (code and description) which are of no use for the old facts

hang Fri Jun 11, 2010 8:13 pm

ngalemmo wrote:That's the beauty of surrogate keys. You can always extend a dimension without affecting earlier facts. In the case of junk, you obviously would not have a value for the new code with old facts. It will also cause new dimension rows to be created since none of the old rows would have the new code

I think ngalemmo just addressed your question. Junk dimension is an elegant and effective approach to avoid creating many small dimensions which is the sign of centipede fact. I know calling it junk is a bit unfair, but since Kimball started it, we really don't have other choice when we talk about it.

Think of the junk dimension as a way of denormalising many dimension tables into one. When you need more attributes, you extend your junk dimension just as you would with other dimensions when you need additional fields. You may need an Unknown value to cater for the past fact records when adding new fields. The point is the model is more resilient to changes by avoid modifying the fact table structure and rebuilding the fact every time you need a new field. It also makes the fact table more normalised which is good for performance.

You don't have to put every field into junk dimension. Some high cardinality fields need to be in either other dimension if related, or even in a standalone dimension. Well what is high cardinality, anything above 10 can make a junk dimension much bigger when combined with other attributes. To me, hundreds of thousands records for a dimension is Ok, millions may be too big for an average dimension, especially for a junk dimension. Very high cardinality field in the same order of magnitude of fact should be classified as degenerate dimension.

BrianJarrett Mon Jun 14, 2010 9:34 am

revdpoel wrote:Yes, you are right
14 different sets of codes

and 3 dimensions, meaning 17 foreign keys
I find that a lot, that's why I want to diminish it by using a junk dimension (somewhere i read the term theme-dimension, which is beautiful cause the codes are correlated)

The only problem is then what happens when in the future an extra set of codes is added. The theme-dimensions gets two new columns (code and description) which are of no use for the old facts

Sorry if I wasn't getting the point across right; I was actually referring to 14 sets of codes, one dimension surrogate key for each set. One point I was making is that most likely all the codes could still live in the same table, provided the profile of the data is the same for all codes, and just aliased 14 times. Otherwise you'd have 14 different code dimension tables; one per surroagate.

This starts to fall apart if many more sets of codes are introduced. It wouldn't make sense to have 100 surrogate keys in the fact table; junk dimension(s) would be a better solution.

Sponsored content

A fact table surrounded by a lot of references tables

A fact table surrounded by a lot of references tables

Re: A fact table surrounded by a lot of references tables

Re: A fact table surrounded by a lot of references tables

Re: A fact table surrounded by a lot of references tables

Re: A fact table surrounded by a lot of references tables

Re: A fact table surrounded by a lot of references tables

Re: A fact table surrounded by a lot of references tables

Re: A fact table surrounded by a lot of references tables

Re: A fact table surrounded by a lot of references tables

Re: A fact table surrounded by a lot of references tables

Re: A fact table surrounded by a lot of references tables

Re: A fact table surrounded by a lot of references tables

Re: A fact table surrounded by a lot of references tables