why having <200K rows in Profile Dim is ideal ?

VTK Wed Jan 04, 2012 6:16 pm

I know we say that less than 200K rows in profile/junk dimension is ideal but can anyone tell me the reason ?
Is it because of the indexes we need to create on this table ? I would imagine that we need to create a composite index on most of the columns in this table as we have to check all the columns to get the dim key ? Are there any other reasons ?

ngalemmo Wed Jan 04, 2012 7:50 pm

I have not heard of that as a rule, but it makes some sense, although the threshold will vary depending on your database system and configuration.

In the second edition of Toolkit, these types of dimensions are referred to as 'mini dimensions'. The issue is that smaller dimensions tend to perform better in star joins (easier to cache, lower PK cardinality, etc...). If you are building a junk dimension and its size is running out of control, it is usually an indication that there are too many uncorrelated or high cardinality attributes in the table. The usual solution is to reevaluate the design and break it down into more mini dimensions based on cardinality and/or correlation between the attributes.

As far as indexing goes, when you have a fairly large natural key, a common technique is to hash the natural key (such as using a 32 bit CRC function) and create a non-unique index on the hash (no index on the natural key itself). The hash would be used in the lookup, along with the natural key. The hash would narrow the search down to one or two rows, then scan the rows for a match on the natural key.

VTK Thu Jan 05, 2012 12:18 am

Code:: a common technique is to hash the natural key (such as using a 32 bit CRC function) and create a non-unique index on the hash (no index on the natural key itself).

Thanks for the reply. I am just wondering why do we need Natural key with CRC value ? Is not the CRC alone is enough ?

ngalemmo Thu Jan 05, 2012 4:07 pm

No. While the hash may generate 4 billion possible values, it is not immune from collisions when used in this manner. As the size of the table increases, there is a rapid increase in the possiblilty of getting the same hash for a different value. So, the hash can only be used as a non-unique index. Also, doing this is only worth the effort if you are dealing with a large table with a very large natural key. With a 32 bit hash , you start seeing collisions on a regular basis when you hit a few hundred thousand rows.

Note that you do not build an index on the natural key. You simply include it as a filter criteria.

VTK Thu Jan 05, 2012 4:46 pm

I am just trying to make sure that I understood you correctly.

I should be writing a query like this to get the Dim_ID.

Select Dim_ID(Surrogate Key) from Profile_Dim1 PD
where
PD.Hash_Value = Incoming_Hash_Value
and PD.NaturalKey1 = Incoming_Naural_Key1
and PD.NaturalKey2 = Incoming_Naural_Key2
and PD.NaturalKey3 = Incoming_Naural_Key3
and PD.NaturalKey4 = Incoming_Naural_Key4
and PD.NaturalKey5 = Incoming_Naural_Key5
etc...

and you are saying we just need to create Non Unique Index on the Hash_Value not on the Keys.

I appreciate your help.

Thanks

ngalemmo Thu Jan 05, 2012 5:00 pm

Basically. The hash value is calculated against a concatenation of the natural key elements. The idea is to have a much smaller index structure (the hash) that gets you to the row 99% of the time and a small scan the rest of the time.

But, if, based on the original post, you can split the table and significantly reduce the size of the individual junk tables, it's not worth the bother. It does work well with something like an address dimension, with 10's or 100's of million rows, or tracking search phrases from web logs, and so forth. You can achieve very significant space savings as well as improved query speeds for surrogate key lookups.

hang Thu Jan 05, 2012 9:56 pm

CRC function, or hash checksum, can take an arbitrary number of attributes and turn them into a single long integer value which is almost unique across billions of possible combinations. You may have an additional attribute in your junk dimension to store this value or calculate it on the fly by ETL, so instead of comparing all the attributes, you can just compare two integer values, like this:

Select Dim_ID(Surrogate Key)
from Profile_Dim1 PD
join Profile_Source PS
on hash(PD.attribute1,PD.attribute2...)=hash((PS.attribute1,PS.attribute2...)

You would need to use insert ... not exist SQL to maintain the junk dimension. In my experience, the hash values would likely produce clash for very large dimensions, in which case, if performance is critical, you would only use it to narrow down the search first, then do the comparison as suggested by ngalemmo. Otherwise, your brute force field-by-field comparison is just fine for most cases.

The point of using junk dimension is to denormalise as many as reasonably possible attributes into minimal number of dimension tables for the purpose of better performance and ease of use (less joins) at a little price of extra ETL process.

VTK Thu Jan 05, 2012 11:39 pm

Hang,
I understand the function of CRC or Checksum but I was trying to understand what ngalemmo said. I got it...
But my question is as general rule how many junk dimension can we have in a star schema ?
If we create a mini or junk dimension from main dimension and if that gets bigger do we break that into many mini dimension ? Then we would end up something similar to Normalised model...Correct ?

Thanks

hang Fri Jan 06, 2012 12:13 am

Yes and no. We combine many low cardinality attributes into minimal number of junk dimensions, so to reduce the number of dimension keys in the fact table to 2-3 (say from 20), and that's big saving on the fact table with reasonable breakup to keep reasonable size for each dimension table. It's not normalisation per se, but denormalisation to certain degree to control the table size.

With mini dimension extracted from a big main dimension, it is normally used to extract fast changing or low cardinality attributes into smaller dimensions and let fact table reflect the SCD correlations, which otherwise would be kept in the main dimension incurring fast growth.

When should we normalise or denormalise the model in dimensional modelling? I would say, denormalise most dimension tables but normalise fact tables and monster dimensions.

Sponsored content

why having <200K rows in Profile Dim is ideal ?

why having <200K rows in Profile Dim is ideal ?

Re: why having <200K rows in Profile Dim is ideal ?

Re: why having <200K rows in Profile Dim is ideal ?

Re: why having <200K rows in Profile Dim is ideal ?

Re: why having <200K rows in Profile Dim is ideal ?

Re: why having <200K rows in Profile Dim is ideal ?

Re: why having <200K rows in Profile Dim is ideal ?

Re: why having <200K rows in Profile Dim is ideal ?

Re: why having <200K rows in Profile Dim is ideal ?

Re: why having <200K rows in Profile Dim is ideal ?