How to handle Large dimensions 35M records


Post  wizard on Wed Dec 08, 2010 11:36 pm

Hi all.
I have a dimension with 35+ million records, compared to 225 million in the fact table. I am not sure what kind of penalty we will have to pay for such a big dimension. Here is a summary of this dimension.

This dimension is derived from a transactional table and provides information about the operator entering the data. Because it provides the name of the operator for each customer transaction entered, it needs to be at the same granularity as the transaction table. We will need to report on this dimension, so we cannot make it part of the fact table.

Is there any specific way of handling such a huge dimension?

Thanks


Re: How to handle Large dimensions 35M records

Post  BoxesAndLines on Thu Dec 09, 2010 1:42 pm

wizard wrote: Hi all.
...This dimension is derived from a transactional table and provides information about the operator entering the data. ...

I suspect you don't have 35M operators, so what other information are you storing in your operator dimension that causes so many rows?

Re: How to handle Large dimensions 35M records

Post  wizard on Thu Jan 06, 2011 10:45 am

It also provides transaction_id, which is the natural key, and the transaction type. There are only 36,000 distinct operators.
I need to include the natural key because this is an SCD.

Thanks


Re: How to handle Large dimensions 35M records

Post  Jeff Smith on Thu Jan 06, 2011 11:38 am

Why do you need to include the transaction ID in the operator dimension? Why not make the transaction ID a degenerate dimension on the fact table?
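
As a rough illustration of what a degenerate dimension means in practice (the field names and values below are made up, not taken from this thread): the transaction ID simply lives on the fact record itself, with no transaction dimension table behind it.

```python
# A degenerate dimension is just an identifier carried on the fact row itself,
# with no dimension table behind it. All field names and values here are made up.
fact_row = {
    "transaction_id": "TX-1001",  # degenerate dimension: no dim_transaction table exists
    "operator_key": 42,           # surrogate FK to a small operator dimension
    "amount": 19.95,
}

# The identifier is queried straight off the fact table.
print(fact_row["transaction_id"])
```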



Re: How to handle Large dimensions 35M records

Post  wizard on Thu Jan 06, 2011 1:06 pm

Let me give some more detail about this dimension. It has 10 columns, with transaction_id as the natural key. It's not really an operator dimension; operator is just one attribute in this dimension. The dimension gives more detail about the transactions, like status, the place it was performed, etc.
The reason for having transaction_id is that it uniquely identifies each row and will be required if any attribute needs to be updated.

The fact table is more granular than this dimension.

Thanks


Re: How to handle Large dimensions 35M records

Post  hang on Thu Jan 06, 2011 8:58 pm

30+ million records is definitely a ridiculous number for a dimension that has only 36,000 distinct members. I would guess there is a huge amount of redundant information in the table. Although we say a dimension table should be denormalised in dimensional modeling, that does not mean we can turn a dimension into something like a fact.

In dimensional modeling, there is a well-known technique called Slowly Changing Dimensions (SCD) which can be used to eliminate all the unnecessary redundant data. In essence, the way SCD works is to define attributes as type 2 (SCD2) if their changes need to be tracked, while many attributes can be defined as type 1 (SCD1) if their changes can be ignored and simply overwritten by the latest values. So if, on average, each operator has had its SCD2 attributes changed 5 times throughout its history, there would be only around 180,000 records (36,000 x 5) in the operator dimension.
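
To make that arithmetic concrete, here is a minimal sketch of collapsing a transaction-grain snapshot into SCD2 rows, using a hypothetical operator_id/operator_name/team layout (none of these names come from the actual tables being discussed): a row survives only when a tracked attribute actually changes, and each surviving row gets a surrogate key.

```python
import pandas as pd

# Hypothetical transaction-grain extract: one row per transaction, with the
# operator's attributes repeated on every row (the source of the 35M rows).
snap = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "operator_id":    ["OP1", "OP1", "OP1", "OP2", "OP2"],
    "operator_name":  ["Ann", "Ann", "Ann", "Bob", "Bob"],
    "team":           ["A",   "A",   "B",   "C",   "C"],   # SCD2 attribute
})

scd2_cols = ["operator_name", "team"]          # attributes whose history we track
snap = snap.sort_values(["operator_id", "transaction_id"])

# Keep a row only when any tracked attribute differs from the operator's previous row.
changed = (
    snap.groupby("operator_id")[scd2_cols].shift() != snap[scd2_cols]
).any(axis=1)
dim_operator = snap.loc[changed, ["operator_id"] + scd2_cols].reset_index(drop=True)
dim_operator["operator_key"] = dim_operator.index + 1   # surrogate key

print(dim_operator)   # 3 rows instead of 5: OP1/team A, OP1/team B, OP2/team C
```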

In your original design, you would find that many operator records have been copied into the dimension numerous times even though they have not changed. This snapshot style of keeping history is very common in OLTP systems, and that is one of the main reasons why we need SCD2 in BI systems.

BTW, your transaction_id is definitely a degenerate dimension within the fact table, as Jeff said, and you should not use it in any dimension to track attribute changes. Instead, you use a surrogate key for that purpose.


Re: How to handle Large dimensions 35M records

Post  wizard on Thu Jan 06, 2011 9:35 pm

hang wrote: ... BTW, your transaction_id is definitely a degenerate dimension within the fact table, as Jeff said, and you should not use it in any dimension to track attribute changes. Instead, you use a surrogate key for that purpose.

Thanks for the detailed reply. If I move the transaction_id into the fact table, then how will I update any attribute? I.e., if the status for a particular transaction changes from in-process to completed in the source, how do I make the same change in the dimension if I don't have a natural key to distinguish each row?

Again, my questions might be super easy for you and not make much sense, but bear with me as I am new to data warehousing.
Secondly, what advantage do I get by moving it to the fact table?


Re: How to handle Large dimensions 35M records

Post  ngalemmo on Thu Jan 06, 2011 9:55 pm

wizard wrote: ... It's not an operator dimension; operator is just an attribute in this dimension. The dimension gives more detail about the transactions, like status, the place it was performed, etc.

If that is the case, then you should have dimensions such as operator, status, location, etc., with FKs on the fact. Transaction ID is a degenerate dimension on the fact.

Benefit? Instead of a 35M row dimension table, you have 3 or 4 dimensions with maybe 100 rows or so. Much better performance. Plus, you may have conforming dimensions (such as operator and location) that you can use in other places.

If the nature of the transaction changes, update the fact.
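
Sketched with made-up names (dim_operator, dim_location, dim_status, fact_transaction are illustrative, not the actual schema), the star described here could look roughly like this, with transaction_id kept as a degenerate dimension on the fact:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Small dimensions: one row per distinct operator / location / status.
    CREATE TABLE dim_operator (operator_key INTEGER PRIMARY KEY, operator_name TEXT);
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, location_name TEXT);
    CREATE TABLE dim_status   (status_key   INTEGER PRIMARY KEY, status_name   TEXT);

    -- Fact at transaction grain: FKs plus the degenerate transaction_id.
    CREATE TABLE fact_transaction (
        transaction_id TEXT PRIMARY KEY,
        operator_key   INTEGER REFERENCES dim_operator,
        location_key   INTEGER REFERENCES dim_location,
        status_key     INTEGER REFERENCES dim_status,
        amount         REAL
    );

    INSERT INTO dim_operator VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO dim_location VALUES (1, 'Brisbane');
    INSERT INTO dim_status   VALUES (1, 'in-process'), (2, 'completed');

    INSERT INTO fact_transaction VALUES ('TX-1001', 1, 1, 1, 19.95);
""")

# Reporting by operator only touches a tiny dimension plus the fact.
print(con.execute("""
    SELECT o.operator_name, s.status_name, f.transaction_id
    FROM fact_transaction f
    JOIN dim_operator o ON o.operator_key = f.operator_key
    JOIN dim_status   s ON s.status_key   = f.status_key
""").fetchall())
```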

Re: How to handle Large dimensions 35M records

Post  wizard on Thu Jan 06, 2011 10:03 pm

ngalemmo wrote: ... If the nature of the transaction changes, update the fact.

Thanks. Could you please explain: if the nature of a transaction changes from in-process to completed, how and what changes in the fact?


Re: How to handle Large dimensions 35M records

Post  hang on Thu Jan 06, 2011 11:35 pm

If the transaction is in-process, why do you want to commit it to the database anyway? You could keep checking the status, or leave the data in the staging area and load them only once they are completed.

In case you do want to measure how many transactions are in-process and how many are completed, you can always have a status dimension with an FK in the fact table, or, as a more elegant approach along the lines ngalemmo suggested, combine it with other low-cardinality attributes in a junk dimension. You can google for more on using junk dimensions.
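
As a rough sketch of a junk dimension, assuming a handful of hypothetical low-cardinality attributes (status, channel, priority; none of these are from the actual source system): the combinations are rolled into one small table, and the fact then carries a single junk key instead of several tiny FKs.

```python
from itertools import product

# Hypothetical low-cardinality attributes to be bundled into one junk dimension.
statuses = ["in-process", "completed", "cancelled"]
channels = ["branch", "online"]
priority = ["normal", "urgent"]

# Full cross-product: 3 x 2 x 2 = 12 rows, each given a surrogate key.
dim_txn_junk = [
    {"junk_key": i, "status": s, "channel": c, "priority": p}
    for i, (s, c, p) in enumerate(product(statuses, channels, priority), start=1)
]

# A fact row then carries a single junk_key instead of three separate FKs.
lookup = {(r["status"], r["channel"], r["priority"]): r["junk_key"] for r in dim_txn_junk}
fact_row = {"transaction_id": "TX-1001",
            "junk_key": lookup[("in-process", "online", "normal")]}
print(len(dim_txn_junk), fact_row)
```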




Re: How to handle Large dimensions 35M records

Post  ngalemmo on Thu Jan 13, 2011 12:15 am

wizard wrote: Could you please explain: if the nature of a transaction changes from in-process to completed, how and what changes in the fact?

There are three basic types of fact tables: snapshots, accumulating snapshots and delta tables. The latter two are used to maintain history of the fact.

The simplest is a snapshot fact which has no history and only contains the current state of the fact. In such a table, you simply update the fact row in place and change the particular FK reference.
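
For the in-process to completed case on such a current-state fact, the change amounts to repointing the status FK on the fact row, located by its degenerate transaction_id. A minimal sketch, again with made-up table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_status (status_key INTEGER PRIMARY KEY, status_name TEXT);
    CREATE TABLE fact_transaction (transaction_id TEXT PRIMARY KEY, status_key INTEGER, amount REAL);
    INSERT INTO dim_status VALUES (1, 'in-process'), (2, 'completed');
    INSERT INTO fact_transaction VALUES ('TX-1001', 1, 19.95);
""")

# Source says TX-1001 is now completed: update the fact row in place, keyed by
# the degenerate transaction_id, and repoint its status FK. No dimension row changes.
con.execute("""
    UPDATE fact_transaction
    SET status_key = (SELECT status_key FROM dim_status WHERE status_name = 'completed')
    WHERE transaction_id = 'TX-1001'
""")
print(con.execute("SELECT transaction_id, status_key FROM fact_transaction").fetchall())
```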
