Snowflaking Dimensions

Sideout72 Wed Feb 09, 2011 3:51 pm

I've got two related questions.

Question 1: Building an insurance data warehouse, you have a DimPolicy table (for the insurance policy) and a FactPolicyTransaction table (for the transactions on that policy). There is an insurance agent that is responsible for the policy as a whole, and an insurance agent (which may be a different person) that is responsible for the transaction on that policy. There are a large number of attributes about the agent, which leads me to believe that putting all of the attributes in DimPolicy for the agent (where the agent is responsible for thousands of policies) doesn't make sense - therefore, you snowflake the agent, and create a DimAgent table which has it's PK as a FK within DimPolicy. Now you can tell which agent is responsible for the policy. Now, you also need a relationship between DimAgent and FactPolicyTransaction to denote which agent actually made the policy transaction. This creates a "triangle", since DimAgency has its key in both DimPolicy and FactPolicyTransaction, and DimPolicy also has its key in FactPolicyTransction. The question - is this "triangle" ok? It seems awkward to me, and against the philosophies of dimensional modeling.

Question 2: You have a company (brokerage), and an agent works for a company. There are 30+ attributes for the brokerage, and 20+ attributes for an agent. It seems funny to create a DimAgent dimension which includes all 30 of the brokerage's attributes repeated over and over again. By doing that, you will constantly need to run a "SELECT DISTINCT" on the DimAgent table to get a valid list of brokerages (since they are repeated over and over again). Is this a prime time to snowflake and create a DimBrokerage dimension as well? And if so, how would this relate to question 1, where we might have already snowflaked the DimAgent dimension once...

Thanks in advance for your help.

BoxesAndLines Wed Feb 09, 2011 7:05 pm

1. That's one way to do it. The other would be to create a factless fact table to manage the relationships between Agents and Policies. The latter is dimensionally more sound. As long as you don't get carried away, the occasional relationship between dimensions is acceptable. A common one you'll find in my models are relationships from a dimension to the Date dimension.

2. See answer 1.

Don't get caught up in replicating the data. Even though data is repeated, in the grand scheme of it, the dimensions take up a nominal amount of space.

ngalemmo Wed Feb 09, 2011 7:28 pm

For #1 you have an agent dimension, not agent attributes on the policy dimension. For facts associated with the policy you have two agent FKs off the fact, one for the agent related to the policy and the other for the agent related to the transaction. To make ETL simpler, you can carry the NK or PK of the policy's agent on the policy dimension, so you can easily place it in the fact table. The value on the policy dimension would not be used by the end user to join to the agent, so it is not a snowflake. There is no reason for a snowflake.

As far as # 2 goes, I usually maintain small list of value (LOV) tables in addition to the normal dimension table. The LOV tables are not for general use, but rather to support BI tool functionality and change detection in the ETL process. One the BI side, most reporting tools allow for drop down lists to select things. The LOV tables provide an efficient means to implement such lists. On the ETL side, it provides an easy way to check to see if a description has changed. I usually update the table from the source and have a trigger to check for change. If a change is detected, the trigger updates the description in all the dimension tables where the particular code and description appear.

Jeff Smith Wed Feb 09, 2011 7:38 pm

For question 1, if you need to have the Agent that wrote the policy as well as the agent that handled the transaction, then create 1 Agent Dimension but put 2 Keys on the transaction fact table, one for Policy Agent and the other for Transaction (or claim?) Agent. However, this will give you the Agent of the policy at the time of the transaction. If you have a situation where a Policy can be "owned" by different Agents over time and you needed to know the current owning Agent, then you could put the Agent Dimension Key on Policy Dimension and call it "Current Agent".

You could do both, which would be handy if you wanted to aggregate "claim amount" by each owning agent and compare it to the Premium by each owning agent to calcualte a loss ratio. But you could do the same aggregation by current agent if you wanted to restate history based on how the world currently looks (handy when setting baselines for next years goals).

On Question 2, this is where the art comes in. You could create 2 dimensions. if you needed to know the agents and their company relationship, create a factless fact table with the surrogate keys from the 2 dimensions. Or, you could create 1 dimension table and make Company a roll up point, creating a surrogate key based on Company. If you needed to create an aggregate table to the Company, then aggregate it to the Company surrogate key and then create a view of the Dimension Table where you select distinct on the columns that are pertinent to the Company. This view can be a drag if your Agent Dimension is really long and the combined Agent and Company info in a long Agent Dimension could take up a large amount of space. But the up side is that one dimension reduces the number of columns on your fact table.

It all depends on how it's used and the impact on performance.

Sideout72 Thu Feb 10, 2011 11:23 am

All of your responses are greatly appreciated. In the end, all the responses brought up the various ways I thought of to solve the design questions, and I wasn't sure which way to go. As always, the answer to many questions is "it depends".
The various solutions all make perfect sense, and I'm going use a combination of them in the end. To summarize what everyone suggested (and paraphrased somewhat):

1. Put the DimAgent key into the Fact table, one for the "TransactionAgent" and one for the "PolicyAgent".
2. Add the DimAgent key into the DimPolicy table, in effect snowflaking.
3. Use a factless fact table to model relationships between dimensions when you don't want to go through a "factfull" fact table to get there (especially those cases were there are relationships, but no fact transactions yet).
4. Create views on the dimension tables for various rollups to get you "lists" of select criteria data
5. Creation of a "list of values" table to support homegrown reporting solutions and simplify change detection.

In addition, the best advice I've given to anyone is (and it was mentioned by BoxesAndLines) - don't worry about the denormalization of data within a dimension because of disk space - worry about modeling the data correctly and for performance.

The only piece I somewhat disagree with is this: I can't (completely) think about what BI tools are going to be put on top of the DW and what features they will support. I'm going to solve the modeling problem - BI tools come and go over time.

Sponsored content

Snowflaking Dimensions

Snowflaking Dimensions

Re: Snowflaking Dimensions

Re: Snowflaking Dimensions

Re: Snowflaking Dimensions

Re: Snowflaking Dimensions

Re: Snowflaking Dimensions