Large Degenerate Dimension Values

View previous topic View next topic Go down

Large Degenerate Dimension Values

Post  Dave Jermy on Mon Oct 03, 2011 6:10 am

I am designing a new dmensional model for an insurance company and users want to analyse the documents (stored in sharepoint) that have been attached to various entities (policies, claims etc.)

I have designed a Document dimension to hold some of the document metadata (document type, file type, document source). This is a relatively small dimension with just a couple of thousand rows. The fact table (grain: 1 row per change to a document's metadata) has FKs to the Date (for created date), Employee (for creator), Policy and Document dimensions and a small number of measures (count, filesize etc.)

This leaves me with a couple of attributes to fit into the model: Document Title and Document URL. The second is unique across all documents and the first is almost so, so my first thought was to put them onto the fact table as degenerate dimensions.

However, they are very long varchar columns (255 and 2000 characters respectively). Looking at the existing documents, the average combined length of the values in these columns is 157 characters (min 99, max 287). Given that the fact table is quite slender (<30 bytes), adding them seems wrong and may impact on query performance. But, I don't want to add them into the Document dimension, either as it would then be almost 1 to 1 with the fact table.

Any suggestions on the best way to model this?

Dave Jermy

Posts : 33
Join date : 2011-03-24
Location : London, UK

View user profile

Back to top Go down

Re: Large Degenerate Dimension Values

Post  BoxesAndLines on Mon Oct 03, 2011 10:42 am

You've identfied the two methods. Do some volume testing to see which works best. If the performance is similar go with the dimension solution since it is cleaner.
avatar
BoxesAndLines

Posts : 1212
Join date : 2009-02-03
Location : USA

View user profile

Back to top Go down

Re: Large Degenerate Dimension Values

Post  VHF on Mon Oct 03, 2011 12:06 pm

Because your fact table represents an instance of a document, it seems like that is where Document Title and Document URL should go as DD. I suspect in this case the majority of usage of this fact table will involve at least one of these fields, so there isn't a lot of performance savings to be gained from trying to move them elsewhere. If most of the useage of this fact table involved aggregating other measures, then it might be worthwhile trying to relocate the long DD columns to keep the fact table narrow.

VHF

Posts : 236
Join date : 2009-04-28
Location : Wisconsin, US

View user profile

Back to top Go down

Re: Large Degenerate Dimension Values

Post  ngalemmo on Mon Oct 03, 2011 2:39 pm

As VHF hinted, it becomes a matter of how often these values are appear in queries. Adding them as DD's, even if it is a document fact table, will severly impact performance. Adding 150+ bytes to a < 30 byte row will have a dramatic hit to I/O. If 80% of the queries need the data, DD's may be the correct option, otherwise a separate dimension, even if the cardinality is high, would be the better choice.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Large Degenerate Dimension Values

Post  Dave Jermy on Tue Oct 04, 2011 4:21 am

Thanks for the responses guys - I'll dig into the requirements a little deeper and do some performance testing.

Dave Jermy

Posts : 33
Join date : 2011-03-24
Location : London, UK

View user profile

Back to top Go down

Re: Large Degenerate Dimension Values

Post  Mike Honey on Tue Oct 04, 2011 6:58 pm

Hi Dave,

Based on your description of the Fact table grain, I'd definitely go for storing them on the Dimension. I imagine there will be several Fact records for each Dimension record so you would be duplicating data if you put those attributes on the Fact. It would also be much easier to manage changes to those attributes if they are on the Dimension.

Good luck!
Mike
avatar
Mike Honey

Posts : 185
Join date : 2010-08-04
Location : Melbourne, Australia

View user profile http://www.mangasolutions.com

Back to top Go down

Columnar-Oriented Database Eliminates Inefficiency of Columns Not Used by a Query

Post  Alan Musnikow on Thu Oct 06, 2011 5:06 pm

This issue demonstrates a benefit of a columnar-oriented database. Adding the columns to the fact table in a columnar-oriented database, such as Infobright, would impose no performance penalty on queries that do not use the columns.

In addition, Infobright would very heavily compress these columns.

In other words: "Adding 150+ bytes to a < 30 byte row will have" no impact on I/O in a columnar-oriented database for queries that do not use the added columns.
avatar
Alan Musnikow

Posts : 6
Join date : 2010-06-17
Location : Lexington, Massachusetts, U.S.

View user profile http://musnikow.com/

Back to top Go down

Re: Large Degenerate Dimension Values

Post  Sponsored content


Sponsored content


Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum