Space savings in Star Vs 3NF

Jeff Smith Thu Mar 25, 2010 2:11 pm

I was curious if anybody had any estimates of the savings in space between a 3NF database and the same Data in a Star Schema? If a 3NF database took up 2.5TB, would the Star be 1.25TB? I was just wondering if there was a rule of thumb about it.

In looking at the big tables in the 3NF, I see a lot of fields that can be grouped together in a junk dimension, plus they are using a lot of Datetime fields that I would replace with the a surrogate Key from my Date Dimension.

ngalemmo Thu Mar 25, 2010 3:51 pm

Propronents of a 3NF would argue just the opposite... a fully normalized database uses less storage than a dimensional model.

But the bottom line is, what difference does it make? With a TB of high-end disk storage costing well under $2,000, disk consumption is not an issue. A typical office burns up a lot more than that e-mailing pictures of kittys and incomprehensible PowerPoint presentations to co-workers.

BoxesAndLines Thu Mar 25, 2010 10:47 pm

I agree with Nick. You have your numbers backwards. Space savings is not a side effect of dimensional modeling. Fortunately, the fact table, which is the most voluminous table should be 2NF. Should is the key word. I see many fact tables cluttered with dimension columns because joins are too expensive or the dimensions are type 1 and I can't be bothered building a type 2.

Jeff Smith Fri Mar 26, 2010 12:53 pm

I agree that disk is cheap but I'm not the one setting policy about the type of disk, the need for RAID 5, the need for Prod, test, dev. A 3 TB database can easily require 12 TB of disk. If you are using high performance SAN, the cost per TB can be around $20K including additional memory, and before you know it, you're approaching $250K. I don't think it's a lot of money, but then again I think RAID 5 in a DW environment is a waste of good disk.

I disagree about the space savings. There's plenty of stuff in a 3NF that can be put together in junk dimensions. Wasn't it Kimball that said more than 11 dimensions per fact is too many, suggesting the fact table should be no more than 44 bytes plus measures. I was looking at a table in a 3NF database that had 7 datetime fields and 9 Flags/Indicator fields. Assume each flag is 1 Byte - that's (7 *8) + 9 = 65 bytes. Replace the dates with Keys to a date dimension and put the Flags in a junk dimension and each row is (7*4)+4 = 32 bytes - I've cut the size of the table in half. Granted, I get little savings from measures and the dimensions take up space as will aggregate tables but there are some small hierarchies that can be combined into a single dimension and some times its possible to put little used date fields into a dimension table that snow flakes to the date dimension (appropriate in some instances). There's also the issue of VARCHAR in the source and CHAR in the Dimension tables not to mention space ofr indexing (I have no idea how a 3NF transaction system deals with indexing)

I was just trying to get a ballpark estimate of the amount of disk I would need based on the size of the source database. I realize that I could get pretty good estimates once I started the initial modeling but was wondering if there was a rule of thumb about these things.

ngalemmo Fri Mar 26, 2010 2:06 pm

What we are saying is: choose a modeling method based on its applicabilty to the problem at hand, not on minor technical issues. Data warehouses are built for end-users (i.e. the business). Deliver a solution they can easily understand and use. The advantage of dimensional modeling is the concepts fit the way business people think. I mean, Dr. Kimball didn't 'invent' dimensional modeling... he clarified/formalized the approach to achiving dimensional models in a relational database environment. Businesses have been using what we know as dimensional models for a very, very long time.

If you can deliver a solution to addresses a business need and one that delivers true value, cost is almost never the issue.

In terms of trying to figure out how much space you will need... take your best guess and multiply by 3. Then attach a note saying it would be re-evaluated each year.

Jeff Smith Fri Mar 26, 2010 5:45 pm

Cost is never an issue??? Where? That's where I want to work.

Cost is always an issue. The first word out any manager's mouth is, "what's the cost". If you need $250K worth of disk and they've budgetted $100K, then yeah, I think cost is going to be an issue. A key IT project is over budget, they will come looking to cut costs where ever they can.

Where have you ever been that cost wasn't an issue? And not only is cost and issue, so is politics.

You got 2 people arguing 2 different solutions. Of course DW is the way to go. But most senior IT managers don't know a fig about DW. One person says, hey, let's build an ODS in 3NF and have real time data. They say, "reporting is easy, we can create views that do the same thing as a dimensional warehouse". I know it's the wrong thing to do, but in the real world, the best solution isn't always implemented.

But thanks for the advice about the best guess and multiplying by 3. Real helpful. Appreciate it. I guess saying, "gee, I don't know" would have been too humbling.

Sponsored content

Space savings in Star Vs 3NF

Space savings in Star Vs 3NF

Re: Space savings in Star Vs 3NF

Re: Space savings in Star Vs 3NF

Re: Space savings in Star Vs 3NF

Re: Space savings in Star Vs 3NF

Re: Space savings in Star Vs 3NF

Re: Space savings in Star Vs 3NF