Potential 20+ Billion Row Fact Table

View previous topic View next topic Go down

Potential 20+ Billion Row Fact Table

Post  meb97me on Wed Jun 19, 2013 5:38 pm

As part of our current Data Warehouse project one of the FACT tables we're planning on creating is a daily snapshot of all our customer accounts (10 million+)

Now 10million records snapshotted daily = 3.6 billion records a year (and based on the width of the fact table that equates to about 130GB data a year excluding indexes) and we have 6 years worth of data we want to load so that equates to 21.9 billion records so therefore ~1TB!

The DBAs were a bit shocked by these figures for a single table and were asking why we couldn't use effective from/to dates for the fact table as obviously all the accounts are necessarily going to change each and every day , so there will be a lot or duplication.

I just wondered what others have done with data on this scale. I've got to admit this is bigger than previous implementations I've worked on.

cheers

meb97me

Posts : 34
Join date : 2010-07-28

View user profile

Back to top Go down

Re: Potential 20+ Billion Row Fact Table

Post  ngalemmo on Wed Jun 19, 2013 8:59 pm

Well, it depends on you DBMS.  With Oracle you would use partitions to help things out.  Netezza you don't worry too much about it.

Talk with the business.  A 6 year old balance by day probably isn't of much value.  See if you can limit the number of days you need to keep, then create month-end numbers for the old stuff.  It should cut down the volume considerably.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Potential 20+ Billion Row Fact Table

Post  Mike Honey on Thu Jun 20, 2013 12:08 am

Hi meb97me,

It sounds like your requirement could be met using the Kimball design pattern "Slowly Changing Dimension Type 2" e.g. http://www.kimballgroup.com/2008/09/22/slowly-changing-dimensions-part-2/ 

This complicates the ETL design (compared to a raw snapshot), but does require less data space.  For example a Customer whos SCD2 attributes never change is represented by one row for all time.

Queries involving related Fact tables actually become simpler and very fast, as your ETL process assigns the relevant Customer row to each Fact via the Customer Surrogate Key, then you can just join using that.

Good luck!
Mike
avatar
Mike Honey

Posts : 185
Join date : 2010-08-04
Location : Melbourne, Australia

View user profile http://www.mangasolutions.com

Back to top Go down

Re: Potential 20+ Billion Row Fact Table

Post  umutiscan on Thu Jun 20, 2013 3:16 am

What do you mean with "a daily snapshot of all our customer accounts" ? Is this a snapshot including all account information?

If this is a snapshot for all account attributes, DB admins are right. You need to create an account dimension to store dimnensional attributes like account type, creation date, owner etc.

But you cannot hold everything with start/end dates. You also need a snapshot fact to hold facts (like EOD balance amount) because business needs trend, graph or average them over time. They can not easily trend them if you model EOD balance amount in a dimension. This is the first reason why you should have a fact table. On the other hand EOD balance is usually a rapidly  changing attribute, so it has to be taken out of the SCD. This is the second reason.

I also agree with Ngalemmo. I don't think that your business really needs a 6 year old daily snapshot data.

umutiscan

Posts : 51
Join date : 2010-11-26
Age : 36
Location : Istanbul, Turkey

View user profile

Back to top Go down

Re: Potential 20+ Billion Row Fact Table

Post  meb97me on Thu Jun 20, 2013 3:39 am

Sorry yeah i probably didn't give enough details

We are already implementing SCD2 dimensions so the account dimension(and the others) will be slowly changing.

The daily shapshot FACT table would capture the relationship of the account to the other dimensions at the end of the day such as Status, Account Manager, Portfolio and obvioulsy metrics such as account balance etc are liable to change on a daily basis. 

The business are interested in knowing and keeping this historical view of the accounts as the modelling teams use the data to model how accounts have performed/changed over time and use this information to help them model/predict potential new portfolios of accounts that they may purchase, so understanding the lifecycle of these accounts is important and rolling up the snapshot to a week or month would lose the all important granularity.

We're using MSSQL so partitioning is certainly an option

meb97me

Posts : 34
Join date : 2010-07-28

View user profile

Back to top Go down

Re: Potential 20+ Billion Row Fact Table

Post  hkandpal on Thu Jun 20, 2013 1:15 pm

Hi,

when you do the historical reporting, are you going to chek/compare the data for the  past 6 years for an account or you may check year/quaterly/monthly/daily ?
Could you please tell how will your report be for a account which changed 20 times in a month and for other which did not changed in a year.

thanks

Himanshu

hkandpal

Posts : 113
Join date : 2010-08-16

View user profile

Back to top Go down

Re: Potential 20+ Billion Row Fact Table

Post  umutiscan on Fri Jun 21, 2013 1:27 am

meb97me, why do you track dimension relations in a daily snapshot? You can create factless facts to track the relations and you insert a new row when any relation changes or any new relation occurs. This will reduce the data size.

You have to continue to hold the metrics in a fact, but as I said before you need to limit the number of days to be held in your fact table.

umutiscan

Posts : 51
Join date : 2010-11-26
Age : 36
Location : Istanbul, Turkey

View user profile

Back to top Go down

Re: Potential 20+ Billion Row Fact Table

Post  Abhiraizada on Tue Jun 25, 2013 1:57 am

I suggest keeping daily snapshot for all accounts (even if most of them have no activity every day) will cause huge data explosion. Would recommend keep snapshot for accounts who have activity that day and try to handle it on reporting end.

Otherwise go ahead with daily snapshot for all account and partition it on monthly basis.

Abhiraizada

Posts : 20
Join date : 2011-05-24

View user profile

Back to top Go down

Re: Potential 20+ Billion Row Fact Table

Post  sgudavalli on Tue Jun 25, 2013 3:19 pm

meb97me wrote:As part of our current Data Warehouse project one of the FACT tables we're planning on creating is a daily snapshot of all our customer accounts (10 million+)

Now 10million records snapshotted daily = 3.6 billion records a year (and based on the width of the fact table that equates to about 130GB data a year excluding indexes) and we have 6 years worth of data we want to load so that equates to 21.9 billion records so therefore ~1TB!

The DBAs were a bit shocked by these figures for a single table and were asking why we couldn't use effective from/to dates for the fact table as obviously all the accounts are necessarily going to change each and every day , so there will be a lot or duplication.

I just wondered what others have done with data on this scale. I've got to admit this is bigger than previous implementations I've worked on.

cheers

Hi meb97me,

if the data analysts do require the daily snapshots of the data to generate predictive models then we cannot run away from it. 
and I don't think Storage space is an issue these days; 
yes but you need to definetly think about how data engineers are going to retrieve it. partitioning is a must as the size of data keep on growing in a consistent manner in your case.

As one of the folks suggested you can check if your analysts are okay with a monthly/weekly/bi-weekly snapshot of account facts instead of daily snapshots. This way you are trying to reduce the magnitude of data.

yes; you can very well maintain effective from and to dates i.e.. 
lets say you got an account and in a month it got only 2 transactions; you can pretty much manage having 2 records on the day when transaction happened instead of having 30 records; but its a work on ETL and is okay for the very first time you are setting up your DW.

Regards
Shiv

sgudavalli

Posts : 29
Join date : 2010-06-10
Age : 32
Location : Pune, India

View user profile

Back to top Go down

Re: Potential 20+ Billion Row Fact Table

Post  Sponsored content


Sponsored content


Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum