Kimball Forum

Very Large Dimension Strategy

5 posters


Post  tpolak Thu Jan 03, 2013 2:35 pm

Hi,

I have handled dimensions of roughly 25 million rows before, but I have now encountered a dimension of about 150 million rows. The dimension holds Account Demographic information, and an account can have multiple addresses linked to it (one is the Primary; others can be employee, spouse, etc.). I am able to incrementally grab the new or changed rows (roughly 100K rows a day).

What I am doing now: I have a staging table that holds the incoming rows. In the ETL for the dimension table, I was thinking of joining to the staging table with a LEFT OUTER JOIN and using a conditional split in SSIS: if the business key is NULL, it's a new row; if it exists, it's an update. I have a clustered index on the business key (AccountID, DivisionID, and AddressType), but the issue I am encountering is during the insert of new rows. The package performs very well until it gets to the insert, and I suspect the clustered index is the cause.
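For reference, the join-based detection step described above might look roughly like this in T-SQL (all table and column names here are hypothetical stand-ins):

```sql
-- stg_Account holds the ~100K incremental rows; dim_Account is the 150M-row dimension.
SELECT  s.AccountID,
        s.DivisionID,
        s.AddressType,
        s.DemographicCol1,
        -- NULL on the dimension side means the business key was not found: a new row.
        CASE WHEN d.AccountID IS NULL THEN 'INSERT' ELSE 'UPDATE' END AS RowAction
FROM    dbo.stg_Account AS s
LEFT OUTER JOIN dbo.dim_Account AS d
        ON  d.AccountID   = s.AccountID
        AND d.DivisionID  = s.DivisionID
        AND d.AddressType = s.AddressType;
```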

My questions are: is there a better way to design my ETL, and should I drop and recreate the indexes? Dropping them may be the best approach, but rebuilding an index on a table with 150-200 million rows could be quite time consuming.

Thanks!

tpolak

Posts : 6
Join date : 2012-12-13


Re: Very Large Dimension Strategy

Post  thedude Fri Jan 04, 2013 8:04 am

I would probably stay away from SSIS transformations at those data volumes. Use SSIS to move the incremental data close to the target table, then use SQL to process the changes; a stored procedure is probably best. SQL Server will handle this much better using set-based operations.
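A minimal sketch of the set-based approach, assuming hypothetical table and column names (this statement could be the body of the stored procedure):

```sql
-- Apply the staged incremental rows to the dimension in one set-based statement.
-- stg_Account and dim_Account are illustrative names.
MERGE dbo.dim_Account AS d
USING dbo.stg_Account AS s
    ON  d.AccountID   = s.AccountID
    AND d.DivisionID  = s.DivisionID
    AND d.AddressType = s.AddressType
WHEN MATCHED AND d.DemographicCol1 <> s.DemographicCol1 THEN
    -- Existing business key with changed attributes: update in place.
    UPDATE SET d.DemographicCol1 = s.DemographicCol1
WHEN NOT MATCHED BY TARGET THEN
    -- Business key not in the dimension yet: insert a new row.
    INSERT (AccountID, DivisionID, AddressType, DemographicCol1)
    VALUES (s.AccountID, s.DivisionID, s.AddressType, s.DemographicCol1);
```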

thedude

Posts : 21
Join date : 2009-02-03
Location : London


Re: Very Large Dimension Strategy

Post  thedude Fri Jan 04, 2013 8:06 am

You may also need to drop or disable indexes when doing a large insert.

thedude

Posts : 21
Join date : 2009-02-03
Location : London


Re: Very Large Dimension Strategy

Post  tpolak Fri Jan 04, 2013 9:27 am

Thanks for your reply!

Unfortunately our client wants everything done in SSIS for the dimension and fact table loads; they want to stay consistent with how they are used to handling data loads. I will try dropping and re-adding the indexes during the ETL process.

tpolak

Posts : 6
Join date : 2012-12-13


Re: Very Large Dimension Strategy

Post  Jeff Smith Fri Jan 04, 2013 12:00 pm

I don't think you want to drop and rebuild a clustered index on a table with 150 million rows. But I also think the clustered index shouldn't be on the business key.

In my opinion, you shouldn't have a clustered index on the business key of the dimension table. The dimension key should be defined as a clustered primary key. You can put a non-clustered index on the business key, but I'm not sure I see the point unless you are using it to identify the rows from the source that need to be added to the dimension table.

The clustered index on the business key is going to kill the load. Every new insert is going to cause the table to re-sort itself, eating up a ton of temp space. Dropping the clustered index and then rebuilding it also causes the table to be re-sorted, again eating up a ton of temp space. Besides, I would have thought that the only benefit of a clustered index on the business key would be during the load process, because it's not going to help in querying.

I would redefine the clustered index as a primary key on the dimension key; having it on the business key serves no purpose. I would load the dimension table in two steps. First, I would create a staging table of just the rows from the source that I wanted to load into the dimension table: I would compare the source data to the rows in the dimension table and load any rows that are new or different into the stage table. I would then use the stage table as the source for the dimension build.
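The staging step described above might be sketched like this (table and column names are hypothetical):

```sql
-- Step 1: stage only the source rows that are new or changed.
INSERT INTO dbo.stg_DimAccount (AccountID, DivisionID, AddressType, DemographicCol1)
SELECT  src.AccountID, src.DivisionID, src.AddressType, src.DemographicCol1
FROM    dbo.src_Account AS src
LEFT JOIN dbo.dim_Account AS d
       ON  d.AccountID   = src.AccountID
      AND  d.DivisionID  = src.DivisionID
      AND  d.AddressType = src.AddressType
WHERE   d.AccountID IS NULL                        -- new row
   OR   d.DemographicCol1 <> src.DemographicCol1;  -- changed row

-- Step 2: use dbo.stg_DimAccount (now small) as the source for the dimension build.
```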

I have a similar issue with a large dimension table and this is how I handled it.

Jeff Smith

Posts : 471
Join date : 2009-02-03


Re: Very Large Dimension Strategy

Post  tpolak Fri Jan 04, 2013 12:10 pm

Thanks for your reply!

The reason I had a clustered index on the business key is that I perform a join to see which rows already exist in the dimension (updates) and which are new rows. I found that having the index on the business key made the join perform a lot faster on 150 million rows.

In your method, how would you determine which rows are updates and which are new? I usually use a cached lookup for small dimensions, but I didn't want to use one on this table, which is why I performed a join back to the dimension table.

tpolak

Posts : 6
Join date : 2012-12-13


Re: Very Large Dimension Strategy

Post  thedude Fri Jan 04, 2013 12:32 pm

I overlooked the clustered index; this is bad. You need a surrogate key on your dimension that is the primary key (and, by default, a clustered index). Do not create a clustered index over the business key. Create a non-clustered index over the business key instead; this will be needed when processing incremental changes.
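A sketch of that table design, with hypothetical names and columns:

```sql
-- Surrogate key as the clustered primary key; business key gets a
-- non-clustered index to support the incremental-change join.
CREATE TABLE dbo.dim_Account
(
    AccountKey      INT IDENTITY(1,1) NOT NULL,   -- surrogate key
    AccountID       INT           NOT NULL,       -- business key, part 1
    DivisionID      INT           NOT NULL,       -- business key, part 2
    AddressType     VARCHAR(20)   NOT NULL,       -- business key, part 3
    DemographicCol1 VARCHAR(100)  NULL,           -- illustrative attribute
    CONSTRAINT PK_dim_Account PRIMARY KEY CLUSTERED (AccountKey)
);

CREATE NONCLUSTERED INDEX IX_dim_Account_BusinessKey
    ON dbo.dim_Account (AccountID, DivisionID, AddressType);
```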

thedude

Posts : 21
Join date : 2009-02-03
Location : London


Re: Very Large Dimension Strategy

Post  Jeff Smith Fri Jan 04, 2013 12:54 pm

The clustered index would make the join faster but will kill the load. Dropping it and recreating it removes any benefit of the clustered index, but even so, the clustering should be on the dimension key.

Two steps - identify new rows by comparing the fields you used to create the business key in an outer join.

For updates - after you've loaded the new rows, run update queries from the source data. I'll usually put in a WHERE clause limiting the updates to values that are different (UPDATE d SET d.col1 = s.col1 ... WHERE d.col1 <> s.col1). Doing this in an inner join limits the updating to rows that have actually changed. It's a little clunky, but that's what happens when you have a dimension table with 150 million rows.
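Spelled out as a full statement (hypothetical names), the filtered update looks like:

```sql
-- The WHERE clause keeps the update from touching rows that have not changed.
UPDATE d
SET    d.col1 = s.col1
FROM   dbo.dim_Account AS d
INNER JOIN dbo.stg_Account AS s
       ON  s.AccountID   = d.AccountID
      AND  s.DivisionID  = d.DivisionID
      AND  s.AddressType = d.AddressType
WHERE  d.col1 <> s.col1;
```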

Are there really 150 million accounts? If not, then I would think about breaking up the dimension table so that the unique combinations of the business key are significantly smaller.

Jeff Smith

Posts : 471
Join date : 2009-02-03


Re: Very Large Dimension Strategy

Post  tpolak Fri Jan 04, 2013 1:05 pm

Thanks for your replies.

I made it a clustered index because it is a three-part business key and I thought that might be faster. I will admit, I do not have much experience with indexing strategies.

Also, there are 123 million distinct accounts; the other 27 million rows exist because two or more addresses are linked to an account.

tpolak

Posts : 6
Join date : 2012-12-13


Partitioned Clustered Indexes

Post  bmansouri Wed Mar 20, 2013 8:46 pm

Did you try partitioned clustered indexes? You can disable your index before the ETL and rebuild only the affected partition after the load.
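As a sketch of the partition-level rebuild (assuming the dimension is built on a partition scheme; the index name and partition number here are hypothetical):

```sql
-- Rebuild only the partition touched by the incremental load,
-- instead of the whole 150M-row index.
ALTER INDEX IX_dim_Account_BusinessKey
    ON dbo.dim_Account
    REBUILD PARTITION = 3;   -- 3 is illustrative; use the affected partition number
```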

Ben Mansouri

bmansouri

Posts : 1
Join date : 2013-03-20


