Data Vault v's Dimensional Model

Page 3 of 4 Previous  1, 2, 3, 4  Next

View previous topic View next topic Go down

Dimensional repository standing on it's own

Post  thomaskejser on Tue Aug 30, 2011 9:38 am

Hi ngalemmo

Thanks for chiming in. I agree that there is no need for any other repository than the dimensional model. However, I think Dan makes a very strong point in: http://www.tdan.com/view-articles/5054/

"The problem is scalability and flexibility. If an additional parent table is added, the change is forced to cascade down through all subordinate table structures. Also, when a new row is inserted with an existing parent key (the only field to change is the date-time stamp) all child rows must be reassigned to the new parent key."

This is really just a fancy way to say that if you change your mind about type2 history tracking, you are faced with a problem on your fact table keys. Because you now have to update all keys in all affected facts. I think this is the big one that has hurt some dimensional modelers when they try to scale their data model. Especially if you have trouble writing a fast update (and there are clearly engines that are better at this than others) or believe that there is some law that makes updates hard for databases to do at scale.

I think there is some point in "carving off" the type2 attributes into a separate table because of this issue (This is similar to Dan's satellite). The issue with doing this is that you now have to do a nasty BETWEEN join (on from/to dates) in order to access fact data. Such joins are hard for query optimizers to optimize - because it is difficult to get a good cardinality estimate. OF course, the increased price of the BETWEEN also bubbles up to either the user or the data mart loader - something that has to be accounted for.


Last edited by thomaskejser on Tue Aug 30, 2011 9:39 am; edited 1 time in total (Reason for editing : spelling errors)

thomaskejser

Posts: 12
Join date: 2011-08-14

View user profile

Back to top Go down

Re: Data Vault v's Dimensional Model

Post  ngalemmo on Tue Aug 30, 2011 11:41 am

If an additional parent table is added, the change is forced to cascade down through all subordinate table structures.


I guess it refers to adding a new dimension to an existing fact table. It can happen, but if it was designed right to begin with (enterprise view rather than building to specific reports) it would be rare. And if you are to apply the change retroactively, it is a little bit of work (maybe 10-20hrs).

Also, when a new row is inserted with an existing parent key (the only field to change is the date-time stamp) all child rows must be reassigned to the new parent key.

This I have no clue what it is referring to. If it is a type 2, you never go back and change fact keys when new rows are added. Even if you change the dimension from a type 2 to a type 1, there is never any reason to rekey the facts... you simply adjust the dimension.


The issue with doing this is that you now have to do a nasty BETWEEN join (on from/to dates) in order to access fact data.


Why would you need to do this? Its a simple join between a fact and a type 2. The only time you would filter a join against a type 2 is if you want to adjust the dimension to a point in time other than the point in time of the fact. In other words, getting the current dimension values. Simply implementing both a type 1 and type 2 version of the dimension clears that up.

The thing is, none of these are issues that cannot be resolved or handled in a dimensional model. There may be some simplicities that the data vault provides, but at what cost? It involves creating a separate and distinct environment purely to store data, with no guarantee that you may need to do some rework. Its a lot of effort to maybe save a few man weeks over a period of years because of unknown or unanticipated model changes.

There are a few other choice tidbits in the article:


One of the most difficult issues of a conformed data mart (or conformed fact tables) is getting the grain right. That means understanding the data as it is aggregated for each fact table and assuring that the aggregation will stay consistent for all time (during the life of the relationship) and the structure of each fact table will not change (i.e., no new dimensions will be added to either fact table). This limits design, scalability and flexibility of the data model.


This is kind of aluding to my earlier comment of building it right. If for a moment you think that data is aggregated for a fact table, you don't understand dimensional data warehousing. Fact tables should be built at the lowest level of detail available from the source data. Period. Aggregation is purely optional, stored in separate fact tables, and usually done for performance reasons.

Another issue is the “helper table.” This table is defined to be a dimension-to-dimension relationship Link.


A helper or bridge table sits between a fact and a dimension. It is not a dimension-to-dimension link. From a relational point of view, it is an associative entity used to resolve many-to-many relationships. Since a many-to-many relationship is the function of the relationships in the data, and not the modeling methodology, I fail to see how this is an issue with dimensional design.

Here's the thing. The data vault introduces a level of abstraction over what we would normally view as a data structure (a row and its columns). Fine, nothing wrong with that. And, if I was to implement a store and publish architecture for a data warehouse, it is a methodology well worth considering. But are data structures used in business really that dynamic that relatively minor maintainance issues matter? I haven't see that. What I have usually seen is large organizations spending years and countless millions of dollars updating their ERP. With timeframes like that, I wouldn't sweat the occasional schema change.

ngalemmo

Posts: 1732
Join date: 2009-05-15
Location: Los Angeles

View user profile http://aginity.com

Back to top Go down

The Type2 driven update statement

Post  thomaskejser on Tue Aug 30, 2011 11:58 am

The thing is, none of these are issues that cannot be resolved or handled in a dimensional model. There may be some simplicities that the data vault provides, but at what cost? It involves creating a separate and distinct environment purely to store data, with no guarantee that you may need to do some rework. Its a lot of effort to maybe save a few man weeks over a period of years because of unknown or unanticipated model changes.


...We are in agreement, ask Dan :-)

With regards to the Type2 changes, here is the scenario:

Let us say you have a some fact tables keyed to the customer dimension. You decide to do type2 change tracking on the customer city and type1 on the name.

At T(0) your world looks like this:

SK,CustomerID,Name,City,From, To
1, A, Thomas,London,2000, 2001
2, A, Thomas,Berlin,2002, 9999

Now, at T(1), you decide that you also want to do type2 tracking on customer education. This changes faster than City, so you now get a larger customer dimension with more rows, and more surrogate keys. For example this (using floats for SK, just for readability):

SK,Customer,Name,City,Education,From, To
1,A,Thomas,London,None,2000, 2001
1.1,A,Thomas,London,Master,2001, 2002
2,A,Thomas,Berlin,Master,2002, 9999

Here is the issue: the large fact tables that are keyed to The London (SK=1) now needs to have their SK updated to 1.1 if the transaction data is in 2001 and the education has changed to "master". This could potentially be a rather large update statement on several fact tables (travel the dependency tree here). If you are a PB sized installation and not careful about writing that update correctly, you could be looking at days, or even weeks until the change has been affected (DBAs get this type of stuff wrong all the time)

The alternative is to "fork off" the history. So you get this two-table "dimensional" structure

StableSK, Customer, Name
1,A, Thomas

HistorySK, StableSK, City, Education, From, To
42, 1, London, None, 2000, 2001
43, 1, London, Master, 2001, 2002
44, 1, Berlin, Master, 2002, 9999

If you key the fact on StableSK, you can change your mind as much as you like about the type2 changes, without incurring the big update. Of course, this means that the join between the history table and the fact table is much more expensive (since it is now a non-equi join, the BETWEEN I was referring to). I think that tradeoff is generally worth it, but the obvious issues about optimizer trouble of course rear their ugly heads here.


Last edited by thomaskejser on Tue Aug 30, 2011 12:02 pm; edited 2 times in total (Reason for editing : Minor spelling errors)

thomaskejser

Posts: 12
Join date: 2011-08-14

View user profile

Back to top Go down

Re: Data Vault v's Dimensional Model

Post  ngalemmo on Tue Aug 30, 2011 12:29 pm

The issue is retroactively adjusting history, not changing the conditions of a type 2. You can always change a type 2 and let things fall into place moving forward without a lot of effort.

And, yeah, if you had to retroactively change things, it would be a real pain in the neck. But, as yet, I haven't had to. And the idea of a stable surrogate key has been discussed before, but as an alternate key to a type 2, not as the PK (in conjunction with date) in one table rather than two.

I don't like the forking idea (no pun intended) of the static key as it impacts query performance for all time over a rare occurance of a one-time rekeying of the fact table to do retroactive history adjustments, assuming old history is even available from the source system.

ngalemmo

Posts: 1732
Join date: 2009-05-15
Location: Los Angeles

View user profile http://aginity.com

Back to top Go down

To fork or not to fork

Post  thomaskejser on Tue Aug 30, 2011 12:50 pm

ngalemmo wrote:
I don't like the forking idea (no pun intended) of the static key as it impacts query performance for all time over a rare occurance of a one-time rekeying of the fact table to do retroactive history adjustments, assuming old history is even available from the source system.


Still, a good pun :-)

I think it is a valid question to ask: "How large is that performance impact?". The re-keying is a real problem that I have experienced at scale. With badly written update statements, we are literally talking weeks of running, on even a "largish" 100TB system. You are amortizing a potentially VERY high pain over a lot of smaller user queries - and this may be a good idea, depending on the impact. But a well written UPDATE statement on an MPP should get you in the tens of GB/sec to the disks (especially if you are laid out well for sequential :-)

So the question is: is it worth giving up hash joins and trading them for loops (painful) and also: most column stores also struggle with this particular query pattern.

Note that by doing the forking trick, you only affect queries that need the type2 history, type1 queries can still go to the "StableSK" table.

I am also curious to know if any database vendors have a nifty way to perform this BETWEEN From/To join efficiently (there are clearly ways to get some of the hash performance here). Hope a DBA will chime in here.

thomaskejser

Posts: 12
Join date: 2011-08-14

View user profile

Back to top Go down

Re: Data Vault v's Dimensional Model

Post  hang on Tue Aug 30, 2011 2:08 pm

ngalemmo wrote:Can a dimensional EDW stand alone as the primary data store and historical record or is it necessary to create a repository from which data marts are published.

Hi Thomas, I suspect it is the intention of Master Data Services (MDS) in Denali (SQL Server). I have not got my hands on the new feature, but have seen powerful demo in PASS that you may incrementally add (publish) the new attributes to any dimension without developer doing any DDL or update on the table.

I have also seen the session by Joy Mundy explaining the role MDS can play to ensure data quality, proper metadata management etc. to streamline ETL process.

hang

Posts: 455
Join date: 2010-05-07
Location: Brisbane, Australia

View user profile

Back to top Go down

MDS and metadata

Post  thomaskejser on Tue Aug 30, 2011 2:20 pm

Hang

I think that Metadata and MDS are orthogonal to the "store and publish or just use star schema" architectural question. No matter which architecture you build, you will need both metadata and MDS to make it manageable at scale. Also, both Dan's Vault model and the star schema can be the recipients or semi-automated ETL flows auto generated from metadata. I have build meta frameworks that load both of them - and they are equally simply to load. Though the star schema lends itself better to parallelism because it has less key constraints to worry about. Of course, you can turn off key validation in the data vault, but that kind if defeats the purpose (and also leads optimizers astray).

All that being said, the notion that users can build their own hierarchies and add their own attributes to dimensions is a rather tempting idea and one you can see in Denali. There are cases where it makes sense for a user to have his own copy of a dimension table and then populate the attributes as they see fit (but maintaining the surrogate key). This of course creates some interesting indexing problems - but I think they are surmountable, even at scale.

thomaskejser

Posts: 12
Join date: 2011-08-14

View user profile

Back to top Go down

Re: Data Vault v's Dimensional Model

Post  hang on Tue Aug 30, 2011 3:20 pm

Thomas

Thanks for the comments. However MDS does have big implication on ETL process that has played a big part and determines the success of the dimensional modelling architecture. I guess with MDS in place, ETL process would become more manageable, or somehow automated as you put.

Could you please also clarify that the underlying data structure for MDS is more in line with DAN’s data vault model and quite different from dimensional model or any other traditional 3NF model per se.

hang

Posts: 455
Join date: 2010-05-07
Location: Brisbane, Australia

View user profile

Back to top Go down

MDM looks more like Vault - or does it?

Post  thomaskejser on Tue Aug 30, 2011 3:45 pm

Hi Hang

I agree that a Master data repository looks a lot more like vault at first glance. However, just because a schema has tables with start- and end dates does not make it a data vault - it just makes it a system that capture changes to tables (OLTP designers may have designed such "log tables" using triggers before).

If you are designing the schema for a master data management system - you generally want that schema to be extensible from the client to allow business users to model the world as they think it looks. This means that you are often building something that is somehow a form of name/value pair database - but which also should contain a materialized version of the same data into something that looks a lot like type2 dimensions so they are easy to query. Is that a Data Vault or just sound 3NF design?...

When I think about MDM, I think of it as one (maybe the only, if done right) source of dimension data - not facts. Because of that, it is a small database system that can take some liberties about the schema it uses (performance is less critical, though still important). But even small databases like that still have a lot of value - especially if the MDM knows how to handle keys properly.

Be aware that the MDM can actually make your life harder as an EDW designer: cleaning of data can take a long time if it is not done with good tuning. This means that the MDM can be "lagging behind" the sources from the perspective of the EDW (which may be connected directly to message busses and source extracts). This can lead to all sorts of interesting late/early arriving key problems. It is also not unusual to see the EDW solve data quality problems (because they need that solution NOW in their reports) before the rest of the organization is done doing vendor selection on a MDM system. You may have experienced the interesting political situations and discussions about who does the best job at handling keys and data cleaning - the EDW or the MDM?

With regards to key management: In my blog at http://blog.kejser.org I have described structure that I call "map tables" that are an important part of the ETL flow that I argue for. They look somewhat like Dan's satellites, but as you can see in some of the entires, they serve a differnet purpose and some interesting characteristics that help you manage key errors in the sources. Ideally, a MDM repository would supply tables that have that structure to the EDW - along with the cleaned source data of dimensions.


Last edited by thomaskejser on Tue Aug 30, 2011 3:52 pm; edited 1 time in total (Reason for editing : Clarifications)

thomaskejser

Posts: 12
Join date: 2011-08-14

View user profile

Back to top Go down

Re: Data Vault v's Dimensional Model

Post  dm_unseen on Wed Aug 31, 2011 1:13 am

Thomas,

Your "map tables" appear (in slightly different form) in a Data Vault in the Business (Rule) Data Vault section. Your alternate keys can be satistfied with keysats. Basically the same look and feel but slightly more flexible and because of auditability also usable for data quality feedback loops to OLTP source systems as well.

Another problem I have with all of your stuff is that most of it should handle bitemporality as well, else you will never be able to fix business rules without destroying your auditability. This goes for most of this disussion BTW.

Also when only the present is important in a DV you can always partition off the past, logically the past and present of the same info still form 1 entity (except for model changes). Physically splitting it and saying it is different is just doing physical optimizations on the database schema level.

However I do seriously doubt that your architecture will allow for an (small) incremental and flexible DWH implementation without serious rework. IMO it looks either still too much "big bang" which i think is one of the biggest issues of Kimball and Inmon "old style".

One of the interesting philisophical differences between Data Vault and kimball is that Data Vault tries to reconcilliate and align source and the DWH as much as possible whille Kimball tries to have "improved" or even "the best" data quite distinctly from the sources. Or to put it another way, Kimball polarizes your data between sources and DWH while DV tries to unify them. From a more holistic Data/information Architecture standpoint (not just DWH) I think unification is the better approach.


dm_unseen

Posts: 4
Join date: 2009-02-03
Location: Netherlands

View user profile

Back to top Go down

Re: Data Vault v's Dimensional Model

Post  ngalemmo on Wed Aug 31, 2011 7:01 am

Kimball polarizes your data between sources and DWH while DV tries to unify them.


Could you elaborate?

ngalemmo

Posts: 1732
Join date: 2009-05-15
Location: Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Data Vault v's Dimensional Model

Post  dm_unseen on Wed Aug 31, 2011 7:15 am

With Data Vault we start off with assuming the source data is correct and complete. If not we initiate feedback loops to improve DQ. We also don't delete source data because we want to measure DQ improvement over time. We also try to solely rely on source key integration. In a Data Vault we don't want to create new "truth" by processing data, we want to report on the facts given by the source. When this is not possible we will reluctantly add additional layers to cleanse/conform the data, but by doing that we create another truth/view on the data different from the source, hampering auditability and obscuring the relation between operational processes and the DWH data.

Kimball just states to ignore the source systems asap when data is made available through the EDW, so source system data actually becomes 2nd best data compared to the cleansed & conformed data in the EDW. For Kimball bad DQ is an argument for a DWH to take over instead of trying to improve the DQ of the sources.

IMO kimball tries to sit on top (of te sources) while Data vault tries to align with them.

dm_unseen

Posts: 4
Join date: 2009-02-03
Location: Netherlands

View user profile

Back to top Go down

Unifying the sources

Post  thomaskejser on Wed Aug 31, 2011 7:17 am

Hi Dan

Agree that the maps are very similar to satellites - though note that they are specifically handled per source, and that I see them as an add-on to the dimensional model, a part of the ETL flow, not as some new and revolutionary modeling technique.

I don't follow why data vault (or an Inmon 3NF) somehow unifies the source - putting every source into the same database (even if you model it using some fixed ruleset) is just another way of saying "stovepipe", it does not embrace the fact that sources ARE different and need cleaning and semantic realignment to be turned into usable data, which is where the big work happens and those 80% of the time is spend. My point is exactly that you should NOT model the source, because it is a waste of time. Instead, if you must, throw it into some cheap big piles (HADOOP is good, if not, use some cheap database engine). Just keep the source schema as it is, perhaps adding a little log table to each table to track changes (if the source system doesn't do that already). That is as close to agile as it comes (but no silver bullet - which I think we agree does not really exist). If the source schema changes, create a new table or modify the existing one - no big deal, no special model technique.

Temporal relations must be handled in any data load scenario, it depends on what you consider the "valid date" for a fact table and which dates you use as the basis of valid from/to in the type2/history tables. But this does not change the fact that you still need some agreed on, common, definition of validity that users can report on. It may not be the "truth", but is it the agreed upon version of it. Again, it is the ETL flow that create the reporting schema which must eventually handle this..

Paraphrasing a point made earlier in this forum: Just because you shove a lot of data into a common model does not mean you are dodging a "big bang" implementation or being any more structured about it. The Kimball style EDW (as opposed to using Kimball just for marts) is exactly characterized by the fact that you can gradually build it up, one fact and dimension table at a time. This means you can get off the ground quick and deliver business value, without engaging in an academic modeling exercise. From the "One version of the truth" requirement, it does not follow that "everything must be in one, unified and pre-defined data model"

thomaskejser

Posts: 12
Join date: 2011-08-14

View user profile

Back to top Go down

Re: Data Vault v's Dimensional Model

Post  dm_unseen on Wed Aug 31, 2011 7:53 am

Thomas,

Some small but interesting remarks (rest comes later):

  1. With Data Vault we like to use the phrase "One version of the Facts". The reason is that truth is subjective and will change over the course of the life of a DWH. It means that all interpretation (BR,cleansing conforming and calculating) is to be done after the Central DWH and not before. This is different from classical inmon style and Kimball but in line with the latest inmon architecture.
  2. Data Vault does try to conform, but just not upto the point of doing actual compex ETL stuff. That is reserved for downstream data marts or business data vault. If this leaves you with a pile of datajunk instead of a nice DWH, well.... whose data problem is that?
  3. The argument between using a Data Vault or an Historical Staging Area IMO depends also on DWH usage. We try to server ODS style reports (real time as well) from our central DWH so Data Vault works quite well there. IMO an HSA is the 2nd best option after a Data Vault (better than nothing;)
  4. Kimball works per fact and dimension table but Data Vault works incrementally per *source* table (actually a set of source tables). In a Data Vault Dimension design & conformation is optional (as is the whole Kimball Architecture).


PS I'm not sure if you are confusing me (dm_unseen AKA Martijn Evers) with Dan Linstedt. Although we both know eachother and Data Vault very well we are still different persons :). It appears I'm currently standing in for him (fine by me) which means you get 2 "DV heads" for the thread of one.

PPS I'll see you at the upcoming SQLBits? I'll be doing "Data Vault in a Day" precon there

dm_unseen

Posts: 4
Join date: 2009-02-03
Location: Netherlands

View user profile

Back to top Go down

Re: Data Vault v's Dimensional Model

Post  ngalemmo on Wed Aug 31, 2011 9:06 am

dm_unseen wrote:With Data Vault we start off with assuming the source data is correct and complete.


I don't assume anything. Well, actually not true... I assume its all bad and don't worry about it. I am concerned about accuracy, not correctness.

We also don't delete source data because we want to measure DQ improvement over time.


What does this have to do with the model? Sounds more like a data retention policy. There is nothing in a dimensional model that precludes retention of old data. Its a design choice, not a mandate.

We also try to solely rely on source key integration. In a Data Vault we don't want to create new "truth" by processing data, we want to report on the facts given by the source.


Again, not a modeling difference. There are load techniques that allow for soft integration of disparate source in a dimensional model. All source data is retained, facts are tied to the dimension rows from the same source and provisions are made to accomodate a combined view across systems. Also, the primary purpose of a data warehouse is to be accurate. Correctness comes later.

When this is not possible we will reluctantly add additional layers to cleanse/conform the data, but by doing that we create another truth/view on the data different from the source, hampering auditability and obscuring the relation between operational processes and the DWH data.




Kimball just states to ignore the source systems asap when data is made available through the EDW, so source system data actually becomes 2nd best data compared to the cleansed & conformed data in the EDW. For Kimball bad DQ is an argument for a DWH to take over instead of trying to improve the DQ of the sources.


I don't quite follow. If you have placed the data into an accurate data warehouse, why would you want to have BI users go back to the source? What is the point of building a data warehouse if not to serve as the primary source for tactical and strategic data analysis? It is not a quality issue.

You are implying that dimensional modeling somehow dictates where and when data cleansing should take place. While Ralph may or may not have expressed opinions on this matter (I don't know), they have nothing to do with the modeling technique. It is a process issue. You can implement DQ feedback loops, in-line cleansing, or whatever technique you feel is appropriate to the case at hand.


IMO kimball tries to sit on top (of te sources) while Data vault tries to align with them.


I don't understand what you mean by this.


ngalemmo

Posts: 1732
Join date: 2009-05-15
Location: Los Angeles

View user profile http://aginity.com

Back to top Go down

Page 3 of 4 Previous  1, 2, 3, 4  Next

View previous topic View next topic Back to top

- Similar topics

Permissions in this forum:
You cannot reply to topics in this forum