Using bitemporal dimensions to show corrections

bandrews Mon Mar 12, 2012 4:42 pm

In my organization, the business users want to be able to report historical and current views of a sales rep. This includes any corrections that were made at the source and corrections brought through the DW.
Example:
The sales rep has a row in the table where her name is Kelly Brown with a dim id =1, valid to and from dates of 1/1/2001 to 4/10/2002, a source timestamp of 01/01/2001, and a source replaced indicator of "N". She marries, and a new row is inserted with a dim id =2where her name is Kelli Smith with valid dates of 4/10/2002 to high values, a source timestamp of 04/10/2002, and a source replaced indicator of "N". . Then it is noticed that there was an error on the first row. Her first name is misspelled (Kelly instead of Kelli). The business wants this correction in the database for compliance reasons, so a row is inserted with with a dim id =3, name= Kelli Brown, with valid to and from dates of 1/1/2001 to 4/10/2002, a source timestamp of 01/01/2001, and a source replaced indicator of "Y". Because no new fact rows were written with this new row, no dim IDs are on the fact table corresponding to this row(Dim ID = 3). For reporting the business sometimes wants current look (dim ID = 2) sometimes wants original look (Dim ID = 1) and sometimes wants the corrected view (Dim ID =3). We are thinking od using the dim ID plus an insert timestamp in the fact. Any suggestions/comments from others who have had to incorporate dimensional modeling into the DW? Thank you

John Simon Mon Mar 12, 2012 6:27 pm

This is not the answer you were asking for, but I would be questioning what they are going to do with that information. Are they really going to make a buiness decision based on her name being spelt Kelli or Kelly? It seems unlikely. Essentially their requirement will make the solution more difficult to develop, test and maintain, add no real business value.

Regarding a technical solution, you could also include the natural key of the agent dimension in the fact table, and do a date range lookup.

ngalemmo Mon Mar 12, 2012 6:30 pm

What is the source timestamp supposed to represent? Since you already have effective and expiration dates, presumably from the source, why would you not store the timestamp of when the update actually took place?

As far as getting the 'corrected' view, the flag you are using makes such queries difficult to do. It is conceivable there may be multiple corrected views in some cases, all for the same period of time. Given a particular point in time for which you want to report the dimension (say as of 5/1/2001) it would be safe to assume that the users would want to see the corrected version or the original version if the row has never been corrected. The way you are setting the source replaced indicator would not allow that. If, however, you always set the flag to 'Y' when a row is inserted, and then, if the row is superseded, you set the old row's flag to 'N', then you can simply search for the flag = 'Y' and get either the corrected row or the original row if it has never been corrected. You can add another column, such as an original row flag, to allow you to exclude rows that have never been corrected when desired (original row = 'N' and source replaced = 'Y').

temporalcraig Tue Mar 13, 2012 4:08 am

This is a common requirement for environments that need to have a full audit trail of what you knew and when you knew it for compliance/regulatory purposes. It may also be required when you need the ability to reproduce report results AND create reports using the latest/corrected data AND understand reasons for changes between original report results and those using the latest data. Sometimes it is unacceptable to run a report of last month’s sales to people with the first name Kelly in the morning and get one result and run it again in the afternoon and get another.

I am guessing that name corrections/changes are just an example and corrections to other attributes might be more important (like security clearance or certification level).

Unfortunately there are not any simple solutions for fully supporting this requirement and you are correct that bitemporal is the term used in this problem space.

Out of the box dimensional modeling does not support this to the best of my knowledge (I have not seen any writings by Ralph Kimball himself on supporting bitemporal requirements). However there are techniques to augment standard dimensional modeling (like what I refer to as the 3 timestamp approach - see the “Bi Temporal - As At reporting,” thread in this forum) which can support bitemporal requirements.

It is also my understanding that for corrections to dimension data (or late arriving data) Kimball suggests going out to all of the fact tables that are related to the sales rep dimension and “destructively change” the FKs pointing to dim row #1 so that they point to dim row #3 (using your example). (see http://www.kimballgroup.com/html/articles_search/articles2000/0009IE.html?TrkID=IE2000_09). This can be extremely expensive and complex (in many cases prohibitively so) and even if you do this you now have the problem that the report you ran in the morning will be different than the report run in the afternoon (and it will be hard to give a good explanation to your boss). And of course you will not have a complete audit trail of what you knew when for compliance/regulatory purposes.

The good news is that extensions to ISO/ANSI SQL have just been published which support bitemporal data. And most of the major DBMS vendors are integrating support for these (or similar) extensions into their DBMS engines.

If you are interested in finding out more information about these extensions here is a link to a presentation by the primary author of them.

http://metadata-standards.org/Document-library/Documents-by-number/WG2-N1501-N1550/WG2_N1536_koa046-Temporal-features-in-SQL-standard.pdf

And here is a link to another presentation which gives some historical background and some more details on the ISO/ANSI SQL syntax.

http://docs.temporaldata.com/Temporal_extensions_to_SQL_2011_20120104.pdf

And if you REALLY want more information here is a link to a 7 part youtube video which gives a lot of general information about bitemporal data and details on certain aspects (including sample code for what I refer to as the 3 timestamp approach)

https://www.youtube.com/playlist?list=PL4CB3C8161D2804E6

Cheers,

craig

ngalemmo Tue Mar 13, 2012 2:40 pm

Out of the box dimensional modeling does not support this to the best of my knowledge (I have not seen any writings by Ralph Kimball himself on supporting bitemporal requirements). However there are techniques to augment standard dimensional modeling (like what I refer to as the 3 timestamp approach - see the “Bi Temporal - As At reporting,” thread in this forum) which can support bitemporal requirements.

I think this is a somewhat misleading and unfair statement. While various books and writings on Dimensional Modeling use examples which are fleshed out models with attributes and measures, there is nothing about Dimensional Modeling that dictates what those attributes and measures should be. Supporting bi-temporal capabilities within a dimensional model is a matter of what attributes you place in a dimension table and/or dimensions in a fact table, it has nothing to do with the dimensional model structure or modeling methodology.

So, yes, dimensional modeling does support bi-temporal requirements. Just like any other data modeling methodology. No data modeling technique dictates content within the model.

As far as late arriving dimensions go, you miss the point of what was being discussed. One of the foundations of a dimension model is the notion of surrogate keys and the use of those keys to relate fact rows with dimension rows. The problem with late arriving dimensions is a fact may have already been processed and references a dimension row. Because of the late data, it needs to reference another row so it correctly references the dimension data it should have been referencing in the first place. Such a correction is viewed in the light of such an occurrence being unusual and rare and that the fix is to correct it. If such occurrences are not unusual or rare, then it is an issue that needs to be dealt with by other means.

temporalcraig Wed Mar 14, 2012 2:21 am

I hope that folks understood that when I say”out of the box” and “standard” dimensional modeling I am referring to dimensional modeling as articulated and taught by Ralph Kimball. So I do not think it is misleading or unfair to say that it does not support bitemporal requirements unless someone can point me to some writings that I have missed (and I will be grateful for such!).

I have just provided a link to an article (and I can supply others) where the instructions for how to deal with late arriving data and corrections not only does not support bitemporal requirements but is also impractical to use in many cases even though it only supports less demanding requirements. We need the equivalent for how to support bitemporal requirements.

Dimensional Modeling as taught by Ralph Kimball does dictate that certain attributes should be used and how they need to be maintained. It may give you different options, but there is a finite set of approaches for how you design your fact and dimension tables etc. These approaches use specific attributes and attribute behaviors etc. That is what is so great about Ralph, he lays out big concepts and he backs them up with “here is the SQL".

Supporting bitemporal requirements with a dimensional model is more than a matter of what attributes you place on a dimension or fact table but that is part of it. When you have specified what those attributes are, what their constraints are and how they need to be maintained you then have a modeling methodology for doing this. If you do not have this you do not have a modeling methodology that supports it. So it has everything to do with modeling methodology.

Again, if dimensional modeling as taught by Ralph Kimball (or others) supports bitemporal requirements I think that a lot of folks would appreciate reviewing any documentation regarding this. Please remember that not precluding does not equal support. If we can review the pros/cons of these approaches I think it will be very helpful to people including the original poster. If not then I think we can agree that dimensional modeling as prescribed by Ralph Kimball (and others?) does not support bitemporal requirements.

Re late arriving dimensions, I do not see how I missed the point being discussed but I do find it interesting that dimensional modeling now has a foundation based on dictated attributes (sequential keys and FKs to them) when previously it was stated that “there is nothing about dimensional modeling that dictates what those attributes and measures should be”. The inability of dimensional modeling as taught by Ralph Kimball to provide repeatable query results or a history of changes/corrections when dealing with late arriving data (and corrections – which was the point being discussed) is not addressed by saying that we are fixing data that was incorrect or that late/incorrect data is sometimes rare.

But this is mostly a distraction; I think the point was to see how bitemporal requirements can be supported. So I suggest we get back to that as I think it is an interesting and important topic and the point of the original post. And I am looking forward to that documentation of dimensional modeling’s support for bitemporal requirements! (as prescribed by Ralph or otherwise)

John Simon Wed Mar 14, 2012 3:08 am

In Ralph's Dimensional Modeling in Depth training course, it is described how to show all changes - use SCD Type 2, or if there are SCD Type 1 attributes, how to model this is described using Date range lookups.

ngalemmo Wed Mar 14, 2012 3:21 am

I am not saying it cannot be done, nor am I saying it should not be done. What I am saying is as written it was presented as a use case with a resolution based on the particular case. Because a specific example of another use case is not presented, doesn't mean that others should not exist. So, to extend the art to show patterns for dealing with bi-temporal cases is all fine and good. But don't say its 'not supported'.

Anyway, the basic method that was being demonstrated in the example is you need to update the fact table to point the the appropriate dimension row, a row you added to reflect the change. Well, the same thing in a bi-temporal situation except you add two rows. One dimension row for future fact rows (the current row) with appropriate attribute and time values, then a second row with the attributes from the original row, but date attributes modified to reflect when the change occurred. You repoint selected fact rows to the second new row.

So, a small change to what Kimball discusses can cover a bi-temporal situation as well. It becomes a matter of redirecting specific facts to a new context. But it is messy and not done lightly. Other techniques, using deltas, can be more eligant.

As far as surrogate keys go (not "(sequential keys and FKs to them)"), the keys simply provide a means to associate a row in the fact table (FK) to a row in the dimension table (PK). It is what you put in the WHERE clause. The content of the query/report is based solely on the content (attributes and measures) of these related rows. There is no meaning in a surrogate key value, it is not 'sequential'. In fact, there is an error in the article, where it reads "why the surrogate keys for all dimensions, except time, cannot be ordered in any way" is incorrect, this was written in 2000, times change. A surrogate key should never be used to sort anything, even if it is a key to a time dimension. Database systems exist today that cannot guarantee sequential assignment of surrogate key values. There is no reason to expect a surrogate key to represent anything other than a link between rows. Chronology should be determined by attributes in the time dimension. (If you follow this basic rule, you can do some cool tricks with the environment using shadow dimensions).

As for the rest of it, facts provide measures, dimensions provide context, the fact FKs place the measures in context, there is a many-to-one relationship from fact to dimension, use a bridge for many-to-many relationships, there is no (active) relationship between dimension tables, one grain to a fact table, and try to keep your measures fully additive. I think that is about it.

temporalcraig Wed Mar 14, 2012 9:46 pm

John – Type 2 dimensions as described in any Kimball writings I have seen do not support showing all changes. One simple test is that they would need to allow overlapping effective/valid date ranges for the same object would need to be supported to do so.

Ngalemmo – Let me try this again, avoiding the words “not supported”.

Support for bitemporal requirements can be achieved using dimensional modeling. It just so happens that Ralph Kimball has never written about how to specifically do this (to my knowledge) so we are unsure of what he thinks about any approach for doing this. Robust support for corrections and late arriving data are strengths of bitemporal data and bitemporal requirements are usually driven by this.

The only approach that Dr Kimball has given us for dealing with corrections and late arriving data is the one referenced in the article I sited (the same approach is reiterated in his books and subsequent articles). This approach not only does not support bitemporal requirements but it is also impractical to use in many cases even though it only supports less demanding requirements.

The most common way to extend dimensional modeling as taught by Dr Kimball to support bitemporal requirements (to the best of my knowledge) is what I refer to as the 3 timestamp approach which I describe in the “Bi Temporal – As At reporting,” thread in this forum. There seems to be little formal documentation of this approach especially in the context of dimensional modeling. It is unclear who if anyone advocates this approach and if so under what conditions (small volumes, acceptable query response times, only for a limited number of tables etc.).

If there was some formal write up or article on the 3 timestamp approach (or any other approach) for supporting bitemporal requirements in a dimensional modeling context it seems certain that it might be helpful. But alas this does not seem to be forthcoming.

How is that?

Re “well the same thing in a bi-temporal situation except you add two rows”

I am pretty well schooled in the different approaches to supporting bitemporal requirements and I am not sure what approach this is describing. If it is the 3 timestamp approach only 1 new dimension row is necessary, if it is the 4 timestamp approach 1 or 2 or 3 new rows might be necessary (as I describe on slide 18 of the previously referenced presentation on the new iso/ansi syntax). At a minimum this trivializes the impact of whatever approach is being suggested especially since query considerations are not mentioned.

Re, “you repoint selected fact rows to the new row”; I do not see this being necessary as the type 1 FKs should be used in joins as appropriately suggested in the “Bi Temporal – As At reporting,” thread.

I do hope I am not off base because I do not understand something here.

But my main point on this is that where is the documentation? Where is the article/white paper? And that is why I make the point about Ralph Kimball not supporting bitemporal requirements, because when he talks about how to address an issue he gives a solution and stands behind it. I do not see this for any solution for how to support bitemporal requirements with dimensional modeling.

Re “a small change to what Kimball discusses can cover bi-temporal situation as well”; I do have to say that I consider this a mischaracterization.

When I said “sequential” I meant “surrogate” I am sorry to have caused that tangent.

cb

ngalemmo Wed Mar 14, 2012 11:05 pm

Look at the basic premises inherent in a dimensional model:

1. The attributes of a dimension provide context to the measures in a fact table.
2. Bi-temporal considerations require certain attributes (timestamps, etc...) be maintained on said dimension.
3. A fact FK always references a specific row. That row represents either the current view of the attributes (a Type 1 dimension) or the view that existed at the point of time of the transaction (Type 2).

In a dimensional treatment, the context is represented in the dimension row. If there was a late arriving change to a row in a type 2 dimension, the contexts that this row must represent are now three: the original version of the row, which would be referenced by facts that were in the system prior to this change taking effect; the new version of the row, would be used by all future facts and represent the current state; the corrected version of the row with appropriate date settings to clarify characteristics of the change, fact rows that were affected by the change would reference this row.

It doesn't matter in a type 1 dimension as the row always represents current state... no history.

As mentioned there are ways to effect this in a fact table that does not require actually updating the fact table using an insert only delta load.

If there is a need to restate facts as of a point in time, you would self-join on the dimension to locate the context you need.

Now, there have been discussions to supplement a type 2 dimension with a type 1 key, and carry both in a fact table. The sole purpose of this is to eliminate the need for the self join. You do not want to store the type 1 key alone because you lose the precise context that one implements a type 2 for... the point-in-time context of the measure. A type 2 key does not leave this up to interpretation.

Bottom line is queries to a dimensional model are guided by the attributes used to provide context to a fact. You are free to define whatever attributes you want to support the functionality you wish to implement (such as the 3 timestamp idea).

And as far as who writes what, I am sure Ralph is spending a lot more time with his wife. After all, nobody is getting any younger.

temporalcraig Fri Mar 16, 2012 8:14 am

Interesting.

I think this approach might be the 3 timestamp approach with a twist and you do the combination of max and <= on the self join when you want to “restate facts as of a point in time”.

But I am still not sure I completely understand. (Especially the “appropriate date settings to clarify characteristics of the change” for the corrected version of the row)

Is this the way the dimension rows would look AFTER late arriving dimension data that was received on 1/5/2011 (I use day level of granularity for simplicity purposes and close-open convention for date ranges)?

Original row (fact rows tied to time dimension rows for 1/1/2011 and 1/2/2011 are linked to this)
Type 2 key = 6
Type 1 key = 2
Effective State Date = 1/1/2011
Effective End Date = 12/31/9999
Insert Date = 1/1/2011
Current Flag = N
Credit Rating = A

New Row (fact rows tied to time dimension rows for 1/5/2011 onward are linked to this)
Type 2 key = 7
Type 1 key = 2
Effective State Date = 1/3/2011
Effective End Date = 12/31/9999
Insert Date = 1/5/2011
Current Flag = Y
Credit Rating = B

Corrected Row (fact rows tied to time dimension rows for 1/3/2011 and 1/4/2011 are linked to this)
Type 2 key = 8
Type 1 key = 2
Effective State Date = 1/1/2011
Effective End Date = 12/31/9999
Insert Date = 1/5/2011
Current Flag = N
Credit Rating = B

I am pretty sure that I have some valid start/end dates wrong here… but I start with this because a pure 3 timestamp approach just needs the first 2 rows.

I am very keen to understand this so I appreciate the feedback.

And very minor correction I think:

“the contexts that this DIMENSION must represent are now three”

craig

temporalcraig Sun Mar 18, 2012 11:15 pm

I still can not understand the need for/benefit of 2 new rows to support bitemporal requirements when late arriving dimension data is received for an object (or there is a correction) as described by ngalemmo

I do hope we get some clarification on that (as the 3 timestamp approach just requires one new row).

But in the meantime I will give a more robust example and illustrate how bitemporal requirements for the example are supported using the 3 timestamp approach (first using just a dimension table and then adding a fact table and a typical dimension table joined to fact table query).

Following is a link to a page that describes 5 events impacting the dimension table in the example.

http://bitemporalmodeling.com/a-temporal-example/

And following is a link to a page that includes an animation of how at attribute (credit rating) in the dimension table in the example is impacted by the 5 events.

http://bitemporalmodeling.com/bitemporal-support-for-a-temporal-example/

I just ran out of steam/time…. But if there is any interest as a next step I can provide a table layout for the dimension table in the example, how data should look in the table after the 5 events have been processed and a sample bitemporal query against the dimension table.

And ultimately I can show an alternative approach that does not require 1) updating ALL of the fact tables in response to late arriving date/corrections and 2) an expensive combination of max and <= on queries.

craig

ngalemmo Mon Mar 19, 2012 11:32 am

You can do it with one. It all depends if you want to distinguish between facts that have been restated and facts that have not. The dimension row could have an attribute to indicate a restatement.

Sponsored content

Using bitemporal dimensions to show corrections

Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections

Re: Using bitemporal dimensions to show corrections