Kimball Forum
Would you like to react to this message? Create an account in a few clicks or log in to continue.

Staging and Highly Coupled Columns

3 posters

Go down

Staging and Highly Coupled Columns Empty Staging and Highly Coupled Columns

Post  TimothyR Thu Oct 18, 2012 8:59 pm

This is my first post, so please bare with me.

Synopsis:
For most of our Landing/Staging Tables we have a series of what I call "Highly Coupled" Columns where the first stores a date while the second stores a code which represents how that date was derived within the DW. Due to the ETL which originally updates these columns and subsequent refreshes, analysis has shown that these columns are getting out of sync for 2 to 3 percent of the rows (this percentage various from table to table), thus the Code Column no longer reflects the correct representation of the Date Column.

Questions:
1. Is there an industry term for this kind of coupling?
2. Is there an DW Principle of which I can quote or refer to, to back up my argument that we need to take action to correct this situation and improve our Data Quality.

Many thanks.

TimothyR

Posts : 4
Join date : 2012-10-18

Back to top Go down

Staging and Highly Coupled Columns Empty Re: Staging and Highly Coupled Columns

Post  ngalemmo Fri Oct 19, 2012 2:40 am

What do you mean by "code which represents how that date was derived within the DW"? Why is the DW deriving dates?
ngalemmo
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

http://aginity.com

Back to top Go down

Staging and Highly Coupled Columns Empty Re: Staging and Highly Coupled Columns

Post  BoxesAndLines Fri Oct 19, 2012 3:45 pm

In normalization terms, this is a violation of third normal form. The pain you are are feeling is why in application databases violating 3NF is a bad idea. Fixing it is a matter of how important the data is to the business.
BoxesAndLines
BoxesAndLines

Posts : 1212
Join date : 2009-02-03
Location : USA

Back to top Go down

Staging and Highly Coupled Columns Empty As to why dates are derived within our EDW

Post  TimothyR Sun Oct 21, 2012 11:56 pm

Within our EDW, these crucial dates are utilised for Temporisation. As all rows require a date to provide the temporal context, it maybe required to set it to a default date when the source data is unable to provide a valid date. For example, many sources contain a Business relevant date which is then used to satisfy our "Valid" Time Line. When either a date is not available or the date supplied is an invalid date then a default date is used. The Code column simply indicates whether the date was obtained from the source or a default was used.

TimothyR

Posts : 4
Join date : 2012-10-18

Back to top Go down

Staging and Highly Coupled Columns Empty Re: Staging and Highly Coupled Columns

Post  ngalemmo Mon Oct 22, 2012 12:13 am

1. No, there isn't an industry term I can think of off hand. It is not unusual. You have a flag that goes along with the date.
2. I don't understand what you mean by 'out of sync'. Since the code can be one or the other, how would you know what is 'in sync' and what isn't? I am assuming that the 'default' date could be the system current timestamp, rather than a fixed value. If you are seeing things like nulls for what should be a valid date, it sounds like you need to debug your ETL process.
ngalemmo
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

http://aginity.com

Back to top Go down

Staging and Highly Coupled Columns Empty Debugging ETL Process and Data Accuracy issues

Post  TimothyR Thu Oct 25, 2012 11:25 pm

We are in the process of debugging the ETL Process and rectifying the code. This still leaves the issue of the lack of Data Accuracy in the physical Tables which hold this data. When I say "out of sync" what I mean is that in the case where the Business Date is used, the flag would indicate the Default Date was used and vice-versa. The flag doesn't reflect the "True" situation and we have a lack of Data Accuracy.

In our organisation in relation to our EDW, different people have different views of what "Data Quality" is or means. Many people believe that we should be happy with 99% or 98% Data Accuracy within the Warehouse and only then when it impacts the final presentation layer. Therefore, if you only have say 90% Data Accuracy for a particular Data Item within Landing or Staging Table, then as long as the end user reports aren't affected then we can tolerate the 90% Data Accuracy. My philosophy is, if you go to the trouble of physically storing the data for later reference, then you should be concerned about Data Accuracy and Data Quality at all stages within your Warehouse.

What I am seeking, is some principle to back up my argument, to go back and fix it all up.

TimothyR

Posts : 4
Join date : 2012-10-18

Back to top Go down

Staging and Highly Coupled Columns Empty Re: Staging and Highly Coupled Columns

Post  ngalemmo Fri Oct 26, 2012 12:01 pm

Data Quality is about correctness, not accuracy. It is addressed in the operational systems, not the data warehouse.

The data warehouse should always be 100% accurate. What this means is, if given a set of source data, it accurately reflects that data in the DW. In other words, the load process doesn't have bugs and does not distort the data it has received.
ngalemmo
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

http://aginity.com

Back to top Go down

Staging and Highly Coupled Columns Empty Do no Harm

Post  TimothyR Tue Nov 06, 2012 3:13 am

It is this very statement "The data warehouse should always be 100% accurate." which I wish to sell or convince people of. My philosophy is "Do no Harm". I am up against a lot of resistance when attempting to address these very Data Accuracy issues within our EDW. There are many aspects of our EDW where the Source System Data is distorted (or misrepresented) somehow.

I am not necessarily a Warehousing expert. My expertise comes from my many years working in the Management Information Arena. I work with a lot of people who claim to be Warehousing experts (many are contractors/consultants) and they seem to be content with the EDW having 98% accuracy (of the operational system data holdings).

I would be interested in any documented Data Warehousing Principle or Water Paper that I could use to prove my point.

TimothyR

Posts : 4
Join date : 2012-10-18

Back to top Go down

Staging and Highly Coupled Columns Empty Re: Staging and Highly Coupled Columns

Post  ngalemmo Tue Nov 06, 2012 4:13 am

Accurate doesn't mean correct. It means that whatever the data that is being loaded, no information is lost or distorted. It would seem that that is an easily attainable goal.

Accuracy issues, if any, should be derived values, otherwise know as the application of business rules to the data. This rules may change, and may change fairly often. Or they are simply not well defined. This could lead to the appearance of inaccuracy.

The trick is to retain the original raw data with the derived values. You may not expose the raw value to end-users (you could if they are needed), but you retain them so that derived values can be restated should the business definition change.

Latency is also an issue sometimes. If the extracts from the source are performed while other update activity is taking place, the extract may miss the results of update queries running at the same time. If you were to compare the operational data to the stored data warehouse date for the same percieved moment in time, the query will see different values because the missed updates would have completed and be visible to new transactions. This would cause inaccuracy depending on the process. If you are using time based filtering (extract anything update/new since the last extract), the missing updates could be lost if there is no mechanism to recover them the next load.

There are a variety of ways to avoid this. You can use a push architecture where the operational system delivers logs or extracts. If done correctly, it would insure the data warehouse gets an accurate history (This is typical of ERP systems). The other is to lag the extract using a pull architecture. If an extract is started, it would look for transactions that completed some time in the past, like 1 minute ago. The amount of time you need to lag depends on the kind of update activity that is taking place.
ngalemmo
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

http://aginity.com

Back to top Go down

Staging and Highly Coupled Columns Empty Re: Staging and Highly Coupled Columns

Post  Sponsored content


Sponsored content


Back to top Go down

Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum