Dimension loading sources

bmoraillon Sun Jun 06, 2010 9:13 am

Hi all,

It's about datawarehouse dimension loading :
should i load dimension from a master data management referential or from the truth of my business application.
In the first case, dimension will be cleaner than in the second case but it will not reflect the truth.

Many thanks and best regards.

BoxesAndLines Mon Jun 07, 2010 11:17 am

If your MDM does not reflect the truth, then it's not really a MDM repository. MDM data usually contains more information than an application data store (e.g. integrated and cleansed customers). For that reason it is a nice source for the data warehouse.

ngalemmo Mon Jun 07, 2010 12:07 pm

It's really a question for the business and chances are the answer will be a bit of both.

One thing for certain is you need to retain the source system identity in the dimension tables. This means, if you have two sources for customer, for example, there should be dimension rows for each customer from each source with the appropriate source natural key. The MDM should provide some form of cross reference between source keys and the MDM's own internal identifier if it has one. The MDM would be used to update attributes from it for each dimension row it applies to. If it is based on the MDM's internal key, then you should store that as an alternate key for the table.

bmoraillon Tue Jun 08, 2010 2:49 am

here is the two case :
Case 1 :
Dimension loading sources Case111

or

Case 2 :
Dimension loading sources Case211

Many thanks.

ngalemmo Tue Jun 08, 2010 12:13 pm

If the MDM feed provides both the source system identifier and the source system key, you can get away with option 2. Otherwise you need a third option, where dimensions originate from the source system with an appropriate source identifier and the source key, then use the MDM feed to modify attributes in the dimensions.

bmoraillon Tue Jun 08, 2010 1:13 pm

Many thanks for your answers.

bmoraillon Tue Jun 08, 2010 3:16 pm

@ ngalemmo : why does the dimension source natural key must be retrieved ?
if i understand, i'll have to generate a surogate key for each rows of my dimension
but also store the source id as :
surrogate key | name | source key
1 customer1 234
2 customer2 4096

can you tell me if i'm right and why do i have to store source key.
Many thanks.

ngalemmo Tue Jun 08, 2010 5:11 pm

Rows always have a surrogate primary key. By 'identity' I am referring to how you identify the dimension row when doing lookups to get the surrogate key.

If you are consolidating dimensions from different source systems, it is important to maintain a row for each instance from each source. To do so, your natural key should be a combination of the system's identifier (such as customer number) and a code indicatating the source system. These two values ensure unique natural keys.

If your MDM combines customers from different systems and assigns is own identifier, you can use that identifer as an alternate natural key. You should not use that identifier as the natural key because the association of two disparate customer rows by an MDM system is not stable.

VHF Mon Jun 14, 2010 2:56 pm

ngalemmo wrote:If you are consolidating dimensions from different source systems, it is important to maintain a row for each instance from each source.

I’d like to explore this a little further, as I will soon need to load data from multiple ERP source systems. There will be significant overlap in dimension data between the two system. For example, most customers will exist in both systems (with 2010 sales being recorded in System1 and 2011 sales being recorded in System2.)

I’ve been considering two design options for the customer dimension:

Option 1:

SK	System1_Customer_No	System2_Customer_No	Customer_Name
1	12345	10076543	XYZ, Inc.

Option 2:

SK	Source_System	Customer_No	Customer_Name
1	System1	12345	XYZ, Inc.
2	System2	10076543	XYZ, Inc.

(Where Source_System + Customer_No is the natural key.)

I understand that with the second option we can still roll up facts from both source systems to a single customer based on Customer_Name, but we now have two current records in the dimension representing a single customer. I realize option 1 doesn’t work if there are many source systems, but with only two (or a few) source systems doesn’t this provide a simpler model (i.e. a single customer represented by a single dimension record)?

Why is it beneficial to maintain a row for each instance from each source?

ngalemmo Mon Jun 14, 2010 3:10 pm

Option 1 also doesn't work if there is an error in the cross-reference. You are faced with the problem of sorting out which facts refered to which customer after you get the xref errors resolved.

The main benefit is that it is the source transaction system that is telling you who it is dealing with. By maintaining separate rows for each instance, you retain that relationship. Any relationship between a customer in one system and a customer in another system is purely conjecture and is subject to change. By retaining a customer's source identity, you always relate the facts to the correct customer. How one interprets/integrates that customer is a matter of the attributes, which can be easily changed if necessary as the interpretation changes.

VHF Mon Jun 14, 2010 3:33 pm

However, I need to be able to correlate customer records from both source systems in order to do year-over-year reporting by customer.

In my particular case, there will a cut-over from one ERP system to another (which might be a little different scenario than two systems that remain on-line simultaneously.)

System1 is scheduled to record sales through the end of 2010. System2 will begin recording sales transactions starting in 2011. I need to be able to consolidate sales facts by customer across years (and thus across originating source systems.)

When System2 goes live, I will be provided with a mapping from old to new customer numbers, so this would allow me to populate the [System2_Customer_No] field for existing customers in my Option 1 example. (New customers that are created in 2011 will exist only in System2 and would be created in the customer dimension during ETL from System2. Obviously they will not have a [System1_Customer_No].)

Given (a.) the need to report by customer across years (i.e. across source systems) and (b.) the availability of a customer number mapping between systems, should I still create a separate row for each customer for each source system? If so, is there a better way to correlate the customer records that represent a single customer than relying on the [Customer_Name] attribute?

ngalemmo Mon Jun 14, 2010 3:45 pm

In any dimensional query, data is integrated based on the value of attributes, not keys. So, to answer your question with another question... what attributes would you need to consolidate the data?

Answer that question and add the attributes to the table. For example, you may want to add an attribute that contains the new system's identifier. This can be set/updated based on a cross-reference. It can also be used to update old system rows with new system data.

VHF Mon Jun 14, 2010 4:26 pm

ngalemmo wrote:In any dimensional query, data is integrated based on the value of attributes, not keys.

Excellent point! I need to ask my business users which customer number they expect to see on reports once we have cut-over to ERP System2.

I am almost convinced that Option 2 (separate records) is preferable, but let me throw out one more argument in favor of the Option 1 (single record) approach:

My customer dimension contains additional attributes, all of which are SCD1. Some do change occasionally, such as [Sales_Territory]. All reporting (including reporting of facts for prior years) is done using current territory assignments.*

If I have separate dimension records (as in Option 2), and assuming that System1 doesn't send any updates after 2010, a territory change that occurs in 2011 will be reflected only in the facts that relate to the the new System2 record for the customer. But correct SCD1 behavior is that all facts should reflect the attribute change.

SK	Source_System	Customer_No	Customer_Name	Sales_Territory
1	System1	12345	XYZ, Inc.	100
2	System2	10076543	XYZ, Inc.	200

(After SCD1 change arrives from System2 changing XYZ, Inc. from sales territory 100 to 200.)

With Option 1, the change will be automaticlly reflected in all reporting for the customer (including fact records that were derived from System1.)

SK	System1_Customer_No	System2_Customer_No	Customer_Name	Sales_Territory
1	12345	10076543	XYZ, Inc.	200

(After SCD1 change arrives from System2 changing XYZ, Inc. from sales territory 100 to 200.)

So I guess I'm still sitting on the fence! I suppose that with Option 2 (multiple records) I could have the ETL process update SCD1 attributes in all dimension records that exist for a given customer, but I like the simplicity of the Opton 1 design (single dimension record per customer.) However, I don't want to choose a poor design.

*I suspect that someday someone will ask for a report based on historical territory assignments, so I am considering also adding a SCD2 customer dimension, but that is a different discussion!

ngalemmo Mon Jun 14, 2010 6:08 pm

Add an 'update natural key' attribute. This would be a non-unique alternate key different from the identity natural key. It can be populated from the cross reference. Use that key to update rows. It's not a whole lot different than if you have a type 2 dimension where some of the attributes need to be treated as type 1. You simply update more than one row each time.

It's true option 1 is simpler (provided the cross reference doesn't change), until you need to consider a third source. But, option 2 really isn't that complicated either, and it will handle any sort of future scenario gracefully and well as being tolerant of potential errors .

BoxesAndLines Mon Jun 14, 2010 10:46 pm

I would strive to present an integrated view of the customer if at all possible. This can be done with many of the data quality tools available on the market today. This would be closer to option 1 without the redundant customer columns. Pick the appropriate column from each source based on defined survivorship rules. You will also need a table that looks like option 2 to manage the association and consolidation steps. As ngalemmo pointed out, you need a path back to where you came from. The value of this is simple, a single version of truth. If your customer wants to know the sales territory of customer xyz inc., what is the answer? With 2 records in your customer dim for the same customer with different answers, you can't really say for sure. If you try grouping sales/revenue by territory, the results could even be messier.

Also, don't be so quick to throw out source identifiers. Often, these keys can link customers across the organization.

ngalemmo Tue Jun 15, 2010 12:05 pm

BoxesAndLines wrote: With 2 records in your customer dim for the same customer with different answers, you can't really say for sure.

With what I was suggesting with the update key, this would not be the case. When it is determined that two or more rows are, in fact, the same customer, they would all be assigned the same update key appropriate to the data source from which updates are received. All rows with the same update key would contain the same attribute values. This is particularly useful when you are dealing with multiple source systems as well as MDM systems. (You may, in fact, have multiple update keys if different sets of attributes need to be maintained from different sources).

The basic problem, when you are dealing with multiple simultaneous sources and/or an MDM system is that customer associations are subject to change. People make mistakes when they define cross references and MDM systems associate based on rules and statistical inference. None of these methods are 100% correct at any time. If your dimension is designed to store one and only one row for each presumed customer, any correction to that presumtion becomes very difficult to undo for facts already loaded. By maintaining a row for each unique source system identity, the fact associations from that source based on that identity is always correct. You can then assign and modify a presumed identity at will based on cross references or MDM associations without loss of the true identity to which the facts are associated.

BoxesAndLines Tue Jun 15, 2010 1:39 pm

Party (customer) changes are a pain. I address party changes by not adding the Party dimension FK to the fact tables. I manage the relationships outside of the fact table which allows me to make Party changes without ever impacting the facts. Appropriate indexing and partitioning make the performance impact nominal.

VHF Tue Jun 15, 2010 2:43 pm

ngalemmo wrote:...and MDM systems associate based on rules and statistical inference.

Now I think I am starting to understand the underlying reasoning behind using separate records. If you have hundreds of thousands of customer records from a variety of sources, an MDM system would be making a best-guess decision as to whether “F. Rogers” is the same entity as “Fred Rogers”. And it might change its mind later based on additional information. In that scenario, separate dimension records make a lot sense.

In my situation I have only a few thousand customer records. These are manually entered and maintained in the ERP systems, and the data is pretty clean. I expect the correlation of customers between System1 and System2 to be pretty solid, although it is certainly possible that a mistake could be made. With <10K customers and <10M facts, I think updating fact records to correct a mistake would be an acceptable solution (not the case with a truly large DW.)

Both approaches seem viable for my situation. I want to ensure that users can never drill down to the point where a customer is “split” into multiple entities. With option 1 (single dimension record per customer) that is a given. With option 2 (separate dimension record for each customer for each source system) it should still be OK as long as (a.) all the records for a customer share the same values for all user-accessible attributes and (b.) the BI tools don’t expose the customer records as individual entities.

Sponsored content

Dimension loading sources

Dimension loading sources

Re: Dimension loading sources

Re: Dimension loading sources

Choice schema

Re: Dimension loading sources

Thanks

simple question

Re: Dimension loading sources

Re: Dimension loading sources

Re: Dimension loading sources

Re: Dimension loading sources

Re: Dimension loading sources

Re: Dimension loading sources

Re: Dimension loading sources

Re: Dimension loading sources

Re: Dimension loading sources

Re: Dimension loading sources

Re: Dimension loading sources

Re: Dimension loading sources