Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

View previous topic View next topic Go down

Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  pzajkowski on Tue Jul 13, 2010 10:34 am

I recently read an interesting discussion within the Kimball forums about "Dealing with Duplicate Dimension Rows" wherein a regularly active participant of these forums, ngalemmo, made a couple of statements that shed new light on the concepts of dimensional data warehousing that hadn't occurred to me.

Essentially, he states that the "most important thing is to realize that integration and reporting off a dimensional structure has nothing to do with keys and everything to do with attributes." (To my own credit, it was just the other day that I had come to the same conclusion when I discovered that my colleagues were writing SQL queries with embedded dimensional keys rather than the attributes - yikes!)

But, it's a subsequent statement that ngalemmo made that has forced me to re-assess my assumptions about dimensional models: "do not rely on pre-process consolidation, but rather maintain the identity of the row to its source systems." This statement forced me to realize that I'd been assuming that for a given entity in a dimension, only one key should be "current" for that given entity. So, what does this mean about using a Master Person Index (MPI) - aka Enterprise Person Index (EPI) - which often functions by consolidating data to form a new unique identifier that crosswalks to the original source system identifiers.

I work with healthcare data where we rely heavily on patient identification between a vast array of data sources. This MPI for patients won't be going away since it's tied to an outside vendor that assists in the production of our healthcare portal. Nonetheless, I'm wondering if an MPI goes against the grain of a dimensional structure?

One other thing. If a best practice approach to populating/maintaining a dimension is to ensure that each source system key has its own dimensional key, and that reporting off of such a structure is dependent on attributes, doesn't a process still need to be in place that will "clean up" the attributes so that an appropriate set of dimensional rows are accessed? For example, if a customer from two different source systems is the same person, but with different spellings of the lastname (McDonald vs Mc Donald), wouldn't a process need to exist to ensure the lastname attribute is the same so that the two rows in Dim_Customer are leveraged if needing to query/report on all activities of this specific customer?

I'll expand with a real life situation. I'm trying to develop a dimension for Service Providers which can include doctors, nurses/mid-levels and facilities such as hospitals and labs. One of our data sources has over a dozen different source identifiers with a dozen different spellings for the same hospital. Presently, the only way to query the data for this hospital using SQL is to use a variety of "Like" clauses -- just insane. So, if pre-consolidating this provider to a single key is not the correct approach, per ngalemmo, wouldn't a process need to be in place to force agreement of the attributes instead so that a simple query of "Select count(admissions) from Claims where Provider = 'General Hospital' " can be executed?





Last edited by pzajkowski on Tue Jul 13, 2010 1:23 pm; edited 1 time in total

pzajkowski

Posts : 31
Join date : 2009-08-10

View user profile

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  ngalemmo on Tue Jul 13, 2010 12:45 pm

I meant to say "do not rely on pre-process consolidation..." "reply" was a typo.

When dealing with alternate sources of identification, such as the MPI or an ID generated by a MDM system, I store that key as an alternate update key in the dimension. So, any given row in the dimension will have a natrual key, which ties it to the key of record from the source system, as well as one or more update keys that serve as keys to apply updates for specific attributes coming from an alternate source.

The relationship between natural keys and update keys is essentially the cross-reference. Should the alternate source change the identity, (which can happen in MDM systems, and possibly MPIs... but that would be a mess wouldn't it?), you simply adjust the update key assigned to that row. Subsequent updates from the alternate system will then adjust its attributes based on the changed identity.

The important thing is the relationship between facts and dimensions never change because those are based on the source identities. If an identity changes on a source system, that is not an issue as such a change must first be reconciled in the source system. A reassignment of transactions (orders, claims, etc...) must occur in the source system and those updates would be reflected as normal transaction activity in the data warehouse.

In the case of the McDonald/Mc Donald example, coming from two different source systems, the determinination of equivalency is either made by a third system (an MDM system for example), in which case there should be attributes that the third system maintains, or somebody changes the spelling in one of the systems.

In the case of the provider example, if there are different sources with different provide identities, simply have attributes for the cleansed names and expose those attributes for reporting. The important thing is not to "clean" the identifiers (provider ID, license number, etc...) as those are the natural keys needed to associate facts.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  pzajkowski on Tue Jul 13, 2010 1:23 pm

The typo was my error - my apologies.

Your mention of update keys is a bit of a new concept for me.

I need to play this out...
Health claims data includes a doctor from two different systems with different spelling variations of his name, but based on MDM rules is determined to be the same doctor. The MDM output would contain two rows that includes the source system IDs, the linking identifier between the two IDs generated from the MDM process, along with name attributes that correctly list the doctor's name. Thus, the Dim_Provider dimension would include both source system IDs, along with the MDM linking identifier as the Update Key, and the name attributes within the Dim_Provider dimension get stamped with the correct name attributes found in the MDM table. The Dim_Provider dimension generates its own ProviderKeys for each row, wherein these ProviderKeys are stamped in the ClaimsFact table according to the claims source data joining to Dim_Provider on the source provider IDs.

So, when querying for this doctor's data, I would leverage the doctor's attrubutes (name) that should be the same on both rows in Dim_Provider.

How does that sound?

Feel free to provide your own example if mine appears to be off.

pzajkowski

Posts : 31
Join date : 2009-08-10

View user profile

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  ngalemmo on Tue Jul 13, 2010 1:45 pm

That's the jist of it.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  pzajkowski on Tue Jul 13, 2010 2:49 pm

Great!

Is the concept of Update Keys mentioned in Kimballs books?

pzajkowski

Posts : 31
Join date : 2009-08-10

View user profile

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  BoxesAndLines on Tue Jul 13, 2010 3:10 pm

Do you have a MDM system? Most folks don't. If I had the same customer from multiple sources I would consolidate to a single row. It's really no different than address. I would manage the cross reference of these keys in another table. It's a complex process but well worth the reporting value. If you're not integrating the data then the ultimate value you can provide the business is diminished.
avatar
BoxesAndLines

Posts : 1212
Join date : 2009-02-03
Location : USA

View user profile

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  ngalemmo on Tue Jul 13, 2010 3:33 pm

How is "putting everything on one row" integration? Does this mean a type 2 dimension can never "integrate" data, because the same entity exists on multiple rows?

Wither you have multiple rows to represent a time series or multiple rows to represent source identification (or even a combination of both) the result is the same. You can still provide an integrated view of the data because the attributes are maintained consistantly across versions.

The problem with putting everything on one row is that it is prone to errors that are irreversible when you are integrating disparate data sources (such as 2 independent order systems). It also ignores the whole point of using surrogate keys, that is to establish a stable relationship between facts and dimensions. My argument is that the only way to ensure a stable relationship over time is that the relationships be based on the business keys found in the source system of record.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  BoxesAndLines on Tue Jul 13, 2010 7:56 pm

ngalemmo wrote:How is "putting everything on one row" integration? Does this mean a type 2 dimension can never "integrate" data, because the same entity exists on multiple rows?
Come on ngalemmo.

ngalemmo wrote:Wither you have multiple rows to represent a time series or multiple rows to represent source identification (or even a combination of both) the result is the same. You can still provide an integrated view of the data because the attributes are maintained consistantly across versions.
If I know for a fact that the same customer exists in multiple sources I want that customer represented only once in my dimension. Same for Product, Address, Investor, Provider, whatever. Otherwise you're not even in first normal form.

ngalemmo wrote:The problem with putting everything on one row is that it is prone to errors that are irreversible when you are integrating disparate data sources (such as 2 independent order systems). It also ignores the whole point of using surrogate keys, that is to establish a stable relationship between facts and dimensions. My argument is that the only way to ensure a stable relationship over time is that the relationships be based on the business keys found in the source system of record.
My approach does add complexity to the ETL process but simplifies reporting. A worthwhile exchange. I'm not quite following the SK argument. I still use SK's to identify a single person/concept/thing. I have an additional table non reporting table to manage the SK/natural key relationships.
avatar
BoxesAndLines

Posts : 1212
Join date : 2009-02-03
Location : USA

View user profile

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  pzajkowski on Wed Jul 14, 2010 11:36 am

If I know for a fact that the same customer exists in multiple sources I want that customer represented only once in my dimension. Same for Product, Address, Investor, Provider, whatever. Otherwise you're not even in first normal form.

This has been my usual tendency, one row per entity (especially Patients), but fairly recently I began questioning this approach when attempting to build a dimension of healthcare providers. It seems to me, now, that querying/reporting off of dimensional attributes doesn't necessitate the need to maintain a single row per entity in a dimension. Plus, I don't think first normal form has any relevance in dimensional modeling: every example in Kimball's books shows dimensions with all sorts of data being repeated.

One factor that makes my situation difficult, and may force me to go with the one-row-per-entity approach nonetheless, has to do with our patients being boiled down to single identifiers. If I change the way Dim_Providers (i.e., doctors) is populated to be many rows per physician (as ngalemmo generally recommends), I'm not sure if data will aggregate correctly.

For example, a table called Patient_Provider exists to reflect relationships between patients and their doctors, similar to a bridge table. Some of the data we receive downstream is rolled up to the single patient identifier (e.g., lab results). Presently, given that the patient-provider relationship is limited to a single row, I can be assured that an associated lab result will appear only once. But, if Dim_Providers is populated as multi-row per provider, then the relationship stamping in Patient_Provider will blow out to multiple rows, which means the lab result data will blow out to multiple rows. Our current structure in Dim_Provider of one active row per physician works for now because Dim_Provider is populated with only those doctors with access to our healthcare portal. We're in a new position of needing to integrate all providers into Dim_Provider who are not portal users -- the source of the additional healthcare providers comes from health insurance claims data where a plethora of source IDs exist for a single provider.

Consequently, as much as I'd like to consider implementing the technique of maintaining the Provider source keys in the dimensional structure (per ngalemmo), I may need to go with consolidating the providers from the claims data prior to loading them into Dim_Provider.

pzajkowski

Posts : 31
Join date : 2009-08-10

View user profile

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  ngalemmo on Wed Jul 14, 2010 2:57 pm

But, if Dim_Providers is populated as multi-row per provider, then the relationship stamping in Patient_Provider will blow out to multiple rows, which means the lab result data will blow out to multiple rows.

If you are creating a summary table, then it is true, they will not summarize as well because of the additional foreign key values (much the same as you would encounter summarizing facts based on type 2 dimension keys). But, if your concern is there would be some form of cartesian explosion, that is not the case. The relationships would be based upon the surrogate primary keys, and while the same provider may exist 3 times in the dimension table (again, much the same as you would encounter in a type 2 scenario), the relationships themselves are distinct and no different had the patient had a relationship with three different providers. The existance of one, three or 50 instances of the same provider has no material effect on the results of the query because the joins are based on the surrogate FK relationships, not attributes.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  pzajkowski on Wed Jul 14, 2010 4:21 pm

I'm not concerned about a cartesian effect, but merely over counting.

Right now the Patient_Provider table has a simple one-to-one relationship between a patient (who has only one key) and a physician (who has only one key). If Dim_Provider is populated with more than one active row per physician, each patient-doctor relationship will be repeated for each provider key within the Patient_Provider table. So, if a physician has two active surrogate keys in Dim_Provider, then both keys would be associated with a given single patient key in Patient_Provider. Joining Patient_Provider to PatientLabResults by patient key will now yield double the rows since PatientLabResults records data per patient key.


This could get tricky when attempting to answer a question like "Which of Dr. Smith's patients have not had two A1Cs in the past 12 months?"
Things won't be problematic, though, if attempting to answer "When was the last time Dr. Smith's patients had an LDL screening?"

pzajkowski

Posts : 31
Join date : 2009-08-10

View user profile

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  ngalemmo on Wed Jul 14, 2010 4:44 pm

It depends on how you count things... do you count based on attributes or on keys?

So, for "Which of Dr. Smith's patients have not had two A1Cs in the past 12 months?" you would count the number of A1C's over 12 months by MPI, patient ID or whatever you decide is the standard identifier... not patient surrogate key. Same thing with identifying the doctor.

It is the same thing with type 2 dimensions. If the patient dimension was type 2, and if the provider dimension was type 2, you would have the exact same situation (multiple keys representing the same patient or provider, multiple occurances of relationships, etc...), yet type 2 dimensions are well accepted and there has never been any concern about miscounting when querying using such a structure.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  BoxesAndLines on Wed Jul 14, 2010 8:23 pm

The type 2 is not an issue since there is an alternate key on the dimension to allow grouping across the versions of history. If you linked the multiple sources with a common identifier you could reproduce the same functionality.
avatar
BoxesAndLines

Posts : 1212
Join date : 2009-02-03
Location : USA

View user profile

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  ngalemmo on Thu Jul 15, 2010 12:51 pm

BoxesAndLines wrote:The type 2 is not an issue since there is an alternate key on the dimension to allow grouping across the versions of history. If you linked the multiple sources with a common identifier you could reproduce the same functionality.

And a common identifier is an attribute of the dimension... which is my point... integration occurs through attributes, not dimensional primary keys.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Is a Master Person Index (MPI) in conflict with the concepts of a dimensional structure?

Post  Sponsored content


Sponsored content


Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum