Creating Crosswalk Tables for Multiple Data Sources

mhoover Thu Jan 19, 2012 9:33 am

Good morning all,

I am currently building a dw consisting primarily of medical data. We have several disparate systems in house that contain the same type of information (for this example, I'll use Person). In addition, we regularly receive data from several external organizations that is needed in our reports. In some cases there is already a linking table in place. My question is this:

Should I be creating a crosswalk table (Master Index) to assign a unique value for each entity, regardless of any currently existing ID in place? The thought is, if we switch to a new software solution and migrate the patients, we can link the new IDs assigned in that system to the ID assigned in the Master Index. In most cases the new Master Index ID will only be stored in the crosswalk table and the Patient dimension. All other tables will be linked via a unique surrogate key (dPatient is a slowly changing dimension).

In the future, should we discover duplicate records in the crosswalk table, we will be able to change the Master Index ID to match the previously existing record and allow referential integrity to update the key anywhere it might be used in the system, essentially allowing two separate slices of data for the same entity to become one.

Does this seem like a good practice, especially considering we receive a multitude of data from outside sources all with unique identifiers?

Your feedback is appreciated.

Sincerely,

Matthew Hoover

ngalemmo Thu Jan 19, 2012 11:21 am

Is the intent to store a person only once in the dimension table? How would you build the cross reference if there isn't a consistent key among the sources?

mhoover Thu Jan 19, 2012 11:42 am

The primary goal is to have one record for each entity when enough data is available to uniquely ID that entity, but allow the warehouse enough flexibility to create new entities without relying on the top hierarchy ID from the source tables.

In some cases there won't be enough data available to link an individual entity's multiple Person records together. However, as more data is gathered or delivered, it may be possible (for example, a SSN is received). What is used to link multiple records from separate systems will likely be decided by the business units that "own" the data.

In some cases we have a pre-existing link table that ties two or more systems together, but the top hierarchy ID is still unique to one of the systems and could change if a new system is brought online and the records are transferred to that system.

Here is an example using Patient Entity:

SystemA is is the origin point for internal Patients, SystemB represents patient data from an external source that may or may not contain patients from systemA.

SystemA patient table:
ID LastName FirstName SSN
1 Simpson Homer 000-00-1234

SystemB patient table:
H1 Simpson H 000-00-1234
M1 Mouse M 123-00-0000

Before data from SystemB is loaded, the MasterIndex Table would look like:
mik SourceSystem SourceSystemKey_INT SourceSystemKey_Var
123 SystemA 1

After SystemB is loaded and SystemB patients are mapped to SystemA patients by SSN, the MasterIndex table would look like:
mik SourceSystem SourceSystemKey_INT SourceSystemKey_Var
123 SystemA 1
123 SystemB H1
124 SystemB M1

When the patient dimension is created, we would use the mik as a business key instead of the source system key.

In the event that the patient has two distinct records in the MasterIndex and enough data is received to link the two, the mik can be changed for one record. Referential integrity rules would then update the mik in the Patient dimension and the current indicator flag would be set to N. This process would effectively link ALL of the fact data for that patient to a single patient entity.

Does that make sense?

Does that make sense?

mhoover Thu Jan 19, 2012 11:56 am

Let's say that linking data between a minimum of two systems will be delivered with the raw data during the ETL process.

Since a patient won't always exist in one system and the warehouse should be system agnostic, a business key of some sort needs to be generated as the top level of a entity's hierarchy.

If the entity doesn't have a matching record in the Master Index, either from the current system being loaded or another system via link data, a new Master Index record is created.

ngalemmo Thu Jan 19, 2012 12:36 pm

If you are receiving different information from different sources without any clear, precise identification, I would not attempt to consolidate them into a single dimension row. Create rows for each unique source/natural key instance and add 'consolidation' attribute columns to the dimension table. The consolidation attributes will contain values (such as internal patient identifier (i.e. the xref), name, SSN, and other patient attributes) that are the same across all rows with the same internal patient identifier. Using these attributes in queries will consolidate facts. Consolidation errors can be correct by updating the internal identifier on the dimension row (and the other consolidation attributes) without affecting fact table FKs.

mhoover Thu Jan 19, 2012 1:29 pm

I believe that is the effect I have in mind when there is no link information available.

The internal identifier you mention is what I have in mind when referring to MasterIndexID.

For instance, we have three systems that all share the same Person table. Two of these systems use different key values from the same table, and two of them use the same key but with a different name. These systems are all under external vendor control and can be changed at any time (which is why we have the three ID mess).

In my head I envision one internal identifier relating to three SourceSystem records. When doing the ETL, the key from the source system is linked to the internal identifier and data is loaded into the dw with that identifier. In addition, a SourceSystem attribute would be contained with each fact and dimension (assuming all delivered data comes from one system, which in most cases, it will).

The key from the source system would also be needed in the dimension as end-users may use it to filter values on a report.

If I'm understanding you correctly (and I could be way off), you are saying it would be best to use a composite key of the Origin System and Origin System's Unique Patient ID for the business key. An additional attribute would be added to the dimension to link all instances of a unique Patient together. This would allow a user to query based on person and source system as well as all records for that patient regardless of originating system.

So a simplified view of the dimension would look something like this:

SurrogateKey (Serialized ID)
BusinessKey (OriginatingSystem & Originating System's PatientID, needed for Data Manager to perform updates and track Hx)
OriginatingSystem
OriginatingSystemPatientID
MPI (Master Patient Index, a patient's unique internal identifier within the dw)
DOB
DOD
SSN
LastName
FirstName
FullName
.......
etc
.......

If it's discovered in the future that two dimension entries belong to the same entity, we simply change the MPI of one to reflect the other.

Does that sound accurate?

mhoover Fri Jan 20, 2012 9:34 am

OK, now that I've slept on it (more like dreamed about it) and read some other posts that are related to my situation, I can see the benefits of maintaining multiple entries in the patient dimension for the same entity.

I'm looking at modeling the dimension with a business key consisting of both the source systemID (integer derived from a lookup table) and the unique identifier from that system. In addition there will be a MPI (Master Patient Identifier) table created that links the multitude of patient entities from the disparate systems together. This MPI will be unique to our internal organization, regardless of vendor software or data source. The MPI will then be stored as an attribute in the dimension.

In the future, if it's discovered that multiple entries do belong to the same entity, it would simply require changing that attribute's value.

The benefit of this design will allow users to easily see the data as it comes from the sources system(s) but also allow for all data from all systems to be linked based on the MPI.

Sponsored content

Creating Crosswalk Tables for Multiple Data Sources

Creating Crosswalk Tables for Multiple Data Sources

Re: Creating Crosswalk Tables for Multiple Data Sources

Re: Creating Crosswalk Tables for Multiple Data Sources

Re: Creating Crosswalk Tables for Multiple Data Sources

Re: Creating Crosswalk Tables for Multiple Data Sources

Re: Creating Crosswalk Tables for Multiple Data Sources

Re: Creating Crosswalk Tables for Multiple Data Sources

Re: Creating Crosswalk Tables for Multiple Data Sources