Data Deduplication in Dimension Design


Posted by g8rpal on Mon Oct 12, 2015 9:13 am

Hi Kimball fans!
I just wanted to write about a design scenario and some approaches and see if anyone could offer any insight, advice, or opinions about how they have seen similar situations modeled in their careers.
Our source system has a table for employees in which the employee code is unique per row, but there are clearly cases where the same person has multiple employee codes. Should the employee dimension in the data warehouse have one row per employee code, or one row per employee? I would, of course, say one row per employee, except in the case of a type 2 SCD, in which case we would want one row per employee data warehouse natural key + effective date.
The first approach I am considering is based on Kimball's design tip #67, Maintaining Back Pointers to Operational Sources (http://www.kimballgroup.com/2005/05/design-tip-67-maintaining-back-pointers-to-operational-sources/).
Basically, he recommends that you keep a single cross reference table to hold all the original employee codes with 3 fields: data warehouse natural employee key, source system name, source system employee code. So in addition to my conformed employee dimension, I would have a table to hold these values.
I have also recently started working with Microsoft DQS / MDS and am working on how this would all work with the ETL as well. I embellished on this design tip a bit and my cross reference table starts in Staging with a data warehouse natural key that consists of the employee name + birth date + social security number, an employee code in, an employee code out, and a Ranking column in which only 1 row per employee DW natural key would have a 1. This is the surviving row. Simple business logic could rank them based on their creation date descending (for example) to make a first pass at de-duplication. This first pass could be very simple or more complex if one were comfortable with automated matching using the fuzzy grouping capabilities of SSIS. The records could then go through DQS cleansing and a data steward could do more complex de-duplication based on the tools it provides.
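The first-pass ranking logic described above can be sketched in a few lines. This is a minimal illustration, not the actual staging ETL: the column names (`name`, `birth_date`, `ssn`, `created`) stand in for the DW natural key components and creation date mentioned in the post, and the "latest creation date wins" rule is just the example survivorship rule suggested there.

```python
from itertools import groupby
from operator import itemgetter

def rank_survivors(rows):
    """First-pass de-duplication: group staging rows by the DW natural
    key (name + birth date + SSN) and rank by creation date descending.
    Only the newest row per person gets Ranking = 1 (the survivor)."""
    keyfunc = itemgetter("name", "birth_date", "ssn")
    ranked = []
    # groupby requires the input to be sorted on the grouping key.
    for _, group in groupby(sorted(rows, key=keyfunc), key=keyfunc):
        for rank, row in enumerate(
            sorted(group, key=itemgetter("created"), reverse=True), start=1
        ):
            ranked.append({**row, "ranking": rank})
    return ranked

rows = [
    {"name": "Ann Lee", "birth_date": "1980-01-01", "ssn": "123",
     "employee_code": "E1", "created": "2014-06-01"},
    {"name": "Ann Lee", "birth_date": "1980-01-01", "ssn": "123",
     "employee_code": "E7", "created": "2015-02-15"},
]
survivors = [r for r in rank_survivors(rows) if r["ranking"] == 1]
```

In practice this simple pass would be followed by DQS cleansing and steward review, as described above; fuzzy grouping would replace the exact-key `groupby` for tolerant matching.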
As I load the fact table, I have to go through the cross-reference table, matching on the employee code in, to get the employee code out, which identifies the surviving member.
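That fact-load lookup amounts to a code-in-to-code-out translation. A minimal sketch, with the cross-reference modeled as a plain dictionary (a hypothetical stand-in for the staging table):

```python
# Hypothetical cross-reference: employee code in -> surviving code out.
xref = {"E1": "E7", "E7": "E7"}

def resolve_employee(fact_row, xref):
    """During the fact load, translate the incoming employee code to
    the surviving member's code via the cross-reference table.
    Unmatched codes pass through unchanged."""
    code_in = fact_row["employee_code"]
    return {**fact_row, "employee_code": xref.get(code_in, code_in)}

fact = resolve_employee({"employee_code": "E1", "amount": 100.0}, xref)
```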
For full auditing and traceability, I would also need a similar table to record the actual business natural keys for each row loaded in the fact table. If I am lucky enough to have a record # for each transaction, this is simple; otherwise, I'd have to devise a similar DW natural key for the fact table.
In the semantic layer of the cube, I would prefer not to expose the original employee code of the surviving employee record. If I had to, I would try to make it a property of the employee name attribute. If certain people actually had to see the original employee code from the fact table, I could create a view that linked the fact row to the source system code records through an additional-actions drill-through controlled by role security. Other users' drill-through action would only show the surviving employee code.

Another approach (which I don't particularly like) would be to make the employee code attribute an SCD type 2. In my case there were multiple occurrences of the same employee having different codes in use on the same day, so this would not be possible.
The third approach, which I don't really like either but think is acceptable, is based on something Kimball wrote in Essential Steps for the Integrated Enterprise Data Warehouse, Part 2 (http://www.kimballgroup.com/2008/04/essential-steps-for-the-integrated-enterprise-data-warehouse-part-2/).

In this approach, I would assign a surrogate durable natural key to each employee. Essentially, the duplicated employee codes would roll up to the surviving record through attribute rollups. I would presumably also need a durable employee name, a durable employee primary job, etc., for every attribute worth rolling the duplicates up to. The advantage of this approach is that the ETL and the physical model are a little less complex, and it's easier for the end user to get back to the original employee code associated with the fact record.
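The durable-key assignment in this third approach can be sketched as follows. This is an illustration only; the person-matching key (name + birth date + SSN) and column names are hypothetical, and a real implementation would use the fuzzy-matched survivorship results rather than an exact tuple match.

```python
def assign_durable_keys(dim_rows):
    """Third approach: keep one dimension row per employee code, but
    add a durable employee key shared by every code that belongs to
    the same person, so duplicates roll up through the attribute."""
    durable = {}
    out = []
    for row in dim_rows:
        person = (row["name"], row["birth_date"], row["ssn"])
        # The first code seen for a person claims the next durable key.
        key = durable.setdefault(person, len(durable) + 1)
        out.append({**row, "durable_employee_key": key})
    return out

dim = assign_durable_keys([
    {"employee_code": "E1", "name": "Ann Lee",
     "birth_date": "1980-01-01", "ssn": "123"},
    {"employee_code": "E7", "name": "Ann Lee",
     "birth_date": "1980-01-01", "ssn": "123"},
    {"employee_code": "E9", "name": "Bob Cox",
     "birth_date": "1975-05-05", "ssn": "456"},
])
```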

The fourth approach would be to load the data into the dimension as one row per employee code. I do not think this is the right answer, for many reasons, but mainly because it would make it extremely difficult to answer the question of how many employees did this or that. The burden would be placed on the analyst or report writer to write complex logic to get a distinct count of people (not employee codes) associated with an event or dollar amount. I absolutely agree with Kimball when he said that "one of the most valuable services that the DW/BI system can provide is de-duplication of people and organizations."

I am very interested to hear what the Kimball community has to offer on this subject. Thank you very much for your time.

Mike


Re: Data Deduplication in Dimension Design

Posted by ngalemmo on Tue Oct 13, 2015 10:19 pm

It depends on what the employee code represents in the operational system. What does the business use to identify an employee? If the employee code is that identifier, then the employee code should be the natural key. If it has some other purpose, then it should not.

The natural key to a dimension should always be the identifier the business uses. This way, transaction data from various sources can use this key to reference the dimension. Of course, within the warehouse the natural key is replaced with a surrogate primary key on the dimension and facts.

If employee code is that key, there is no harm in storing multiple rows containing identical employee attributes in a type 1. Or you simply store data as-is rather than try to standardize it. The attributes, such as employee_id, drive analysis, not the key. For example, you simply do things like count(distinct employee_id) to get an employee count.
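The count(distinct employee_id) pattern ngalemmo describes can be shown in a small sketch, with a hypothetical dimension loaded at one row per employee code and an employee_id attribute identifying the person behind each code:

```python
# One dimension row per employee code; employee_id is the attribute
# that identifies the person (two codes here belong to one person).
dim = [
    {"employee_code": "E1", "employee_id": "P100"},
    {"employee_code": "E7", "employee_id": "P100"},
    {"employee_code": "E9", "employee_id": "P200"},
]

# The attribute, not the natural key, drives the headcount, so
# duplicate codes don't inflate it (the analog of
# count(distinct employee_id) in SQL).
employee_count = len({row["employee_id"] for row in dim})
```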

The issue with a single image of an employee is a data integration issue, not a dimension issue. The whole point of a natural key and its surrogate is to: 1) identify a row from the source system's point of view, and 2) maintain a stable relationship with the fact tables that reference it. That's it. It doesn't matter what attributes there are or how their values are set. All that matters is that the same row is always joined to the same facts. So, if by some means (such as an MDM system) you arrive at an accepted employee image, simply apply it to all dimension rows for that employee. You can also maintain the original attributes if you like. These may be useful in evaluating the performance of the MDM system.
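Applying an accepted image to all of a person's dimension rows, while keeping the originals for audit, might look like this sketch (the MDM lookup and column names are hypothetical):

```python
def apply_master_image(dim_rows, master_by_id):
    """Overwrite attributes on every dimension row for a person with
    the accepted MDM image, preserving the original value so the
    MDM system's matching can be evaluated later."""
    out = []
    for row in dim_rows:
        image = master_by_id.get(row["employee_id"])
        if image:
            # Keep the original attribute alongside the mastered one.
            row = {**row, "orig_name": row["name"], **image}
        out.append(row)
    return out

rows = apply_master_image(
    [{"employee_code": "E1", "employee_id": "P100", "name": "A. Lee"},
     {"employee_code": "E7", "employee_id": "P100", "name": "Ann Lee"}],
    {"P100": {"name": "Ann Lee"}},
)
```

Both rows keep their own natural key and surrogate, so their fact relationships are untouched; only the descriptive attributes converge.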

Re: Data Deduplication in Dimension Design

Posted by g8rpal on Wed Oct 21, 2015 7:58 am

ngalemmo,

Thank you for your generous reply. You've given me a lot to think about. I also started to think about the additional complexity that would be required by the first approach I described if the dimension participates in a many-to-many relationship. The approach of loading the dimension at one row per employee code and rolling them up to a durable master record still bothers me conceptually, for a couple of reasons...

1. If I want to get a distinct count of the employees (meaning the physical person, not the IDs), I either have to have the codes roll up to a durable employee key (durable employee name, etc.) and count that, or have some other combination of attributes that ensures one row per person. It doesn't make the warehouse easy to use for an analyst, in my opinion.

2. Later on, if we ingest another source system that happens to have employee IDs that overlap, we'd have to append an identifier or add another attribute to describe the source system. I just don't see much analytical value in exposing an arbitrary, unintelligent business code in the semantic layer. These codes are only useful during the data validation phase, or for certain individuals within the organization who may have a real need to trace the row back to the original source system record.

#2 is and always will be an uphill battle. Many times in my career, my first pass at building an Analysis Services cube has started a discussion with a business user because some source system code isn't an attribute of a cube dimension, even though the code was in the underlying table in the data warehouse. My reasoning was that we don't want to create aggregations on something that doesn't make sense to slice by or filter on from a bird's-eye point of view. If you need to see the detail, you drill through to the relational source. Even so, many times I've had to relent because the business was so insistent that they needed it. If I'm lucky, I can convince them that the right decision is to make it a property of the name attribute, so they can see it on their pivot table next to the name but it remains unavailable for slicing or filtering, and the unnecessary aggregations aren't stored.

The problem with the code-in/code-out design pattern I described first is that I have to create physical structures in the EDW to store this crosswalk information, and another to store the actual "code in" and "code out" for each row in the fact table. However, I think that is the better alternative in a purist sense, in that I want it to be easy to get distinct counts of the conceptual members of a dimension; I would imagine this is what the business user would want more often than counts of source system codes, especially considering that the same person can have two different codes.

My plan would be to create two "additional actions" drill-through capabilities in the cube. The one for the majority of users would only present the surviving source system code and would perform the best. The other would do the additional joins necessary to retrieve the original code in and the code out, and it would be slower. It would only be used during data validation and in edge cases where there is a need for that knowledge.

In this particular case, I am likely going to take the approach you describe and load the employees at one row per employee code. I have a durable employee key and attributes that the duplicate employees can roll up to. In this case, there aren't that many duplicates, this is the first iteration, and I can appreciate the business users' reluctance to let go of the employee code.

The next case is very different. We have a contact table that is the wild west. In talking to the developer, I learned there is nothing done on the front end to prevent a user from entering a new contact with the very same, or slightly different, information and assigning a new contact code. If I perform a first-pass de-duplication, say by first, middle, and last name plus a nationally recognized unique identifier, one provider can have 8 different contact codes. Even worse, there isn't any rhyme or reason why a transaction would record one person's first contact code over their third. The business doesn't refer to contacts by their contact code, so in this case I will likely use the code-in/code-out method I described.
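The first pass over the contact table described above — grouping contact codes by name plus a national identifier — can be sketched like this. The column names (`first`, `middle`, `last`, `national_id`) are hypothetical stand-ins, and a real pass would likely use fuzzy matching rather than an exact tuple:

```python
from collections import defaultdict

def group_contact_codes(contacts):
    """First-pass de-duplication of the contact table: collect all
    contact codes that share first/middle/last name plus a nationally
    recognized unique identifier (one person, many codes)."""
    groups = defaultdict(list)
    for c in contacts:
        person = (c["first"], c["middle"], c["last"], c["national_id"])
        groups[person].append(c["contact_code"])
    return dict(groups)

groups = group_contact_codes([
    {"first": "Jo", "middle": "A", "last": "Day",
     "national_id": "N1", "contact_code": "C1"},
    {"first": "Jo", "middle": "A", "last": "Day",
     "national_id": "N1", "contact_code": "C5"},
])
```

Each group would then feed the code-in/code-out crosswalk, with one code per group chosen as the survivor.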

I'm still on the fence about the employee dimension, but I think I am leaning towards what you describe. I'll probably change my mind again three times before the end of the day. Would you agree that the code-in/code-out method is better for the contact table?

Again, I really appreciate your thoughtful response; you've given me a lot to think about. This is an interesting source system, that's for sure, and it is going to be a fun challenge to improve their data quality and provide a powerful platform for analytics.

As a side note, are there any other dimensional modeling forums you recommend? I am really going to miss this one if and when it goes away.

Mike

