Overlapping facts

Al Wood Mon Jan 03, 2011 1:48 pm

Hi,

I have a fact table containing diagnoses. So far, fine.
I have been given a new dataset with more diagnoses, (from when our patients visited other hospitals) and some of these new records refer to diagnoses that I already have, and some do not. The new data has only a patient ID and a date (I'm hoping to receive more columns later) but the date cannot be used to decide if it's the same diagnosis or not. There may be two diagnoses on the same day for the same patient. On the other hand the diagnosis dates are often approximate, and are unlikely to match between the two datasets.

Ideally I would decide which is the better source of data, and just use that, but unfortunately the poorer dataset has much wider scope, and is in demand from end users.

Should I put them all into one fact table with a flag to indicate the origin of the data? Maybe with a Lineage or Audit dimension?

Many thanks in advance,
Al Wood.

ngalemmo Tue Jan 04, 2011 2:55 am

When dealing with multiple data sources, it is always a good idea to tag rows with a source identifier. Usually you have a load/audit dimension and the FK to it is placed on everything from that load (facts and dimensions).

As far as loading the data, it would seem to me you would want to capture each individual encounter with whatever diagnosis are associated with it. If you are trying to relate individual encounters into an episode of care, you would implement it by adding an identifier (i.e. assign an arbirary number) to the fact table so that encounters can be grouped. I would not attempt to alter the source data in any way.

You may also want to consider using DRG's rather than attempt to compare individual diagnosis. Differences in coding and precision by the physician make direct comparison difficult in some cases.

Jeff Smith Wed Jan 05, 2011 2:12 pm

I would keep the 2 fact tables separate. If you need to query both sets of data, create a union view between the 2 tables, excluding the records in the secondary source from your hospital.

I hate mixing data from different sources in a single fact table particularly when one is of substantially lower quality.

Al Wood Wed Jan 05, 2011 3:09 pm

Hi Jeff,

I'm afraid if I union the datasets I can't exclude the diagnoses from our hospital, because in the poorer dataset there is no indication of where the diagnosis occurred. It doesn't have that information yet! I have no indication of when it might come.

I read somewhere that data warehouses are good at highlighting gaps in data. I need a structure that allows this to happen!

Al Wood.

Jeff Smith Wed Jan 05, 2011 3:37 pm

Just curious... In the poorer dataset, how complete is the data from your hospital? I would still be relunctant to mix the 2 datasets in one fact table.

Gaps in dimensional data can be dealt with but gaps in fact data can't. For example, lets say I have a product dimension. I load the product dimension primarily from a product table I get from a source system. When I load the transaction data, I discover a product code that is not in the source system's product table (just an example). I have 2 choices - 1) I can assume the new product code is bad data and put a -1 in the product dimension key, or 2) I can assume that the product code is legit but just hasn't made it to the extract of the product table I get from the source system, in which case I add the new product code from the transaction data to the product dimension. When it finally shows up in the product table, I will update the other columns for the product code in the product dimension table.

If the objective is to identify the treatment patterns of patients based on diagnosis, then there may be an alternative solution. If there are a known number of diagnosis you could create a diagnosis dimension - maybe you are only interested in diabetes, cholesteral and other chronic diseases. You could create a junk dimension table with one column for each diagnosis (or a set of columns, maybe a score for severity, a discriptor, etc). You could create a snowflake design with the diagnosis dimension linked to the patient dimension, calling it "current diagnosis" or "current chronic disease". The diagnosis dimension key in the patient dimension could be updated from both sources of data. This type of design would enable a researcher to follow a patient with a particular disease overtime.

I work in dental and we only have 2 chronic diseases so it's a little simpler. I'd be interested in reading about how your DW is used. We are starting to increase the amount of "prevention" analysis that is done using our DW. It's really very simple - brush, floss, get cleanings and, for children, get sealants and fluoride. That's it. If people did that, their dental costs would drop dramatically.

Sponsored content

Overlapping facts

Overlapping facts

Re: Overlapping facts

Re: Overlapping facts

It's worse

Re: Overlapping facts

Re: Overlapping facts