Overlapping facts
3 posters
Page 1 of 1
Overlapping facts
Hi,
I have a fact table containing diagnoses. So far, fine.
I have been given a new dataset with more diagnoses, (from when our patients visited other hospitals) and some of these new records refer to diagnoses that I already have, and some do not. The new data has only a patient ID and a date (I'm hoping to receive more columns later) but the date cannot be used to decide if it's the same diagnosis or not. There may be two diagnoses on the same day for the same patient. On the other hand the diagnosis dates are often approximate, and are unlikely to match between the two datasets.
Ideally I would decide which is the better source of data, and just use that, but unfortunately the poorer dataset has much wider scope, and is in demand from end users.
Should I put them all into one fact table with a flag to indicate the origin of the data? Maybe with a Lineage or Audit dimension?
Many thanks in advance,
Al Wood.
I have a fact table containing diagnoses. So far, fine.
I have been given a new dataset with more diagnoses, (from when our patients visited other hospitals) and some of these new records refer to diagnoses that I already have, and some do not. The new data has only a patient ID and a date (I'm hoping to receive more columns later) but the date cannot be used to decide if it's the same diagnosis or not. There may be two diagnoses on the same day for the same patient. On the other hand the diagnosis dates are often approximate, and are unlikely to match between the two datasets.
Ideally I would decide which is the better source of data, and just use that, but unfortunately the poorer dataset has much wider scope, and is in demand from end users.
Should I put them all into one fact table with a flag to indicate the origin of the data? Maybe with a Lineage or Audit dimension?
Many thanks in advance,
Al Wood.
Al Wood- Posts : 46
Join date : 2010-12-08
Re: Overlapping facts
When dealing with multiple data sources, it is always a good idea to tag rows with a source identifier. Usually you have a load/audit dimension and the FK to it is placed on everything from that load (facts and dimensions).
As far as loading the data, it would seem to me you would want to capture each individual encounter with whatever diagnosis are associated with it. If you are trying to relate individual encounters into an episode of care, you would implement it by adding an identifier (i.e. assign an arbirary number) to the fact table so that encounters can be grouped. I would not attempt to alter the source data in any way.
You may also want to consider using DRG's rather than attempt to compare individual diagnosis. Differences in coding and precision by the physician make direct comparison difficult in some cases.
As far as loading the data, it would seem to me you would want to capture each individual encounter with whatever diagnosis are associated with it. If you are trying to relate individual encounters into an episode of care, you would implement it by adding an identifier (i.e. assign an arbirary number) to the fact table so that encounters can be grouped. I would not attempt to alter the source data in any way.
You may also want to consider using DRG's rather than attempt to compare individual diagnosis. Differences in coding and precision by the physician make direct comparison difficult in some cases.
Re: Overlapping facts
I would keep the 2 fact tables separate. If you need to query both sets of data, create a union view between the 2 tables, excluding the records in the secondary source from your hospital.
I hate mixing data from different sources in a single fact table particularly when one is of substantially lower quality.
I hate mixing data from different sources in a single fact table particularly when one is of substantially lower quality.
Jeff Smith- Posts : 471
Join date : 2009-02-03
It's worse
Hi Jeff,
I'm afraid if I union the datasets I can't exclude the diagnoses from our hospital, because in the poorer dataset there is no indication of where the diagnosis occurred. It doesn't have that information yet! I have no indication of when it might come.
I read somewhere that data warehouses are good at highlighting gaps in data. I need a structure that allows this to happen!
Al Wood.
I'm afraid if I union the datasets I can't exclude the diagnoses from our hospital, because in the poorer dataset there is no indication of where the diagnosis occurred. It doesn't have that information yet! I have no indication of when it might come.
I read somewhere that data warehouses are good at highlighting gaps in data. I need a structure that allows this to happen!
Al Wood.
Al Wood- Posts : 46
Join date : 2010-12-08
Re: Overlapping facts
Just curious... In the poorer dataset, how complete is the data from your hospital? I would still be relunctant to mix the 2 datasets in one fact table.
Gaps in dimensional data can be dealt with but gaps in fact data can't. For example, lets say I have a product dimension. I load the product dimension primarily from a product table I get from a source system. When I load the transaction data, I discover a product code that is not in the source system's product table (just an example). I have 2 choices - 1) I can assume the new product code is bad data and put a -1 in the product dimension key, or 2) I can assume that the product code is legit but just hasn't made it to the extract of the product table I get from the source system, in which case I add the new product code from the transaction data to the product dimension. When it finally shows up in the product table, I will update the other columns for the product code in the product dimension table.
If the objective is to identify the treatment patterns of patients based on diagnosis, then there may be an alternative solution. If there are a known number of diagnosis you could create a diagnosis dimension - maybe you are only interested in diabetes, cholesteral and other chronic diseases. You could create a junk dimension table with one column for each diagnosis (or a set of columns, maybe a score for severity, a discriptor, etc). You could create a snowflake design with the diagnosis dimension linked to the patient dimension, calling it "current diagnosis" or "current chronic disease". The diagnosis dimension key in the patient dimension could be updated from both sources of data. This type of design would enable a researcher to follow a patient with a particular disease overtime.
I work in dental and we only have 2 chronic diseases so it's a little simpler. I'd be interested in reading about how your DW is used. We are starting to increase the amount of "prevention" analysis that is done using our DW. It's really very simple - brush, floss, get cleanings and, for children, get sealants and fluoride. That's it. If people did that, their dental costs would drop dramatically.
Gaps in dimensional data can be dealt with but gaps in fact data can't. For example, lets say I have a product dimension. I load the product dimension primarily from a product table I get from a source system. When I load the transaction data, I discover a product code that is not in the source system's product table (just an example). I have 2 choices - 1) I can assume the new product code is bad data and put a -1 in the product dimension key, or 2) I can assume that the product code is legit but just hasn't made it to the extract of the product table I get from the source system, in which case I add the new product code from the transaction data to the product dimension. When it finally shows up in the product table, I will update the other columns for the product code in the product dimension table.
If the objective is to identify the treatment patterns of patients based on diagnosis, then there may be an alternative solution. If there are a known number of diagnosis you could create a diagnosis dimension - maybe you are only interested in diabetes, cholesteral and other chronic diseases. You could create a junk dimension table with one column for each diagnosis (or a set of columns, maybe a score for severity, a discriptor, etc). You could create a snowflake design with the diagnosis dimension linked to the patient dimension, calling it "current diagnosis" or "current chronic disease". The diagnosis dimension key in the patient dimension could be updated from both sources of data. This type of design would enable a researcher to follow a patient with a particular disease overtime.
I work in dental and we only have 2 chronic diseases so it's a little simpler. I'd be interested in reading about how your DW is used. We are starting to increase the amount of "prevention" analysis that is done using our DW. It's really very simple - brush, floss, get cleanings and, for children, get sealants and fluoride. That's it. If people did that, their dental costs would drop dramatically.
Jeff Smith- Posts : 471
Join date : 2009-02-03
Similar topics
» Inventory facts and production facts
» Overlapping Dates in a dimension
» How best to model Timesheet facts against Sales Order facts
» Multiple Facts or Single Facts and Status Table?
» In which layer to relate facts to other facts?
» Overlapping Dates in a dimension
» How best to model Timesheet facts against Sales Order facts
» Multiple Facts or Single Facts and Status Table?
» In which layer to relate facts to other facts?
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum
|
|