Bridge tables versus massive junk dimensions

View previous topic View next topic Go down

Bridge tables versus massive junk dimensions

Post  ejb11235 on Tue Jun 15, 2010 2:32 pm

I am designing a star schema to store information on hospital stays. This is my first real star schema and things are off to a good start.

  • My fact table is hospital stays.
  • I have a diagnosis dimension.
  • There are multiple diagnoses per stay. Costs are not allocated per diagnosis.


My first design approach was to use a "DiagnosisGroup" bridge table. There would be a Many-to-1 relationship between Hospital Stays and Diagnosis Groups, and 1-to-Many relationship between Diagnosis Groups and Diagnoses. So far so good ... until I wrote my first query:

select count(*) as NStays
from HospitalStayFact HSF, DiagnosisGroup DG, DiagnosisDim DD
where (HSF.DiagnosisGroupKey=DG.DiagnosisGroupKey) and (DG.DiagnosisKey=DD.DiagnosisKey)
;

Oh wait! ... this is going to cause each hospital stay to be counted not once but once for every diagnosis code! So if I use a DiagnosisGroup approach, am I forced into using a subquery, or writing a query like the following?

select count(*) as NStays
from HospitalStayFact HSF,
(select distinct DiagnosisGroupKey from DiagnosisGroup DG join DiagnosisDim DD on DG.DiagnosisKey=DD.DiagnosisKey) SQ
where (HSF.DiagnosisGroupKey=SQ.DiagnosisGroupKey)
;


So then, it occured to me that maybe I was misusing the bridge table concept anyway, and that I should be using a massive junk dimension, with one column per diagnosis code. Which, if you've worked with health care data, means I have thousands of columns, and since they won't fit into a single table, means I'll have many tables. Which I'm not totally adverse to ... it gives me great performance with Infobright, but my build programs are going to be interesting.

Thoughts?

Thanks in advance!

--eric

ejb11235

Posts : 10
Join date : 2010-06-15

View user profile

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  ngalemmo on Tue Jun 15, 2010 4:01 pm

"This damned computer,
I wish that they would sell it.
It never does what I want,
Only what I tell it."

If you build a query to count stays by diagnosis, what would you expect the result to be? It has nothing to do with the bridge and everything to do with what you are asking.

A more common business question would be "What are the number of stays by PRIMARY diagnosis", such a count makes sense, as there is only one such diagnosis per stay.

What your model fails to do is properly identify the primary diagnosis. There are two ways to do this... either add a flag or count in the diagnosis bridge table on which you can filter, or, have a primary diagnosis foreign key to the diagnosis dimension on the fact table.

Since primary diagnosis is an important descriminator in clinical analysis, often both are done.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  ejb11235 on Tue Jun 15, 2010 4:20 pm

I am aware that my model does not currently identify the primary diagnosis, and I will need to address that issue at some point. But that is not the design problem I am currently focusing on. The problem I am focusing on right now is the requirement that I be able to select hospital stays based on the presence of diagnosis codes, not necessarily primary diagnosis codes.

I have been focusing on the wonders of the star schema and how simple the queries were. In my rush to build my pilot database, I completely overlooked the issue of multiple counting, even though I had read about it in several places. When I looked at the SQL required to solve the problem when bridge tables are used, I was a bit taken aback. More so, I became concerned about the ability of business intelligence/data warehousing tools to deal with this automatically. (I haven't begun to work with these tools yet, so my understanding of how these tools are configured is probably quite muddled.)

In any case, my question, although it can be expressed as a concrete design challenge, is more general than specific. And that question is, when you have thousands of attributes for each fact, is it "better", in your experience to use a bridge table approach and deal with the resulting complexities in the queries, or is it "better" to use discrete columns for each attribute and deal with the resulting complexity in the table design, data loading processes, and possibly increased meta data?

ejb11235

Posts : 10
Join date : 2010-06-15

View user profile

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  ngalemmo on Tue Jun 15, 2010 4:36 pm

You don't have 'thousands' of attributes. You have a handful of diagnosis codes.

The question becomes, what exactly are you trying to do? Are you trying to count stays or are you trying to count stays that have some kind of diagnosis? Is it possible to have a stay that does not have a diagnosis?

"The problem I am focusing on right now is the requirement that I be able to select hospital stays based on the presence of diagnosis codes, not necessarily primary diagnosis codes."

The problem is your model lacks the necessary information to resolve your query. If your intent is to not count stays that have no diagnosis, then you need a diagnosis sequence number (1, 2, 3...) (where 1 is primary) on the diagnosis group table then filter for seq number = 1. You do not need to join to the diagnosis table. A 'primary diagnosis' flag would have the same effect. If a stay doesn't have a primary diagnosis, it would be resonable to assume is has no diagnosis.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  ejb11235 on Tue Jun 15, 2010 5:26 pm

Oh boy, we're not communicating ... let me see if I can clarify ...

#1 -- when I said "presence of diagnosis codes not primary codes" ... all I meant was that right now I don't care if the diagnosis code is primary or not primary. I did not mean I was interested in the number of diagnosis codes.

#2 -- wrt do I have a handful of diagnosis codes or thousands of attributes. I agree with you from a business modelling perspective. Unfortunately, I omitted some information from my original posting that might have clarified things a bit:

I am looking for fast query response times. By "fast" I mean under a minute ... but 15 seconds would be even nicer. I have been doing a pilot study in order to determine what kind of query response time I can get, and how much work it takes to get it. So for the last two weeks I have been doing some performance testing using Infobright Community Edition. I created my database schema and loaded it with over 40 million facts. The focus in my testing has been speed, speed, and more speed.

My test queries were along the lines of "how many hospitalizations were there where a diagnosis code of 245.12 was present?", or "...where the diagnosis code 245xx is present", or "...where a diagnosis code of 245.01, 250.43, or 263.03" was present.

When I use a bridge table approach I get responses time in the 30 to 40 second range. If I create separate flag columns for each diagnosis code, I have been able to achieve response times as fast as six seconds. The response times in Infobright are highly dependent on both the number of rows that meet my search criteria, as well as the number of columns used in the query. So a database design where I have, for example, 16 diagnosis code slots, will be very slow because of the number of columns that will be referenced in the query.

I have also considered using aggregates, but I don't think that's going to work, for reasons I won't get into here.

My testing has revealed that Infobright can perform amazing compression on the data. For example, it took the data stored in a 2.5GB tab-delimited file and stored it in 1.35MB. The bad side of this is that the CPU cores are pegged when Infobright processes queries, so I can't increase performance by simple things like disk striping. Infobright (at least the community edition) processes queries single-threaded, so simply throwing a multi-core processor at the problem doesn't gain me anything. I could, of course, partition my database, most likely by year, and then run a different query on each set of tables simultaneously, and then roll-up the results in my application. This may very well be the approach I eventually take.

For now, I have decided to set my client's expectations using the performance numbers I get with a bridge table design. Hopefully the numbers will be good enough that he wants to move forward with the project.

ejb11235

Posts : 10
Join date : 2010-06-15

View user profile

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  ngalemmo on Tue Jun 15, 2010 5:53 pm

Implementing a table that has one column for every possible diagnosis code is not a workable solution. Besides, which system do you plan to use? ICD-9 or the upcoming ICD-10?

All you need to do is add an attribute to the bridge so you can easily limit a query to one row per group without needing a subquery or self-join on the bridge itself for that particular query scenario.

With a bridge it is very easy to implement efficient combinatorial queries (i.e. (x diagnosis and y diagnosis) or (z diagnosis)) without resorting to some bizzare data structure.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  ejb11235 on Wed Jun 16, 2010 11:50 am

"All you need to do is add an attribute to the bridge so you can easily limit a query to one row per group without needing a subquery or self-join on the bridge itself for that particular query scenario."

I've been trying to figure out what you mean by this ... my first thought is that you mean to add an additional column to the bridge table ... perhaps a 0/1 flag ... that's set to 1 for only one record in each group. Could you show me a sample query that demonstrates how the attribute you're suggesting would be used?

BTW, the data I am using has ICD9 codes in it.

Thanks,

--eric

ejb11235

Posts : 10
Join date : 2010-06-15

View user profile

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  ngalemmo on Wed Jun 16, 2010 1:00 pm

Your bridge table would be something like:

diag group key
diag key
primary diag flag (Y = primary diagnosis)

If a stay has diagnoses, there is always one and only one primary diagnosis. It is always important to know the primary diagnosis when doing any clinical analysis, so having such a flag should be a given for any design dealing with medical history.

So, if you only want to count a stay once by any diagnosis, it is usually counted based on the primary diagnosis.

But, if you want to count stays for some list of specific diagnosis, stays will be counted more than once if the stay has more than one diagnosis on the list, unless you construct the query differently. The challenge here is, if you don't count the stay more than once, which diagnosis do you use to report the stay? Which, again, is why the primary diagnosis is so important.

Another attribute commonly used in analysis is the DRG (a.k.a. Hopkins Grouper) which circumvents this entire issue.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  ejb11235 on Wed Jun 16, 2010 1:31 pm

Thanks!

I have been providing consulting services to my clients in the pharmaceutical industry for over ten years. While they sometimes focus their attention on the primary diagnosis, more often than not, in my experience, they do not restrict their searches to looking only at the primary diagnosis in their analytical projects. Whether or not that's always good science is a worthy question. However, often they're looking for side effects and co-mortalities, so they definitely do not want to limit their attention only to the primary diagnosis.

--eric

ejb11235

Posts : 10
Join date : 2010-06-15

View user profile

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  davidjlutz on Thu Jun 17, 2010 6:12 pm

Here's a suggestion. (I wish I could just draw an ERD for this ...)

Use three tables. Call them whatever makes sense to you. But I would say that you must keep (or build) a diagnosis code dimension. The question, as I see it, is how best to store the metrics. That requires two tables.

Definitely DON'T have separate columns for each diagnosis code in any table. Have one column in the diagnosis dimension for the unique codes. (You may also have related surrogates for those but let's ignore that for now - it's at the next lower level of modeling.) And don't have any columns in your hospital stays fact table for diagnosis - not even for primary diagnosis.

Your fact table will contain only the data related to the hospital stay - less diagnosis codes - with a unique identifier. If the fact table is even a little wide, and its probably very wide, my following suggestion (to have separate records for each diagnosis code) would probably not suit you well if you tried to implement it in a single fact table.

Then have a very simple (read: narrow) table in the middle - I used to refer to these as "zipper" tables as they zipped the facts to the dimensions (to resolve the many-to-many nature of the relationship) - that contains the hospital stay identifier and a single diagnosis code: enter one record for each diagnosis code related to that stay. Perhaps you could add another field designated for primary diagnosis or any other designation that would be used in your queries, reports and analyses.

The downside to this is that you have now added a third table to be joined. And it will be deeper than the main fact table but very narrow.

Because you are using Infobright's columnar database, this table will also benefit greatly from the compression of (mostly) sequential hospital stay identifiers and frequently repeating diagnosis codes.

The main key to adequate performance with this model, and with Infobright, will be to apply relevant filters to the "largest" table - that table from which you will select the most column element values - in the specifics of the query. For example, five columns of only 10,000 records from the fact table is more than 2 columns of 20,000 records from the "bridge" table. So apply filters to the fact table, if possible, in this (and most) examples.

davidjlutz

Posts : 1
Join date : 2010-06-17

View user profile

Back to top Go down

bridge table versus zipper table

Post  ejb11235 on Thu Jun 17, 2010 7:41 pm

Let's see if I understand the difference between what I am calling the bridge table and your suggestion of the zipper table. Both use an intermediate table between the fact table and the diagnosis table in order to implement the many-to-many relationship between facts (hospital stays) and diagnosis codes.

The bridge table and the zipper table (if I understand your suggestion) both have two columns, and differ only in one of those columns.

FactTable--BRiDGE
Dimension1Key
Dimension2Key
DiagnosisGroupID

BridgeTable
DiagnosisGroupID
DiagnosisDimensionKey

FactTable--ZIPPER
FactKey
Dimension1Key
Dimension2Key

ZipperTable
FactKey
DiagnosisDimensionKey

The bridge table approach requires a lot more processing to load, because it attempts to have only one diagnosis group for each unique combination of diagnosis codes. But the advantage is that this cuts down on the number of rows in the table. Over time one would expect the bridge table to grow at a slower pace. The ZipperTable contains a record for each diagnosis code assigned to each fact. It is far easier to construct but will grow over time at the same pace as the fact table. I think because of this, the BridgeTable approach would provide better performance, at the cost of a more complicated ETL process.

ejb11235

Posts : 10
Join date : 2010-06-15

View user profile

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  LAndrews on Thu Jun 17, 2010 8:52 pm

From what I can interpret, what you are looking for is the ability to obtain a count of distinct hospital stays based on a dynamic set of diagnosis codes.

I think the traditional bridge model would work if the filter was only a single diagnosis code. For a specific hospital stay, a single diagnosis code will only have a single record in the bridge table --> therefore no double counting issue.

The challenge is, when multiple diagnosis codes are selected, there is a possibility of multiple bridge table records existing for a given hospital stay, resulting in the results you were seeing.

Your sub-query was doing the distinct logic for you , ensuring the database only joined your resultset to the fact table once for each stay.

Performance of this type of query will probably depend on both the database structures (indexes, partitions etc) as well as the query tool being used.

The zipper table is effectively a new fact table, with the grain being diagnosis per hospital stay. Given the selection of multiple diagnosis, you are still going to require distinct logic to get the desired results.

LAndrews

Posts : 132
Join date : 2010-05-13
Location : British Columbia, Canada

View user profile

Back to top Go down

Diagnosis Group Bridge Table

Post  Alan Musnikow on Thu Jun 17, 2010 11:13 pm

A Diagnosis Group Bridge table is recommended by Ralph Kimball and Margy Ross on pages 262 to 265 of The Data Warehouse Toolkit Second Edition.
Despite the need to avoid double-counting, which is well handled toward the end of the original posting, and the less than ideal query response time, I do not know of a better alternative to the bridge table.
I do not think ejb11235 is "misusing the bridge table concept" in any way.
avatar
Alan Musnikow

Posts : 6
Join date : 2010-06-17
Location : Lexington, Massachusetts, U.S.

View user profile http://musnikow.com/

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  ejb11235 on Fri Jun 18, 2010 12:52 pm

At this point I will be going with the bridge table. Although the diagnosis code selection portion of the query significantly slows my response time, I can run that sub-query once and place the results in a temporary table which is then joined to the fact table. This will keep my tables and metadata simple and keep me from having heroic ETL processes. Although the initial query (the diagnosis selection portion) will take longer than I would like it to, I can design the user interface in such a way that hides this. For example, instead of simply allowing the user to enter a comma-delimited list of diagnosis codes, I can have them enter them in one by one ... each time I can display the number of stays in the database having that diagnosis code. In other words, I will distract the user with bright shiny objects but my real intent is to keep them from having to sit on their hands for a minute while the query runs! Once the diagnosis code selection results have been stored in the temporary table, subsequent queries into the fact table will be rapid.

This has been my first serious foray into dimensional modelling, and I didn't want to make a newbie's mistake of missing a good solution for lack of asking. Thanks for your help.

--eric

ejb11235

Posts : 10
Join date : 2010-06-15

View user profile

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  ngalemmo on Mon Jun 21, 2010 12:05 pm

avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  ejb11235 on Wed Jun 23, 2010 12:40 pm

Thanks for the link ... it's also good to know there are other "hey I thought I was kind of smart but this is confusing me" people out there who thought the star schema was the solver-of-all-problems and would eliminate all complex queries.

ejb11235

Posts : 10
Join date : 2010-06-15

View user profile

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  larryp7639 on Mon Jul 05, 2010 4:54 am

LAndrews wrote:

I think the traditional bridge model would work if the filter was only a single diagnosis code. For a specific hospital stay, a single diagnosis code will only have a single record in the bridge table --> therefore no double counting issue.


I also think so.


__________________
Watch Predators Online Free

larryp7639

Posts : 1
Join date : 2010-07-05

View user profile

Back to top Go down

Re: Bridge tables versus massive junk dimensions

Post  Sponsored content


Sponsored content


Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum