Aggregates with Distinct

Maverick Sat Jun 05, 2010 12:27 pm

Hi all, I am currently wroking on a Star Schema [with few dimensions and one fact :-)] but ran into some issues regarding Unduplicated Aggregates.
Let me explain:
we have the following dimensions :
Ofices
Staff
Services
Time

Fact:
Total Customers Served [Unduplicated]
Total Services provided

Problem we have is the customer Id [which is a character id like AS001,AS002 etc..] and can be served by same staff with two different service types. So, when I do an aggregate at office,staff level, I need an unduplicated count of these customers which seems not possible with the way we usually do the aggregates. Did anyone ran into these kind of issues ? If so how did you correct?
We are using Oracle Database and Business Objects as presentation layer. So, If I put a numerical value[count (distinct customer_id)] for lowest aggregate level, and when uses roll it up to next level, it just adds up the numbers which includes duplicates. I hope I am able to explain it clearly..Please let me know if I need to add any more details to make it more clear..

BrianJarrett Sat Jun 05, 2010 1:09 pm

Most likely you'll have to drop service type if you want to roll up to a single row per customer. If you don't want to drop it all together you could roll it to a single service type (you could choose the last one, for example) or something similar. That might not have any value though, depends on your requirements.

In BO you'd then have to set your service type dimension as incompatible with your new higher level aggregate table (one row per customer, without service type). Your users would still be able to get the benefits of your aggregate table, as long as they don't need service type. If they do, then they'll have to go to the lower level fact table to get those numbers (one row per customer per service type). Of course you'll set up agg aware will do this for them. Then it's an education issue, letting them know they'll have to wait a bit for that level of data.

If you don't like that idea then you could also build a bridge table (because this new fact table causes a many to many between customer and service type) that would join your aggregated row (one per customer) to multiple service types. I normally avoid these; bridge tables are a real pain to maintain and require more work for end users when building reports. Plus you lose visibility as to exactly how your aggregated totals break out by service type.

Let me know if this helps.
Brian

Maverick Sat Jun 05, 2010 4:53 pm

Brian, thank you very much for this reply. I appreciate your help. However, service type is just an example. Same customer could be served in two different offices,as well. so when the report is rolled up for office level, it should not duplicate this customer. So, as per your suggestion, I have to drop staff and service as well. I could be having more than these three dimensions. This sample I mentioned is just that, a sample. it could be more complex.

Sorry!! If I misunderstood your response.

Regarding Bridge table solution, Can you explain little more in detail as to how can I get it done? You mentioned it's hard to maintian and other disadvantages..but just want to see how it helps in my situation.

ngalemmo Sat Jun 05, 2010 10:41 pm

The number of customers served is simply not an additive measure, and, other than create a variety of aggregates at different levels, their isn't much you can do about it. However, if customer is a dimension, you should not have a problem with count(distinct customer), other than query time. The trick with BOBJ is to not declare an aggregation method for the object. If you don't, and the user drills up and down the report, WEBI will re-query the source which should give you correct results. The downside is it can be pretty slow.

Maverick Sun Jun 06, 2010 12:32 am

Thanks for the reply. you might be correct in saying that there is nothing we can do about it other than creating aggregates for different levels [which is quite difficult if no. of dimensions and hirearchies for each dimension increases]. Problem in depending users to create count(distinct customer_id) is they are not that technically savy and are requesting us to make it simple. But I think I got the answer [which is ..It's not possible].
Thanks.

BrianJarrett Sun Jun 06, 2010 12:40 am

So the main issue here is fact table grain. The problem is that when you roll this to a single row per customer, there's no way you can include things like service type or office because they create a many to many situation. Since it doesn't obey the grain you have to drop it or come up with another crafty way to force it to work.

One way to get this to work is to create a bridge table. First, you create a surrogate key on your fact table (to get a single key representing your composite key). I'll call this "fact_key". Then create the bridge table with at least two columns; one that will hold "fact_key" and another that will hold your dimension's surrogate key (let's call this service_dim_key). So it would look something like this:

FACT_KEY
SERVICE_DIM_KEY

Now, let's say you have a fact row with three service types. In your low level table you'll have three rows. In your agg table you'll have one row. Your fact surrogate key is 145, your service type dim keys are 4586, 458, and 785. Your rows would look like this in your bridge table:

FACT_KEY|SERVICE_DIM_KEY
145|4586
145|458
145|785

The bridge table would sit between your fact table and your dimension table. Keep in mind that you'll return (in this case) 3 fact rows for each join over to the dimension. You'll need to handle that triplication of fact quantities/dollars, typically with a divisor column (which in this case would be 3 for each row).

The problem with this is that you lose certain granularity at the customer level. Let's use our example of 3 fact rows. Let's say you have a REVENUE column in that fact. For service type 4586 you made $5, for service type 458 you made $10, and for service type 785 you made $15. In the lower level fact table you'll be able to see that revenue, broken out by service type. Obviously you'll sum that at the customer level, resulting in revenue of $30. With our bridge table example we could still see the $30 total but we'd lose how it breaks down between each service type. We'd know that $30 in revenue belonged to three service types, but exactly how we don't know.

Another drawback to this is that you'll need to store a divisor (in this case 3, since there are three associated dimensions). A user must know to bring in the divisor column to remove the duplication. If they inner join fact to dimension using the bridge table in between they'll return 3 "duplicated" fact rows, one per distinct dimension, resulting in $90 in revenue. They'll then have to take your divisor column and fix this ($90 / 3 = $30).

So, to summarize, you can handle the many-to-many relationships with a bridge table but you'll lose visibility as to how your aggregated fact breaks out between your many dimensions.

Another approach might be to choose how many service types you'll support. You'd have to get the business to agree to how many different service types you'll support and then build your customer level aggregate with the same number of columns. Using our example above you'd build a fact table with columns like SERVICE_DIM1_KEY, SERVICE_DIM2_KEY, and SERVICE_DIM3_KEY. You'd then alias your service dimension three times, joining to the corresponding service dim surrogate foreign keys in your fact table. The problem with this is what happens if there are 4 service types? Or 5? You're now stuck with an inability to report them all and you'd have to dump any extra service types into an "unknown" or "other" bucket. Plus if you wanted to see "REVENUE" broken out by each service type you'd have to have a separate column for each and then sum across. Personally I'd avoid this scenario.

If it were me I'd build the aggregate and drop the dimensions that don't fit the grain. Set it up correspondingly in your universe with agg aware and incompatibilities, then let your users know that when they run queries at the service type level they're going to have to wait longer. Any other queries will hit your customer level aggregate. Bridge tables are a solution though, albeit more complex and difficult to maintain and use.

Sponsored content

Aggregates with Distinct

Aggregates with Distinct

Re: Aggregates with Distinct

May be my explanation was misleading ...

Re: Aggregates with Distinct

I wish there is something else we could do..but

Re: Aggregates with Distinct

Re: Aggregates with Distinct