NULLs in source lead to duplicates in Fact table

View previous topic View next topic Go down

NULLs in source lead to duplicates in Fact table

Post  mselway on Fri Mar 19, 2010 5:39 am

Hi,
my historic source data has inconsistencies. I have a product with a 1:m to product_detail which has a 1:m to invoice which has a 1:m to invoice detail.
Invoice_detail is the grain of my fact table, but some of the historic data has nulls in the link from product_detail to invoice and invoice to invoice_detail.
My load process assigns a -1 (unknown member) to where these nulls occur, but this results in duplicate values in the invoice_detail attribute which, being the grain of the fact table, results in a PK violation.
I can't think how to resolve this conflict, any ideas?

Thanks, Martin
avatar
mselway

Posts : 6
Join date : 2009-10-14
Location : London, England

View user profile http://www.swiftcover.com

Back to top Go down

Re: NULLs in source lead to duplicates in Fact table

Post  ngalemmo on Fri Mar 19, 2010 11:45 am

Don't enforce PK contraints. You don't need them.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: NULLs in source lead to duplicates in Fact table

Post  BoxesAndLines on Fri Mar 19, 2010 3:33 pm

You can add an error flag or even go more verbose with the audit dimension to flag those rows that should be omitted from reporting.
avatar
BoxesAndLines

Posts : 1212
Join date : 2009-02-03
Location : USA

View user profile

Back to top Go down

Re: NULLs in source lead to duplicates in Fact table

Post  ngalemmo on Fri Mar 19, 2010 4:26 pm

My read from the description is that the facts themselves are not duplicates, but rather because one or more of the foreign keys default to 'unknown' it causes duplicate primary key errors. So, it becomes an issue as to weither you should enforce PK constraints.

The fact that they reference an 'unknown' dimension row should be sufficient for reporting.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: NULLs in source lead to duplicates in Fact table

Post  mselway on Sun Mar 21, 2010 4:53 pm

Thanks for your input.
My problem is... if I don't enforce PK integrity the 'unknown' member in the invoice table (approx 10k rows) performs a left outer join with the invoice_detail unknown rows (approx 1000) and I end up with a cartesian product (10k * 1k = 1m rows).
I can flag this in the error_fact, but I'd rather avoid it. If the Invoice did not have nulls and referenced a single invoice_detail unknown member (with a zero amount) the fact table makes sense. By excluding them I miss 10k rows of valuable product and product_detail data.
The relationship between product_detail and invoice is the product_detail_id in the invoice table, I could create these missing invoice rows in my staging database which would allow the load process to work without either missing data or generating too much.
Can you see any downside to this approach?
avatar
mselway

Posts : 6
Join date : 2009-10-14
Location : London, England

View user profile http://www.swiftcover.com

Back to top Go down

Re: NULLs in source lead to duplicates in Fact table

Post  ngalemmo on Mon Mar 22, 2010 11:45 am

I have a product with a 1:m to product_detail which has a 1:m to invoice which has a 1:m to invoice detail.
Invoice_detail is the grain of my fact table, but some of the historic data has nulls in the link from product_detail to invoice and invoice to invoice_detail.

Looking at your first post again, I am a bit confused with the relationships. What does "Invoice" represent? Is it a line or a header? If it is the latter, it shouldn't have any relationship to product. If it is a line, and "invoice detail" contains account distribution information for a line, the detail table should have assumed all the dimensional keys of invoice (in additon to others it may need), so there would not be a need to join invoice and invoice detail. But even if you do, the two fact tables should be queried separately, summarized and then joined on all common dimensions (or you can do a union), there should not be a cartesian product.

I am also assuming you are using surrogate keys. The 'unknown' dimensions should reference a row in the appropriate dimension tables. No FK should be null.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: NULLs in source lead to duplicates in Fact table

Post  Sponsored content


Sponsored content


Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum