NULLs in source lead to duplicates in Fact table
Page 1 of 1 • Share •
NULLs in source lead to duplicates in Fact table
Hi,
my historic source data has inconsistencies. I have a product with a 1:m to product_detail which has a 1:m to invoice which has a 1:m to invoice detail.
Invoice_detail is the grain of my fact table, but some of the historic data has nulls in the link from product_detail to invoice and invoice to invoice_detail.
My load process assigns a -1 (unknown member) to where these nulls occur, but this results in duplicate values in the invoice_detail attribute which, being the grain of the fact table, results in a PK violation.
I can't think how to resolve this conflict, any ideas?
Thanks, Martin
my historic source data has inconsistencies. I have a product with a 1:m to product_detail which has a 1:m to invoice which has a 1:m to invoice detail.
Invoice_detail is the grain of my fact table, but some of the historic data has nulls in the link from product_detail to invoice and invoice to invoice_detail.
My load process assigns a -1 (unknown member) to where these nulls occur, but this results in duplicate values in the invoice_detail attribute which, being the grain of the fact table, results in a PK violation.
I can't think how to resolve this conflict, any ideas?
Thanks, Martin

mselway- Posts: 4
Join date: 2009-10-14
Location: London, England

Re: NULLs in source lead to duplicates in Fact table
Don't enforce PK contraints. You don't need them.

ngalemmo- Posts: 557
Join date: 2009-05-16
Location: Los Angeles

Re: NULLs in source lead to duplicates in Fact table
You can add an error flag or even go more verbose with the audit dimension to flag those rows that should be omitted from reporting.

BoxesAndLines- Posts: 231
Join date: 2009-02-04
Re: NULLs in source lead to duplicates in Fact table
My read from the description is that the facts themselves are not duplicates, but rather because one or more of the foreign keys default to 'unknown' it causes duplicate primary key errors. So, it becomes an issue as to weither you should enforce PK constraints.
The fact that they reference an 'unknown' dimension row should be sufficient for reporting.
The fact that they reference an 'unknown' dimension row should be sufficient for reporting.

ngalemmo- Posts: 557
Join date: 2009-05-16
Location: Los Angeles

Re: NULLs in source lead to duplicates in Fact table
Thanks for your input.
My problem is... if I don't enforce PK integrity the 'unknown' member in the invoice table (approx 10k rows) performs a left outer join with the invoice_detail unknown rows (approx 1000) and I end up with a cartesian product (10k * 1k = 1m rows).
I can flag this in the error_fact, but I'd rather avoid it. If the Invoice did not have nulls and referenced a single invoice_detail unknown member (with a zero amount) the fact table makes sense. By excluding them I miss 10k rows of valuable product and product_detail data.
The relationship between product_detail and invoice is the product_detail_id in the invoice table, I could create these missing invoice rows in my staging database which would allow the load process to work without either missing data or generating too much.
Can you see any downside to this approach?
My problem is... if I don't enforce PK integrity the 'unknown' member in the invoice table (approx 10k rows) performs a left outer join with the invoice_detail unknown rows (approx 1000) and I end up with a cartesian product (10k * 1k = 1m rows).
I can flag this in the error_fact, but I'd rather avoid it. If the Invoice did not have nulls and referenced a single invoice_detail unknown member (with a zero amount) the fact table makes sense. By excluding them I miss 10k rows of valuable product and product_detail data.
The relationship between product_detail and invoice is the product_detail_id in the invoice table, I could create these missing invoice rows in my staging database which would allow the load process to work without either missing data or generating too much.
Can you see any downside to this approach?

mselway- Posts: 4
Join date: 2009-10-14
Location: London, England

Re: NULLs in source lead to duplicates in Fact table
I have a product with a 1:m to product_detail which has a 1:m to invoice which has a 1:m to invoice detail.
Invoice_detail is the grain of my fact table, but some of the historic data has nulls in the link from product_detail to invoice and invoice to invoice_detail.
Looking at your first post again, I am a bit confused with the relationships. What does "Invoice" represent? Is it a line or a header? If it is the latter, it shouldn't have any relationship to product. If it is a line, and "invoice detail" contains account distribution information for a line, the detail table should have assumed all the dimensional keys of invoice (in additon to others it may need), so there would not be a need to join invoice and invoice detail. But even if you do, the two fact tables should be queried separately, summarized and then joined on all common dimensions (or you can do a union), there should not be a cartesian product.
I am also assuming you are using surrogate keys. The 'unknown' dimensions should reference a row in the appropriate dimension tables. No FK should be null.

ngalemmo- Posts: 557
Join date: 2009-05-16
Location: Los Angeles

Permissions of this forum:
You cannot reply to topics in this forum





