how to identify and remove faulty records from dimension and facts
3 posters
Page 1 of 1
how to identify and remove faulty records from dimension and facts
There is a difference of records between source table and the dimension table. From that we found that this is either due to the record may be not present in source as it may be deleted in the source. As source tables are loaded with full load truncating the table and dimension table are loaded with incremental
load.So if there is a duplicate record in dimension table we can delete the record but how come we can update the sids either in dimension, even the same sid being using in fact, so how to handle the effect or update the sid in both dimensions and facts.And at last how to cleanse the faulty records.
Thanks
Rama
load.So if there is a duplicate record in dimension table we can delete the record but how come we can update the sids either in dimension, even the same sid being using in fact, so how to handle the effect or update the sid in both dimensions and facts.And at last how to cleanse the faulty records.
Thanks
Rama
ramakrishna- Posts : 5
Join date : 2011-09-07
Age : 35
Location : chennai
Re: how to identify and remove faulty records from dimension and facts
As source tables are loaded with full load truncating the table
I assume you are taking about staging the source data.
So if there is a duplicate record in dimension table we can delete the record
Why would you do that? Why not simply update the dimension row?
Re: how to identify and remove faulty records from dimension and facts
why would you do that? Why not simply update the dimension row?
as i said that we are not getting the source and target counts of the dimension equal. so for that we found the records that are prsent in target but not in source, so we want to delete those records. so is it good to delete or how to solve the above thing to get equal counts.
thanks for ur reply
as i said that we are not getting the source and target counts of the dimension equal. so for that we found the records that are prsent in target but not in source, so we want to delete those records. so is it good to delete or how to solve the above thing to get equal counts.
thanks for ur reply
ramakrishna- Posts : 5
Join date : 2011-09-07
Age : 35
Location : chennai
Re: how to identify and remove faulty records from dimension and facts
Deleting rows from the dimension table? You never delete rows from a dimension table (unless you are carefully purging old data from DW), otherwise you lose referential integrity with the facts.
You avoid the problem you are trying to solve by not deleting dimension rows. SIDs don't change.
So if there is a duplicate record in dimension table we can delete the record but how come we can update the sids either in dimension, even the same sid being using in fact, so how to handle the effect or update the sid in both dimensions and facts.
You avoid the problem you are trying to solve by not deleting dimension rows. SIDs don't change.
Re: how to identify and remove faulty records from dimension and facts
thanks for ur reply
ramakrishna- Posts : 5
Join date : 2011-09-07
Age : 35
Location : chennai
An example
Could you please be more precise?
Let’s say that I have a contact SCD2 dimension.
ID (surrogate)
Address
Valid_From
Valid_To
Is_Current_Flag
And I want to track record deletion as well. As I understood this can be done via additional attributes like Is_Deleted_Flag and if necessary Delete_Date.
So let’s say that I have (history) situation like this one:
ID Address Valid_From Valid_To Is_Current_Flag Is_Deleted_Flag Delete_Date
1 Palm Street 01/25/2012 01/27/2012 N N null
2 Baker Street 01/28/2012 01/01/3000 Y N null
Deletion on production is performed on 02/15/2012. Did I understand well that the second record will end up like
2 Baker Street 01/28/2012 01/01/3000 Y Y 02/15/2012
?
Thanks in advance,
Zoran
Let’s say that I have a contact SCD2 dimension.
ID (surrogate)
Address
Valid_From
Valid_To
Is_Current_Flag
And I want to track record deletion as well. As I understood this can be done via additional attributes like Is_Deleted_Flag and if necessary Delete_Date.
So let’s say that I have (history) situation like this one:
ID Address Valid_From Valid_To Is_Current_Flag Is_Deleted_Flag Delete_Date
1 Palm Street 01/25/2012 01/27/2012 N N null
2 Baker Street 01/28/2012 01/01/3000 Y N null
Deletion on production is performed on 02/15/2012. Did I understand well that the second record will end up like
2 Baker Street 01/28/2012 01/01/3000 Y Y 02/15/2012
?
Thanks in advance,
Zoran
Zoran Milokanović- Posts : 3
Join date : 2012-03-20
Re: how to identify and remove faulty records from dimension and facts
If you are going to have a delete flag, why not set the valid to date to the delete date, as the row would no longer be valid? You really don't need the extra column.
Re: how to identify and remove faulty records from dimension and facts
ngalemmo wrote:If you are going to have a delete flag, why not set the valid to date to the delete date, as the row would no longer be valid? You really don't need the extra column.
Yeah, this is a really good question that I even asked myself. ;DD
First of all I accidently did a mistake posting a question here instead of inside topic “Question on deleting records…”
http://forum.kimballgroup.com/t1181-question-on-deleting-records-from-dimension-tables
Anyway, the thing is that I didn’t find in any of Kimball books that I have at home how to deal with (source) record deletion when having a dimension with valid from/to attributes (the one that have at least one SCD 2 attribute).
With SCD 1 dimension (the one with all SCD 1 attributes) I found few suggestions that include additional attribute to mark record (in)activity. And If I want to know exact dateTime when record became inactive or deleted that would mean one more attribute as well (Delete_Date).
With, let’s call it, SCD 2 dimension I found zero suggestions and after reading the conversation from link above I made a wrong conclusion and asked a question in order to clarify it. So, my question would be… How to handle it in SCD2 dimensions?
Common sense says exactly what you suggested….that Valid_To has to be set to delete dateTime. I guess I would need attribute Is_Deleted_Flag to simply distinguish those records that are regularly closed with SCD2 attribute change from those that are deleted on the source side. Maybe attribute Change_Reason is also something good to have but let’s say that is less important. A small mystery could be what to do with Is_Current_Flag in case when record is deleted. I would say that it has to be set to value N and three reasons come up on my mind why. First, current records are those that I could currently find on the source and the one deleted I cannot find there. Second, current records are those with MAX DATE TIME value contained in Valid_To attribute and deleted record don’t satisfy that condition. Third, in general if I want to maintain SCD 2 attribute change I'd create a new record with next surrogate key and I'd try to find previously most current record in order to update it to N value and to update Valid_To attribute. Let's say that deleted record apears on the source after some time in the future. The record than was initially deleted is already properly set (delete time, current flag = 'N') and it would be wrong to handle it as a regular SCD 2 attribute change (second activation/entry date minus 1 second or 1 day, current flag = 'N'). Am I right?
Last edited by Zoran Milokanović on Wed Apr 18, 2012 5:03 am; edited 1 time in total
Zoran Milokanović- Posts : 3
Join date : 2012-03-20
Re: how to identify and remove faulty records from dimension and facts
In a typical type 2 update, the pattern is to expire the current row and create a new current row. Note, 'current row' is not the same thing as the last known row for a particular natural key. So, if you soft-delete a type 2 row by setting the expire date and making it no longer current (set current flag to false), the process to handle a future resurrection is the same as any other type 2 update. You would not find a 'current row' and would simply treat it as a new row with a new effective timestamp. There would be no update to the prior row. No special logic necessary.
As for having a delete flag to distinguish between updates and source system deletes, that's fine.
As for having a delete flag to distinguish between updates and source system deletes, that's fine.
Re: how to identify and remove faulty records from dimension and facts
Exactly that was my assumption. So, current records in dimension table (flag Y) represent current picture of the source system NOT the last known one, meaning that deleted row has to have current flag set to N. Many, many thanks. :DD
I just wanted to add that I understand why I didn’t find (much) articles about deleting records that are used as a source for SCD 2 dimension. In general this shouldn’t be possible due to referential integrity in a source system. However in practice, especially in a large (bank or telco) systems it happens during the regular DB maintenance where old records are backed up and purged. I know that in most cases this affects fact records but sometimes old services and products that are not part of current/active company portfolio could be purged as well. In that case by purging some of the records from lookup tables it affects corresponding dimension tables that already have those records.
I just wanted to add that I understand why I didn’t find (much) articles about deleting records that are used as a source for SCD 2 dimension. In general this shouldn’t be possible due to referential integrity in a source system. However in practice, especially in a large (bank or telco) systems it happens during the regular DB maintenance where old records are backed up and purged. I know that in most cases this affects fact records but sometimes old services and products that are not part of current/active company portfolio could be purged as well. In that case by purging some of the records from lookup tables it affects corresponding dimension tables that already have those records.
Zoran Milokanović- Posts : 3
Join date : 2012-03-20
Similar topics
» Identify the facts and facts grain
» Remove degenerate dimension?
» Dimension more records than fact
» Dimension with million of records - Performance on delivery
» Question on Deleting records from dimension tables
» Remove degenerate dimension?
» Dimension more records than fact
» Dimension with million of records - Performance on delivery
» Question on Deleting records from dimension tables
Page 1 of 1
Permissions in this forum:
You cannot reply to topics in this forum
|
|