Conformed Dimension Question

BICorvette Tue Apr 08, 2014 9:29 pm

Hello,

I am fairly new to dimensional modeling using the Kimball method and would like to obtain your opinions on a specific modeling scenario I am encountering.

I have 2 business processes that I have derived from the enterprise business requirement findings stage of my project which correlate to the publishing of policy
premium related data for automobile insurance policies.

These 2 business processes are listed below.

1. Policy Premium Transactions
2. Policy Premium Periodic Snapshot

With this said, the underlying data source that I will be referencing to build this DW/BI solution consists of a flat file derived from a mainframe system.

This flat file (+300 million records) is at the transaction level of granularity and contains a raw “dump” of each transaction that occurs on a policy as it happens.

This would imply that each record contained within this flat file would contain the following data:

1. The amount of the policy transaction (total premium)
2. The type of transaction (cancellation, renewal, change, etc..)
3. Many dimension attributes tied to the transaction such as:
a. Policy holder information (age, gender, marital status, etc..)
b. The insurance program
c. Coverage limits information
d. Insured vehicle information such as the make, model, and year of the vehicle.
e. Location of the primary insured and vehicle

It would be important to note that no "customer number" exists from the source system, and so one would not be able to truly identify if a "John Smith" from policy
"A" written during CY 2005 for example is the same "John Smith" that canceled coverage on this policy and then later returned as a policy holder to policy "B" during
2007.

The source system simply "dumps" all dimensional attributes that are present on a policy to this file as transactions are processed during the lifecycle of a policy.

As such, if there are 4 different transactions that are processed during the lifecycle of a policy it would be possible for the majority of all dimensional attributes
present within each of the 4 transactions to be the same.

This can be illustrated below:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------
|Policy # | Insured Name | Insured State | Term Eff Date | Term Exp Date | Coverage | Coverage Limit | Trans Type | Trans Amount| Trans Date
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
ABC123 John Smith CA 1/1/2012 1/1/2013 Collision $40,000.00 Renewal $93.12 1/1/2012
ABC123 John Smith CA 1/1/2012 1/1/2013 Collision $50,000.00 Change $50.23 1/9/2012
ABC123 John Smith CA 1/1/2012 1/1/2013 Collision $80,000.00 Change $70.23 8/5/2012
ABC123 John Smith CA 1/1/2012 1/1/2013 Collision $40,000.00 Change ($30.23) 9/8/2012

As shown above, there were 4 transactions that were issued.

Transaction #1 was a renewal transaction where the insured renewed his policy.
Transaction #2 entailed the insured increasing the policy limits of their collision coverage to $50,000.00
Transaction #3 entailed the insured increasing the policy limits of their collision coverage to $80,000.00
Transaction #4 entailed the insured decreasing the policy limits of their collision coverage back to $40,000.00

As illustrated within the abbreviated example above, no unique customer number is brought in from the source system. Again, it would seem that the source system
simply "dumps" all attributes and facts tied to each transaction into a single transaction record as illustrated above. These attributes are then repeated again and
again for each transaction regardless if they do not change...

With the information above in mind, I ask myself how I would construct conformed dimensions using this source transactional data.

I also ask how SCDs would be handled in this scenario such as when the insured moves, changes marital status, or other events that would trigger changes in
dimensional attributes.

With this said, I am proposing the following approach used to construct and maintain conformed dimensions based on this transactional source data:

1. Each conformed dimension will be built and maintained by selecting all distinct combinations of attributes from the source transactional flat file.
a. For example, for the [Policy Holder] dimension, I would select all distinct combinations of attributes that make up the dimension from the flat file and use these distinct values to build the dimension.
  b. I would then assign a SHAH512 hash value to each row within the dimension that would be built off of each attribute value within the dimension.
  c. When new data arrives, I would simply compare the SHAH512 hash value derived from the new incoming data to the existing data contained within the dimension and add or discard the record as needed.
   d. I am thinking that type 2 SCDs would not be applicable in this scenario as if the customer changes his/her address or gets married for example, the updated attribute values would constitute a new dimension record within the [Policy Holder] dimension, with the new dimension record being associated to the new fact record.

Is this an acceptable/correct approach to take when modeling conformed dimensions based on raw data derived from a transaction grained flat file?
Am I correct in assuming that due to the transactional nature of the data, no type 2 SCD scenarios would exist as they would be handled with each subsequent
transaction that is processed?

I greatly appreciate your thoughts on this.

Take care,

nick_white Wed Apr 09, 2014 3:25 am

It sounds like the transactional system not having a customer number is a bit of a flaw in it's design - do you really not know if the same person owns multiple policies? Assuming redesigning your source system is out of scope for this discussion ...

Is it correct to assume that the owner of a policy cannot change? i.e. if a policy was owned by person A it can't subsequently be owned by person B? If this is the case then you can identify individuals via their policy number and use this as part of your SCD logic. Obviously this doesn't help you de-dupe your policyholders across policies.

However, if your business cannot identify individuals who hold multiple policies then it might be worth asking the question why you need the policyholder dimension at all. Is there actually a requirement to know a policy is held by a 'John Smith' when you have 1000s of John Smiths in your system with no way of knowing which refer to the same person - what useful reports can this information generate?. Why not pick up the attributes that do give you useful, reportable, data and put them into relevant dimension(s) e.g. title, gender, marital status, zipcode, city, state, etc.

BICorvette Wed Apr 09, 2014 4:27 pm

Thank you for your reply Nick.

I agree that the transactional system is flawed by not having a customer number.

I can also agree with the question pertaining to the true business value that an insured name would provide for business intelligence purposes. Perhaps this sort of lookup would be better left for the point of entry transactional OLTP system.

I still would like to validate that my approach for building conformed dimensions is correct.

For example, please consider the following [Coverage] dimension I am proposing:

[Coverage]
----------------------------
Coverage Code
Coverage Description
Coverage Limit 1
Coverage Limit 2
Deductible Amount
Endorsement Code
Endorsement Description
----------------------------

Should the dimension above be built based on selecting all unique value combinations of these attributes from the source transactional data that would encompass the dimension using the approach I outlined earlier?
Or should I have one row of data contained within the dimension per policy number knowing that duplicate attribute values would exist within the dimension?

I have outlined both of these approaches below:

Source Data:

Conformed Dimension Question Data_e10

Based on this data above, one can see that the same exact attribute values are repeated for different, unique policy numbers.

Approach #1: Selecting All Unique Attribute Values

Using this approach, I would select a distinct list of attribute values from the source data and condense the values into a unique list of dimension members as shown below:

Conformed Dimension Question Approa10

As shown above, the duplicate values from the source data are condensed into single unique rows within the dimension table.

Type 2 SCD scenarios would be implicitly handled due to the transactional system feeding the dimension updated, unique values which would get propagated into the dimension, and used against new facts.

Approach #2: Duplicate Values Using Business Key

This approach would establish a natural/business key off of the [Policy Number] and [Vehicle Number] columns and would maintain one record within the resulting [Coverage] dimension per unique [Policy Number] and [Vehicle Number] value combination.

This approach is illustrated below:

Conformed Dimension Question Approa11

This approach does duplicate dimensional attribute values across many rows within the dimension and would grow very large if many policies and vehicles write the same coverages. I would also assume that type 2 SCDs would be difficult if not impossible to implement using this approach as if a new coverage is instituted in the source system with new [Coverage Limit] values for instance, a new record would get inserted into the [Coverage Dimension] , leaving existing rows "as is".

I'm thinking my first approach above would prove beneficial as the dimension itself would remain compact and would truly reflect distinct values that could be tied to facts. Unless I am mistaken, I would also think that SCDs would be non existent (at least for this dimension) due to any new dimensional attribute values being found within new transaction records being inserted into the [Coverage Dimension] as new rows.

Which approach would you recommend for modeling this "Coverage" oriented dimension, as well as other dimensions I would create off of this unstructured transactional data?

I greatly appreciate your input.

Thanks!

nick_white Thu Apr 10, 2014 11:49 am

As a set of principles, Dimension values should be unique and a Dimension should be a self-contained list of attributes => Approach 1 is probably best: create a record for each unique combination of attributes and don't start including things like policy keys in it. Remember - Dims are linked via Fact tables not directly to each other (ignoring bridge tables, snowflakes and all the other exceptions to this rule).

Things you might want to consider (not necessarily for this particular case, more general points):
- if the number of records in this Dim is too large (and it's your judgement call of what too large means) then probably some of the attributes are not correlated to each other and you should split them into separate Dims containing only those attributes that do have a correlation
- if you can pre-determine all the possible combinations of attributes then the best approach is probably to populate the Dim in a one-off load (plus any subsequent maintenance that is required). But if this results in a large number of records with a significant proportion that will never get used (because they are a theoretical combination rather than one that would occur in the real world) you may want to reconsider and use the approach in the next point
- if you can't pre-determine the combinations you will have to load the Dim as part of your transactional load (does the combination of attributes on this record match a record I already have in the Dim? Y - use that one; N - create a new record in the Dim). This has performance implications and obviously you have to create the Dim record, and know its Surrogate Key, before you can load the fact record - so it all takes a bit more thought and planning

Hope this helps your thought processes a bit more

BoxesAndLines Thu Apr 10, 2014 4:05 pm

Typically, Policies and Coverages have a large number of attributes which make a junk dimension (distinct rows approach) not feasible. If your data is limited to this example then your approach should work fine.

BICorvette Thu Apr 10, 2014 10:25 pm

Thank you Nick and BoxesAndLines for your quick and informative replies, it is very much appreciated.

I will proceed with my first outlined approach (distinct rows approach) keeping the design concepts and recommendations you outlined in mind.

Thanks again!

Sponsored content

Conformed Dimension Question

Conformed Dimension Question

Re: Conformed Dimension Question

Re: Conformed Dimension Question

Re: Conformed Dimension Question

Re: Conformed Dimension Question

Re: Conformed Dimension Question

Re: Conformed Dimension Question