Kimball Forum

Data lookup




Post by Upen on Mon Oct 24, 2011 1:12 pm

In my raw data I have 380+ million records, and the primary key consists of 4 alphanumeric columns, each 32 bits wide. I get about 2 million new records every day, and I need to be able to look up whether any of the new records already exist among the 380 million, within a very short processing window. One solution I thought of is to generate a CRC on the 4 primary key columns and use the new CRC column as the lookup, but I understand that Informatica's CRC32 can generate the same CRC for different combinations. Any thoughts or suggestions would be greatly appreciated. Thank you in advance.

Re: Data lookup

Post by ngalemmo on Mon Oct 24, 2011 3:07 pm

CRC32 is an industry-standard algorithm that produces a 32-bit hash value of a string. Since it is 32 bits, there are only a little over 4 billion possible values. Given the intended application, there will be collisions (duplicate values for different strings), and far more frequently than the numbers seem to indicate: by the birthday paradox, the odds of at least one collision pass 50% at only about 77,000 distinct keys, and at 380 million keys you can expect millions of colliding pairs.
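
A minimal sketch of that birthday effect, using Python's zlib.crc32 as a stand-in for Informatica's CRC32 (the random 32-character keys are made up for illustration):

    import os
    import zlib

    # Generate random 32-character keys until two of them share a CRC32.
    # The birthday bound sqrt(2 * ln 2 * 2**32) predicts ~77,000 keys.
    seen = {}
    count = 0
    while True:
        key = os.urandom(16).hex()          # stand-in for the 4-column key
        crc = zlib.crc32(key.encode())
        if crc in seen and seen[crc] != key:
            print(f"first collision after {count} keys")
            break
        seen[crc] = key
        count += 1

Run it a few times; it typically stops in the tens of thousands of keys, nowhere near 4 billion.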

However, in your situation, using a 32-bit CRC is a good way to cut down the size of your index and possibly speed up lookups. Define an index on the CRC and allow duplicates. Queries should select on both the CRC and the columns that make up the natural key, but do not include the natural key columns in the index... they just bloat the index and slow things down. The CRC will get you down to a handful of rows, probably 4 at most, which can then be scanned sequentially.
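
As a rough illustration of that pattern, here is a minimal Python sketch that models the CRC index as a map from CRC to the few natural keys that share it (the function names and key values are hypothetical, not Informatica APIs):

    import zlib
    from collections import defaultdict

    def crc_of(key_cols):
        # key_cols: tuple of the 4 alphanumeric key columns
        return zlib.crc32("|".join(key_cols).encode())

    index = defaultdict(list)               # CRC -> natural keys sharing it

    def load(existing_keys):
        for key in existing_keys:
            index[crc_of(key)].append(key)  # duplicates allowed, as in the index

    def exists(key):
        # The CRC narrows the search to a handful of candidates,
        # then the natural key columns break any ties.
        return key in index[crc_of(key)]

    load([("A1", "B2", "C3", "D4")])
    print(exists(("A1", "B2", "C3", "D4")))     # True
    print(exists(("A1", "B2", "C3", "XX")))     # False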

But if your intent was to use a CRC as a replacement for the natural key columns and use it in a lookup cache within INFA, it won't work. It is a statistical certainty that you will have duplicate CRC values (in fact, duplicates are very likely in much smaller populations, as noted above). The only hash option that makes a duplicate highly unlikely is a large hash. There are algorithms, such as SHA-1, that produce 160-bit hash values, which would do the trick, but at 20 bytes each it is probably too large to make caching 380M+ values practical in INFA.
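
For reference, a minimal sketch of the 160-bit option using Python's hashlib.sha1 (the delimiter and key values are made up):

    import hashlib

    key_cols = ("A1", "B2", "C3", "D4")
    digest = hashlib.sha1("|".join(key_cols).encode()).digest()
    print(len(digest))      # 20 bytes = 160 bits per cached key
    print(digest.hex())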
