ETL design and performance questions

View previous topic View next topic Go down

ETL design and performance questions

Post  juanvg1972 on Tue May 05, 2015 12:32 pm

I have some questions about ETL design and process optimization:

1) CDC systems: I know CDC systems used in ETL process that read the database log (for example Oracle redolog) in order to get the changes and create a input to the ETL process, I suppose there are more CDCsystems, which is the best one?, whar are the differences?

2) indexes: I have read that when you are making a load to database in a ETL process is better to disable indexes and enable it after loading process. is right?, is this the best practice?. I think the enable process cant take long....isn't it??

3) Foreign keys: I have read is better to validate integrity between dimensions key and fact table in the code of the process not with database constraints for performance reasons..is it rigth??

4) I have read is better to use sql native connections from ETL to Source database, in terms of performance, any other best practice in connectors?

5) Working in-memory in ETL: Can be problems doing ETL steps in memory with large datasets?, which is the best option?

6) ETL vs ELT: I have always thougth that is better to do the ETL work in the ETL tool than in database, but I know that there are ELT architectures like Oracle Data Integrator, which is the best options? advantages and disadvantages of each one?

7) If I want to define restart points in the ETL process, how can I do?, I have long ETL process and in case of errors I don't want to execute all the processes then I wnat to save an intermediate table to continue. is that possible in PowerCenter? ant other option?

datawarehouse appliance: is a server with configurations oriented to dwh processes??, what special configuration is this??, an example..teradata? netezza?

Thanks in advance

juanvg1972

Posts : 25
Join date : 2015-05-05

View user profile

Back to top Go down

Re: ETL design and performance questions

Post  ngalemmo on Tue May 05, 2015 4:45 pm

#2 - Disabling indexes during load is a common practice, but it depends on the volume of data you are loading. This is usually done with fact tables, but rarely with dimensions since the load volumes (and the creation of new rows) is not that great. What to do is a matter of testing to see which saves time.

#3 - Yes. Enforcing referential integrity in the database is redundant if you have a process in place to assign surrogate keys. It isn't necessary.

#4 - That is a matter of the particular DBMS you are using. Sometimes it makes a difference, sometimes it does not. Generally if they provide a native driver, you should use it.

#6 - ELT is useful when dealing with very large data volumes. Downside is it puts a load on the DBMS which could affect user response times. ETL is fine for moderate volumes (most typical applications). They typically cache lookups and such and can perform very well.

#8 - I can speak on Netezza. It is Postgres based, but they stripped it down and do not support indexes or referential integrity. The transaction model supports serialization only. It is an MPP system, can handle huge data volumes and performs really well.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum