ETL: process row by row with pipeline between steps or process the whole datasets in every step?

View previous topic View next topic Go down

ETL: process row by row with pipeline between steps or process the whole datasets in every step?

Post  juanvg1972 on Wed Jun 17, 2015 4:06 pm

I have a doubt about ETL process performance.

I have always believed that the better way to process a dataset in a ETL process in term of performance is to process the whole dataset in every step and pass the completed and transformed dataset to the next step of the flow.

I have worked like that in SAS (typical sas data steps), but working in Pentaho DI, I can see that the rows flow to the next setps without being completed the whole dataset, is that good for performance?, in that case the datasets is being processed row by row and the processes rows sent to the next step.

I know that there is a limit in the row by row processing because some steps need the whole dataset to be processed.

I want to know what is the best way to process.

Thanks in advance

juanvg1972

Posts : 25
Join date : 2015-05-05

View user profile

Back to top Go down

Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?

Post  ngalemmo on Wed Jun 17, 2015 4:38 pm

SAS is not an ETL tool. It can be used for ETL, but so can a lot of things.

The most expensive operation any process can perform is I/O to disk, writing being worse than reading. So, purpose built ETL tools are optimized to do ETL. SAS isn't.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

row by row of whole dataset in every step

Post  juanvg1972 on Wed Jun 17, 2015 4:45 pm

Thanks ngalemmo,

I know SAS is not an ETL tool, although SAS/Data Integration Studio could be consider like that.

Anyway, my question in not SAS, my question is:

Which is more effcicient in an ETL to process row by row with pipeline between steps or process the whole datasets in every step and wait to be completed to send to the next?

Thanks,

Juan

juanvg1972

Posts : 25
Join date : 2015-05-05

View user profile

Back to top Go down

Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?

Post  ngalemmo on Wed Jun 17, 2015 5:15 pm

Pipeline.  It reduces I/O.  The fewer times you need to touch disk, the faster things will run. Purpose build ETL tools do a lot of in memory caching for lookups and other purposes.

But there are exceptions.  For example, doing transformations in database on a MPP platform such as Netezza (aka IBM Pure Data) is extremely fast.  It is a matter of understanding the architecture of the system you are working with and leveraging its strengths.  A well designed process can transform hundreds of millions (if not billions) of rows in a matter of minutes.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?

Post  juanvg1972 on Thu Jun 18, 2015 11:21 am

Thanks ngalemmo,

another related question.

It's always better to avoid I/O, but when you process a datasets in memory, is better do the step row by row or process the whole dataset?.
I think is better to process the whole dataset, but if yoy process row by row yoy can pipe processed rows to the next step?

In terms if best practices, what is better?

Thank for your help

juanvg1972

Posts : 25
Join date : 2015-05-05

View user profile

Back to top Go down

Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?

Post  ngalemmo on Thu Jun 18, 2015 11:46 am

It really depends on the tool you are using. There is no 'best' choice, only a best choice for the particular architecture.
avatar
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

View user profile http://aginity.com

Back to top Go down

Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?

Post  ron.dunn on Sat Jun 20, 2015 7:13 pm

I differ slightly, in that I believe in-database transformations are generally faster than in-tool for production-scale workloads. I've seen complex dimension processing through SQL-based ELT beat both SSIS and Kettle (for example) on the same hardware, DBMS and data.

(I spend a lot of my time performance testing ETL/ELT tools)

ron.dunn

Posts : 55
Join date : 2015-01-06
Location : Australia

View user profile http://ajilius.com

Back to top Go down

Re: ETL: process row by row with pipeline between steps or process the whole datasets in every step?

Post  Sponsored content


Sponsored content


Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum