ETL Design Problems for Real time

View previous topic View next topic Go down

ETL Design Problems for Real time

Post  sky87 on Thu Sep 18, 2014 4:53 am

Hi at all

I agonise my head in designing an ETL for a real time DWH. It is not so clear for me, what I have to do in which steps and i hope that you can help me a bit.
I tell you something about my initial situation.

I have two Source Systems, which are independent to each other and are used by different user applications. I have to design a DWH for Reporting issues. Some data in the source systems are the same. I will give you an example.

Table 1:


Productnumber|  Description|  Status|  Product
P1  xxx   1  Product1
P2  yyy1  Product2
P3  zzz1  Product3
Table 2:


ID|  Description|  Product | Productyear |Productnumber |
1  yyy  Product2 | 2001P2
2  xxxProduct1 | 2004P1
3  vvvProduct4 | 2006P4

It is possible that the application which is using Table 1, is inserting a Product 4 with vvv in a few days for example.

So what I want is to build a Productdimension with the Attributes Productnumber, Description, Product, Status and Productyear.
When I want to integrate those tables in one schema, i need a staging area, is that right?
So what have I to do?
Table 1 -> one-to-one copy in Staging Area --> select attributes for Datawarehouse--> combine (join) with table 2 --> name this Table Products --> Load Dimension
Table 2 -> one-to-one copy in Staging Area --> select attributes for Datawarehouse--> combine (join) with table 1 --> name this Table Products --> Load Dimension

It is not clear for me, what tables do I need for the complete process.

Another approach is the real time approach, where changed or updated data will be pushed to the Datawarehouse.

It would be great, if you can help me to get clearer.

sky87

Posts : 2
Join date : 2014-09-18

View user profile

Back to top Go down

Re: ETL Design Problems for Real time

Post  nick_white on Thu Sep 18, 2014 9:40 am

Hi - as a general principle you would use a staging area for loading any data into a DW. You load data "as is" from your source system(s) into the staging area, transform it and then load it into your DW. Your staging area would normally be a schema in the same DB as your DW (or at least on the same physical DB server) for performance reasons as you tend to do a lot of lookups between your staged data and your DW.

In your particular example I'm not sure how this data is structured. There appears to be a 1:1 relationship between the records in each table - is this correct or is this just because of the data examples you've given? 
Can you have multiple records (one for each year) for each Product in Table 2?
Are both tables populated at the same time for each Product and if not is one always populated first?

I'm a bit confused as most source systems are OLTP systems and have normalised data while these tables are not normalised.

Anyway - you can probably extract the data you want using a single SQL query and the write this data to your staging area:

SELECT T2.*, T1.STATUS
FROM Table2 T2
INNER JOIN Table1 T1 on T2.productnumber = T1.PRODUCTNUMBER -- Change to left join if appropriate

Hope this helps?
Regards,
Nick

nick_white

Posts : 366
Join date : 2014-01-06
Location : London

View user profile

Back to top Go down

Re: ETL Design Problems for Real time

Post  sky87 on Fri Sep 19, 2014 12:00 am

hi

thank you for your reply.
Okay I design a staging area. I have read that the staging area is normalized but not even in 3NF. If I am right I pull my data as it is in the staging area (1to1 copy), then I consolidate my table schemas without any constraints. Is this right?



In table two the product number and product is unique. I can´t add multiple products in this table.

The tables are not populated at the same time. I might be possible that one application ( whatever which application) changes or adds data at one time and the other later.

If i have my staging area and my datawarehouse is loaded with an initial load from the staging area, what have i to do with the data in the staging area after loading? What are best practises, if only changed data will be poplated in real time to the datawarehouse?


Regards
Isa

sky87

Posts : 2
Join date : 2014-09-18

View user profile

Back to top Go down

Re: ETL Design Problems for Real time

Post  nick_white on Fri Sep 19, 2014 5:49 am

Hi - this is not meant as a criticism but it sounds like your knowledge of ETL and data warehousing is rather limited - and this forum is probably not the best place to get a tutorial on these topics - if only because it would take days and hundreds of posts. Can I suggest that you read a book on the topic - Kimball's is a good starting point: http://www.amazon.co.uk/Data-Warehouse-ETL-Toolkit-Techniques/dp/0764567578/ref=sr_1_5?s=books&ie=UTF8&qid=1411123412&sr=1-5&keywords=kimball

Alternatively book yourself on a course - preferably one that focuses on whichever ETL tool it is that you are using.

Regards,
Nick

nick_white

Posts : 366
Join date : 2014-01-06
Location : London

View user profile

Back to top Go down

Re: ETL Design Problems for Real time

Post  Sponsored content


Sponsored content


Back to top Go down

View previous topic View next topic Back to top

- Similar topics

 
Permissions in this forum:
You cannot reply to topics in this forum