Kimball Forum

Data Vault v's Dimensional Model




Specific answers to your questions

Post  dlinstedt Sat Aug 27, 2011 6:50 am

Hi Hang,

Virtual (SQL view) layers can provide type 1 and type 2 dimension views without storing copies of the data. Type 3 dimensions can be done as well, but they require much more complex SQL because we are pivoting data over time into hard-coded columns. I would not recommend producing type 3 dimensions virtually from the Data Vault model.

All of the "data over time" is stored in the Satellites. The Links are parent tables (much like factless facts) along with a child table (Satellite) hanging off the links. Except in the notion of a transactional Link (which looks exactly like a fact table. However, their are specific rules to when and how to use a transactional link, otherwise flexibility and data integrity can be lost. My book: at learnDataVault (dot) com (slash) purchase-book discusses the architecture and the rule sets in detail. Because the Data Vault model is a hybrid approach (cross between 3NF and Star Schema) in inherits best of breed of both models, along with some (but not all) of the structural components.

Without a table example to go with your comments about a "deeper fact table", I am having trouble seeing your point about performance.

Now, regarding surrogate keys: the Data Vault model uses surrogate keys for all tables, and again, because it's a hybrid of 3NF and star schema, it can make use not only of "star-join" optimizations, but also of query (table) elimination techniques AND parallel query engines. Effectively, the Satellite data is vertically partitioned, and it can be further partitioned horizontally for added gains. The Satellite key structure is the surrogate plus the date constraint, but again, because of table elimination and partitioning mechanisms, only the tables that are needed are brought into the query.
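
Continuing the sketch above (again, hypothetical names), virtual type 1 and type 2 dimension views over the Hub and Satellite are simply:

-- Type 2: every Satellite row becomes a dimension row with its validity dates.
CREATE VIEW dim_customer_type2 AS
SELECT h.customer_sqn    AS customer_key,
       h.customer_bk,
       s.customer_name,
       s.customer_segment,
       s.load_dts        AS effective_from,
       s.load_end_dts    AS effective_to
FROM   hub_customer h
JOIN   sat_customer s ON s.customer_sqn = h.customer_sqn;

-- Type 1: restrict to the current Satellite row per key.
CREATE VIEW dim_customer_type1 AS
SELECT h.customer_sqn    AS customer_key,
       h.customer_bk,
       s.customer_name,
       s.customer_segment
FROM   hub_customer h
JOIN   sat_customer s ON s.customer_sqn = h.customer_sqn
WHERE  s.load_end_dts IS NULL;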

Hope this helps,
Dan Linstedt
http (colon) learnDataVault (dot) com

dlinstedt

Posts : 7
Join date : 2010-03-15


Re: Data Vault v's Dimensional Model

Post  hang Sat Aug 27, 2011 10:06 am

dlinstedt wrote: Without a table example to go with your comments about a "deeper fact table", I am having trouble seeing your point about performance.
Hi Dan,

Sorry for the confusion. I really meant that Kimball's periodic snapshot fact table produces better performance at the cost of redundant data materialisation on disk, resulting in a much deeper fact table than an effective-dated fact or perhaps DV link tables. However, without materialising the snapshot data in a physical table, you may need to cross join the fact with the date dimension to have a virtual snapshot in memory to enable trend analysis, for instance by a cube. I am concerned about the performance cost caused by that cross join.
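
Roughly, what I have in mind (hypothetical table and column names, sketch only) is a view like this, which expands an effective-dated fact into a per-period snapshot by joining it to the date dimension on a range:

-- A virtual month-end snapshot over an effective-dated fact.
CREATE VIEW v_account_balance_snapshot AS
SELECT d.date_key,
       f.account_key,
       f.balance_amount
FROM   fact_account_balance f          -- effective-dated fact
JOIN   dim_date d
  ON   d.calendar_date >= f.effective_from
 AND   d.calendar_date <  f.effective_to
WHERE  d.is_month_end = 1;             -- snapshot at month-end grain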

I can see the point of vertical partitioning by Satellites, as the changing attributes of dimensions are stored in respective tables according to their changing patterns instead of in wide denormalised dimension tables. There is no doubt the performance gain from normalising big dimensions would be significant. In dimensional modeling, Kimball also suggests normalising monster dimensions by breaking them down into a set of tables based on the attributes' changing patterns and cardinalities.

Correct me if I am wrong. I guess the critical difference between dimensional modeling and Data Vault is the treatment of dimensions. The former denormalises in general but normalises in some special cases, whereas in Data Vault, dimensions must be physically in 3NF, and the denormalised dimension format may only be achieved through a virtual layer for user consumption. Kimball suggests having minimal dimension entries in the fact table, through reasonable dimension denormalisation, to avoid a centipede fact table. Data Vault keeps dimensions normalised and relates them through link tables to minimise data redundancy in the dimension tables and to control dimension growth at the most granular level.

The common ground for both methodologies would be minimal snowflaking and shallow joins between dimensions and facts. I am not sure how DV handles multivalued attributes; I guess it would use Links and store the attributes in their respective satellites at the level of their own grain. Does DV accept a junk dimension or mini-dimension? Or is that too denormalised, so the fast-changing, low-cardinality attributes should go into individual, hardly worthwhile satellites? I have seen designs with 20 dimension tables holding only two records each in the name of normalisation. Even in relational thinking, so many piecemeal tables would make you feel uncomfortable. Believe me, a few joins making tons of low-cardinality attributes available in one place is awfully convenient and comforting, although understanding the concept and implementing it is mind bending.

I don't think performance would be a major concern in terms of dimension joins for either model. It really comes down to ease of use at the physical level. But ease of use through denormalisation is achieved at the cost of data quality, which can only be effectively addressed by normalisation at some stage of data warehousing, given that most source systems don't provide high quality data.

In designing a BI system, you always end up with some sort of ODS, no matter how undesirable it is normally regarded in pure dimensional thinking. In reality, most legacy OLTP systems are not well-normalised systems. I think that's why Master Data Management (MDM) has become increasingly popular as a part of BI strategy. The question is, what model are these MDM systems based on? Can we build our own MDM system if the model is standardised, like dimensional modeling or Data Vault?




hang

Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia


Hi Hang

Post  dm_unseen Mon Aug 29, 2011 4:43 am

Hi Hang

FYI, I'm a Dutch Data Vault expert and president of the Dutch Data Vault user group.

To give my take on MDM: most MDM systems lack a good memory of the past, which means that DV can work for MDM systems as well. I know that some have built MDM systems with methods that are related to Data Vault. Especially when you start doing bitemporal data, DV will be a lot easier than 3NF or dimensional modeling. My take is that Data Vault is a good foundation for MDM, and can also be used to create fully historical MDM solutions, or to keep the history for MDM solutions that do not keep any history themselves.

To Thomas Kejser,

Extreme scalability has a lot of factors, and one of them is knowing the underlying system and hardware. I'm quite sure that big SQL Server DV implementations have their own challenges, and I know that at the extreme scale SQL Server is NOT a JOIN champion in any edition (PDW, Fast Track...). There are, however, DBMSs that do allow for superior JOIN performance, like MonetDB/Vectorwise.

I also want to caution against extending dimensional modelling for full temporalisation. Especially bi-temporalisation and multi-temporalisation do not work well with dimensional modelling. For this aspect Data Vault does far better. When time becomes a complication, model normalisation and generic temporalisation (and hence join performance) become a must, regardless of the kind of 'modelling techniques' (actually model-to-model transformations) you'd like to perform.

dm_unseen

Posts : 4
Join date : 2009-02-03
Location : Netherlands


Assumptions and data structures

Post  thomaskejser Mon Aug 29, 2011 4:47 am

Hi Dan

Good to see you are still alive and active in this forum. I look forward to hearing your views and discussing them with you.

First of all, let me start by establishing my credentials. I design data warehouses for everyone from retailers, manufacturers and telcos to, now, investment banks. I have over 15 years of experience in the database industry and am currently leading the design of the largest SQL Server installations in the world. One of my specialities is the movement of data between sources, the "EDW" and marts - which typically means that my systems move several TB of data every day. I too work with major financial institutions (which I, for obvious confidentiality reasons, cannot name). I think I have even run across a few of your designs out there in the field. I hope we can also agree from our field experience that just because customers are doing something doesn't mean it is a good idea. Hence, the fact that someone built something using your techniques does not mean it is the best way to do things.

You made a lot of points, so I will go over them one at a time. Let me just reiterate that data models should, in my opinion, be built on proper assumptions about database engines and their capabilities. Let me also state where I agree with you: normalization, the way you use it for satellites in your modeling technique, is actually a good idea in many cases. This is something I am going to elaborate on in my blog as I advance the argument there. I will be tracking your blog, so it will be interesting to ping back and forth.

To your points:

Dan: "When was the last time you needed to join "all" the tables in a data warehouse (for any data modeling technique)? If this is the case, then sure, the Data Vault has more joins, and therefore if you query the entire model, yes, it will perform worse than other models."


Joining "All" tables is really not needed to get the benefit - and I hope I did not come across as putting up such a far fetched straw man. The problem can be boiled down to joining TWO tables. Take TPCH as an example. In that schema, they split LINEITEM and ORDERS - this means that queries grouping on "dimensions" from both will incur an expensive join. A more efficient design would be to store the two tables pre-joined. Of course, you could co-locate the two tables, but this just brings me to:

Dan: "However, due to the normalization - AND the query optimization techniques such as table elimination, AND the co-location use of elements such as join indexes, the join performance is in fact FASTER than other modeling techniques."

Co-location will allow you to join two large tables together at a reasonable speed (using a merge strategy). However, it only works if you are not seeking into those tables (only if you scan or bitmap scan). If you take a Data Vault model with two large hubs linked by a large link, co-location is often not feasible, because you can only store data sorted one way (and bitmaps won't help you either). In other words: you can only co-locate if there is a common column to co-locate ON in the model - and interestingly, the Data Vault's super-normalization guarantees that NOT to be the case.
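
To make that concrete, here is a rough sketch in PDW/Synapse-style DDL (hypothetical names): the link carries two hub keys but can only be hash-distributed on one of them, so at most one of the two hub joins can be co-located.

-- The link can be distributed on customer_sqn OR product_sqn, not both,
-- so only one of the two hub joins can be resolved locally on each node.
CREATE TABLE link_customer_product
(
    link_sqn      BIGINT    NOT NULL,
    customer_sqn  BIGINT    NOT NULL,
    product_sqn   BIGINT    NOT NULL,
    load_dts      DATETIME2 NOT NULL
)
WITH ( DISTRIBUTION = HASH(customer_sqn) );   -- co-located with hub_customer only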

JOIN indexes is "cheating" you are giving away large storage to get more performance. Granted: Trading storage for performance is normally a good idea- for small tables or aggregates. But if you have large table structures, and the only reason you implement join indexes is to make them reasonably fast to query, then I ask you: why did you bother modeling them like that in the first place?

Dan: "Why? Because the row sets are normalized, because the tables can be "queried in parallel due to the normalization". Co-location is KEY as is data spread or distribution in an MPP environment."

I have no clue which database engine you are referring to that creates more parallelism by having tables "queried in parallel due to normalization". As you may recall, databases implement read locks that allow multiple threads to touch the same table at the same time - including threads that are part of the same parallel query. To give you an example, SQL Server will happily scan a single table at 36GB/sec (and after that, you typically run out of IOPS on an SMP system).

Dan: "how many rows can I fit in a disk block before I push I/O higher on an exponential scale? The block-size of different database systems vary. In Teradata it used to be 64k, I think now they've boosted that to 256kb. Oracle can go to 64kb only on certain platforms, and so on.... Any RDBMS system on Microsoft is locked in to 8k physical, unless the physical disk is reformatted to be 64k, but even then, the Microsoft OS must work with 8k block sizes."

A few factual corrections: Microsoft Windows allows 64K blocks, but that is just the allocation unit. You can request any I/O size during scans (for example, SQL Server will request 512K blocks). Oracle, I believe, can go all the way to 1MB page sizes on certain platforms. Not that it matters, because there is no point where I/O goes exponential in a star schema: you scan a table, and the IOPS are linear with the table size. Of course, if you insist on normalizing, then all bets are off, since joins - especially the loop joins that the from/to dates in the vault model require - are log n * m complexity, and if you lack the right indexes you get your exponential time (which, interestingly, only happens on normalized models).

Dan: "you reach a sweet spot, where performance is optimal - great, that means the I/O's required to get the rows out are the least amount of I/O's needed to access the data, and are balanced with the query. HOWEVER, you are still left with a SEQUENTIAL SCAN of a single table - and now, you are relying on the "smartness" of the disk and the RDBMS to know that you want a table scan, and to read-ahead the blocks in to RAM to gather the information. "

As opposed to what? Trying to cram the join into memory or relying on random disk access? Let me just reiterate: an index-only access strategy on a large table is incredibly expensive, both in disk space and in performance (since you will be thrashing the RAM as you access each index). Also, please note that fast disk scans are exactly what the specialized DW database engines like Vertica, Greenplum, PDW and Netezza are made to excel at.

Dan: "Continue to add width to the row (doesn't necessarily mean adding new columns). You will find that once you pass this sweet spot in width (denormalize by removing another join, and combining another tables' columns) that the performance will begin to degrade on an exponential scale. Why is that? Because the I/O is STILL stuck on sequential access, and it is still forced to TABLE SCAN, now, there are less rows per block - and something else starts to happen... Rows start to "chain over/across" blocks, which doubles, then triples, then quadruples the I/O - and for gosh sakes, if you hit fragmentation - watch out...."

I am curious which database vendor you have observed this behavior for? I am deeply familiar with the code base of SQL Server - and there is no such code or behavior in there. On the note of having wide rows: it is correct that this is an issue a star schema model must struggle with, since surrogate keys are de-normalized into the star schema. But let us just do a quick back-of-the-envelope calculation: a dimension key is typically 4-8 bytes, and so is a measure (using some decimal, float or int). Let us be worst case and say 8 bytes per column. In our smallest case, SQL Server, a page is 8K - so this leaves you with 1000 columns in the fact table. So can we agree this is not such a big deal?

On the other note about row width: I think most DW designers acknowledge that there is an overhead in "paying" for the scan of the columns you are not using. However, this is what column stores are all about: they allow you to touch only the columns you need to touch and only incur the IOPS for those.

Again, I also have to ask: as opposed to what? Index access to huge tables?

Dan: "This is where Normalization (many rows per block) combined with proper parallel query, and the right co-location can far surpass any denormalized model. Because you can achieve a BALANCE with the right layout and distribution of the data. It also means that the scalability is limitless (in keeping with the hardware changes). These are the laws of MPP, divide and conquer. "

I agree that MPP is divide and conquer. But I think we disagree on how to divide. "Divide" in MPP is typically done by hash-distributing the large tables across the nodes and co-locating the small tables: by replicating them, or by applying the same hash if the tables share columns (see the note on co-location above). So you are not dividing the tables vertically, you are doing it horizontally (distributing rows, not columns). This is what leads to balance - and it just so happens that star schemas, with their small dimension tables and large facts, lend themselves very well to co-locating the dimensions (by replicating them) and evenly distributing the facts.

I believe I have already answered the many rows/block point with the column store (and if you should find yourself in a DB engine that does not have those, feel free to vertically partition the fact table in the star schema or add a narrow, scan friendly index).

Dan: "Why can't data warehouses be built and developed the same way twice, three times, etc? If this is the case, why hasn't the industry matured and optimized the building of data warehouse systems by automating 80% or more of the mundane tasks? If this is the case, why are businesses still challenging the value and cost of implementing BI solutions? If this is the case, why does it take IT months or even years to "cross-integrate" new sources, new data sets, and provide new answers?"

I certainly hope you appreciate the complexity of even getting ACCESS to sources (and having someone explain WHAT the source is), not to mention the large business questions about how you want to QUERY the data. I think there is also wide agreement that this effort, not the modeling, represents the 80% of the work needed to build a warehouse. I think everyone in the IT industry has beaten their fist on the table asking: why can't building IT systems be like building bridges? Books have been written on why this is the case - I don't think the data warehouse is a special case here.

When it comes to the mundane tasks of modeling, this can be automated with meta-frameworks (interestingly, independent of which model you use) - something that several companies now specialize in (I personally built one that automates the creation of star schemas). Of course, if you are to use a meta-modeling tool, you must find yourself a business that appreciates that a well-written meta-model tool, operated by a few experts, beats the other strategy: outsourcing to an army of "ETL programmers" and "BI developers".

Dan: "because the RDBMS engines today don't fully handle complete temporality of data sets."
The temporal database is one of those dreams we have had in IT for many years. There are good reasons it has not been done in a performant way (there are inherent complexities in this problem that do not lend themselves well to good solutions). I happen to agree that the satellite construct ("flaking off the history") is the best bet so far.

Dan: As for your claim "hard to bring in to a dimensional model" That simply is flat out false and un-true. I can (and have) built and loaded Star Schemas from a Data Vault within 45 minutes of receiving a 2 page requirements document filled out by business users / customers. So I'm not sure where you're getting your information about it being difficult to load or build star schemas.

This comes back to the point about co-location. Joining large links and hubs together is very expensive - and so is creating de-normalized type2 dimensions on top of satellites, joined over hubs, joined over links, joined over more hubs/satellites/links. Keeping track of the aggregated history between them is a proper nightmare in a big "parking lot sized" data vault model.

Dan: "In fact, in Teradata I can build VIRTUAL star schemas (views that look like dimensions and facts), and for these cases, I never have to land the data physically anywhere after the Data Vault. In SQLServer, I can push the data from the Data Vault direct to an OLAP cube without resorting to materialized views. "

Building such views only defers the cost of the join to the cube or the query time. Moving the problem around like that does not make it go away - but you are now paying the join cost several times (every query) instead of only once (during the ETL). This is one of the points I will make in my blog in more detail, suffice to say that it is a big waste of hardware resources to implement non-materialized data structures that are joined at query time.


Dan: Ok - please justify this. Was this on Teradata? Oracle? DB2 UDB? Netezza? DatAllegro? SQLServer? Paraccel? or something else? What was the underlying hardware? How much money did the customer spend on making the infrastructure viable?

My first run was on a Unisys R7600 using SQL Server 2008. I have since done this on an HP DL980, using SQL Server 2008 R2. The cost of such a system is in the 200,000 USD range. MPP is not really required to get this sort of speed.




thomaskejser

Posts : 12
Join date : 2011-08-14


Not here to debate performance.

Post  dlinstedt Mon Aug 29, 2011 7:01 am

Hi Thomas,

Before I get going... the first thing I'd like to say is: I'm not here to debate performance. The problem we get into when debating performance is that it is all just hearsay unless we put the two models to the test on the same infrastructure, with the same data sources loaded in full, and tuned properly. Even then, it's like comparing apples and oranges. So the best we can hope to get out of a head-to-head performance test is approximations. Furthermore, we would need sufficient data to overrun the database's in-memory caches and the disk caches (I/O caches and read-aheads) in order to make the test valid. Debating performance with you without hard and fast head-to-head numbers on the same infrastructure, on the same machine, would be like comparing steel to sand - two completely different things - and in the end it's all just hot air without justification.

I do realize that we are both speaking from experience, great, wonderful... but just because we have experience doesn't mean you and I have seen the same systems, or even designed the same systems. Furthermore, if you've not been trained as a certified Data Vault modeler, then you have not had the full experience - and are missing some of the nuances or finer points of the data modeling techniques and standards that are available.

What I will say is this: Yes, performance is important, yes - the size of the "live or used" data set matters, yes, parallelism, partitioning, and the ability to co-locate data sets matter. Design matters.

One point that is very difficult to argue over is end-user access. Why? Because the Data Vault is built to be a back-end data warehouse, that feeds star schemas, cubes, and other data modeled data marts on the front end. With this in mind, the Data Vault is tuned for high speed loading, AND high speed querying (but, of course - it is NOT tuned for ad-hoc querying) - it is tuned for specific processes (consistent, repeatable processes that use the same indexes and the same data sets over and over again) to produce the front-ends.

In tuning: you can tune for ad-hoc access, or you can tune for processing speed - it is really difficult to tune the architecture (physical design) for both, unless you have in-memory constructs. This is why we see the rise of "in-memory" processing from traditional RDBMS vendors, and the rise of in-memory processing from appliance vendors - as well as query vendors.

Now, regarding "large Hubs and Large Links"... that depends on what you mean by "large?" and how you vertically partition and horizontally partition can make all the difference in the world, especially when the implementation is built on a system that is truly MPP enabled. Back to the point: the average "row width" of a Hub is approximately 28 to 100 bytes (depends on if you use unicode, and if you have a VARCHAR as a business key). With that, assuming a 64 kb block size (you said could be attained, ok... can you point me to the documentation that provides this, I'd love to look it up?).... means approximately 640 rows per block for the Hub. Especially if you set the fill-factor to 100% (% free to zero). It's 100% inserts which lead to zero fragmentation. The Links, well, the links average size (assuming 5 foreign keys @ 8 bytes each - which most links have 2 or 3) + load date (8 bytes) + sequence (8 bytes) + record source (varchar(12)), you are looking at a row width of approximately 70 bytes. Which again, for 64kb block size means approximately: 914 rows per block. Also fill factor of 100% (% free set to zero), and it too is 100% inserts, which lead to zero fragmentation.

One more note about SQLServer 2008 R2 - it is really more about the Windows OS underneath... SQLServer still uses PageFile.sys (because it uses the Windows OS), and the Windows OS (not even the 64-bit OS) does not currently provide fully parallel access to PageFile.sys at the core level - Windows is incapable of "opening multiple threads at exactly the same time" to perform I/O against the PageFile, which means the RAM swapping that Windows does can still become blocked I/O, with a single process waiting on the PageFile. This is NOT true of Linux, Unix, and mainframe systems. This one small detail can make a huge difference in large systems that desire parallelism. In other words, Windows and SQLServer are NOT truly pre-emptive, even in their RAM swap algorithms (which means there can be tremendous bottlenecks for parallel threaded code, AS WELL as for data sets that are being run in parallel processes).

As we both know, fragmentation is the enemy of performance on any database. Devices like Netezza add built-in compression on top of this, BUT, because of the way Hubs and Links are built, column compression (value compression) can only occur for the load dates and record sources. Even then, that's approximately a 50% compression rate, increasing (roughly doubling) the number of rows per block.

Anyhow, once you run these numbers with co-location AND parallel query (true parallelism - pre-emptive tasks, which SQLServer, because of the OS, can only execute in non-pre-emptive mode, meaning some tasks can BLOCK the I/O for longer than they should)... in truly pre-emptive parallelism, running multiple parallel tasks (parallel query threads) against different data sets (or computational nodes in an MPP world) provides near-linear performance gains compared to singular tasks (or table scans) over denormalized data sets. This has been proven over and over again as the only way to scale beyond a certain point. If anyone wants to know more, there are hundreds of books, articles and proofs available. You can start with HPC (high performance computing) at https://computing.llnl.gov/tutorials/parallel_comp/ as an example of a high level discussion.

Now, a join index in the Data Vault is not cheating... In fact, it IS the Link table; the Link table's many-to-many relationship is the only way to join two different structures together. The Link table is stored in its own right as a part of the design. And no, you can't co-locate both Hubs (part of the design is to achieve balanced data across the MPP nodes), but you can co-locate one Hub (the largest Hub) with its corresponding index (both of which are very, very narrow and can fit many rows per block).

Anyhow, I teach all of this implementation, parallelism, and query & more, in my performance and tuning classes - as well as my implementation classes. Those of you interested in the "science" behind the scalability, and the implementation of the best practices can learn more at: learnDataVault (dot) com - I'll be releasing on-line classes between now and the end of the year. Drop me an email if you are interested.

With normalization the point is this: truly pre-emptive parallel threads executing against large numbers of rows, with a SMALL I/O count (because they can retrieve large numbers of rows thanks to the rows per block), can execute faster than a single table scan against a wide table, especially when the width of that table exceeds the optimal rows-per-block and begins to double, triple, and quadruple the I/O counts. This type of model (table scans) is not sustainable in large data set environments - this is also why each Netezza box has an upper terabyte limit, and why you have to string multiple Netezza boxes together to achieve and manage growth.

Anyhow, the proof is out there regarding parallelism, partitioning, and row-size...

Regarding width of rows, it's mathematical in nature - it has nothing to do with the code base. Again, try this experiment on multiple database engines: run a single table scan against 1k, 2k, 4k, and 8k row sizes (consecutively). Build many, many rows (with compression off, no parallel query, and no partitioning). What you find is this: for 64 KB block sizes, you can put 64 rows, 32 rows, 16 rows, and 8 rows per block respectively. Now load the tables with 5 million, 50 million, 200 million, and 500 million rows. Test the performance against each by querying ALL columns and ALL rows (it is impossible to tune with just a simple select count(*)). I go through this test case and discuss the nature of testing in one of my performance and tuning on-line classes; I outline the entire test, the testing procedures, and the results. There is too much to this test case to explain it all in this one post. But the end result is this: the wider the rows, the slower the scan becomes - it begins to push the I/O count up (double, then triple, then quadruple it) - so there is a sweet spot of denormalization that the data modeler should not exceed. The sweet spot will change depending on the block size and the hardware and database platform you are on. You can find out more by contacting me: learnDataVault (dot) com (slash) contact-us
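
As a rough sketch of the test setup (hypothetical names; the full procedure is covered in the class), one of the test tables and its scan look like this:

-- Row padded to roughly 1 KB; repeat with CHAR(2000), CHAR(4000), CHAR(8000)
-- to widen the rows, then load millions of rows into each table.
CREATE TABLE scan_test_1k
(
    id      BIGINT     NOT NULL,
    payload CHAR(1000) NOT NULL
);

-- The comparison query: touch ALL columns and ALL rows, not just count(*).
SELECT * FROM scan_test_1k;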

Oh yes, your claim that I am "trying to jam the joins into an in-memory access"... I never made such statements; either the purpose of a Link table is not understood, or the "lack of indexing" required in the Data Vault (other than the primary and foreign keys) is not understood. One more point... The Data Vault is tuned for processing speed (loading and mechanical querying - meaning NO AD HOC)... Again, we do not need to "jam" the joins into an all-in-memory process; quite the contrary, a true MPP system (like Teradata) will assign the joins to the right execution node. DB2 EEE has its own method. Netezza does joins at the firmware level and uses flat-wide tables underneath (building on the COBOL notions of data access, but adding compression and hash-based data access algorithms to be able to compute where data lives on disk). Other databases do things differently; even SQLServer clustered machines and its new Data Warehouse edition with partitioning do things differently (especially after its purchase of DATAllegro, which was PostGreSQL based).

One last point I wish to make: just because you've seen Data Vault implementations out there doesn't mean I've been involved, doesn't mean they were created correctly - especially if the model has not been audited or verified by me - and doesn't mean they can scale (that depends on the hardware and database system as well). Just like with any data model, people can and have "gotten it wrong", especially when they've built it without proper training or proper review. I've seen the same things with 3NF, star schema, and just about every other data modeling technique out there - so it's not fair for you to make comparisons to "models you've seen" without some basis of justification: letting me know who built it, when and where, and checking to see whether it was built by Certified Data Vault Consultants or reviewed by me.

dlinstedt

Posts : 7
Join date : 2010-03-15


Assumptions - still getting it wrong

Post  thomaskejser Mon Aug 29, 2011 12:14 pm

Hi Dan

First of all, let me just direct you to the Windows Internals book by Mark Russinovich. There, you will read about the multi-threaded nature of the I/O system in Windows and how it has nothing to do with pagefile.sys. You really shouldn't give advice about Windows and how databases run on top of this platform without having the necessary background, which, I am sorry to say, you display a lack of. The section you might want to read up on is I/O completion ports and how they work. Incidentally, you will also find information about how block access works in Windows (how to get large block scans, for example).

Secondly, while I agree that performance is only one aspect of a data model, it does matter, since it determines the eventual success of the installation (no performance = too long a batch window and a poor user experience). And it seems you are interested in making this point too, since you make statements about the science of tuning and the mathematics of databases.

Third: let us define "large" tables as tables that don't fit in memory. As you rightly point out, pure in-memory engines do not form a good basis for comparing models (since they don't incur any of the unique challenges that big data sizes create). So, consider a join of three large tables (let us say two hubs and a link). In order to perform this join on a large dataset (say, when you load a big fact table or respond to a query at low granularity), you have several options. Let us have a look at some common ones:

1) Scan one table, loop into the next two
2) Scan one, hash the other two
3) Scan one, co-located merge into the other (assumes the same ordering on the physical structure)
4) Denormalize the co-location column and merge
5) Use bitmapped or hash based disk access

Now, all the options that do hashes in memory are very expensive - because it is expensive to "spill" a hash, and if the hashes are large and DO fit in memory, you get thrashing. So, hashing is out for the three big table join.

That leaves the loop and merge strategies. Loops are expensive, since they are random, small block IOPS (expensive, slow, pointer walking, lg n * m). Bitmaps and hashes speed this up, but only to a fraction of the speed achievable with a column store - and you still have to pay the compare cost on the CPU cores and in memory (which you would not have to do if you were de-normalized).

Merges require that all tables are sorted the same way. This either means large memory usage (incurring the thrashing effect at concurrency) or that you denormalize the merge column into all the joined tables to get proper co-location. Unfortunately, this only works for the simple cases.

Fifth: if you are extrapolating from "wider rows incur more IOPS" into guidance for a data model, I am sorry to say that you are seriously overstretching your assumptions. Column stores are there to address exactly that problem (this, incidentally, has been known since the 70s). And even if you don't have column stores in whatever database engine you target, you can use materialized aggregate views to achieve the same effect of bringing down row width for the queries that need the fast scans (since it seems we have now established that spending even more disk in an already expensive system is not cheating). Again, Data Vault does not address this issue; you simply trade fast disk access for more, slower joins. It does not follow from "wide rows are slower" that de-normalizing is a bad idea.

Ohh... and DATAllegro was not spawned from PostGreSQL, it came out of Ingres. Please get your facts right before you go out there preaching your data model.


thomaskejser

Posts : 12
Join date : 2011-08-14


Let's not get personal here.

Post  dlinstedt Mon Aug 29, 2011 2:17 pm

First, I don't approve of the fact that you think the forum is a place for public attacks. If you really want to help me and the rest of the readers understand your points, then you will kindly provide us with on-line documentation (as I requested) that shows us where I have got it wrong. Like all humans, I make mistakes, and for that I am sorry. However, please don't cross the line and hit below the belt. Furthermore, I am attempting to have this discussion with you even though you have not bothered to take the time to become a certified Data Vault modeler... just keep that in mind, please.

Performance matters, yes it always matters - but again, until you have side-by-side facts (which I keep asking you to produce), you cannot make the claims you are making either...

For the data modeling aspect it's quite simple:
Take a Dimension, strip it down to a type 1, remove all other elements except the Business key and the surrogate sequence. Do this for one more dimension. Then, build a factless fact table with these two surrogate dimension keys. What do you have? Two Hubs and a Link.
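
In rough DDL terms (hypothetical names, sketch only), the two structures have exactly the same shape:

-- A type 1 dimension stripped down to surrogate + business key is a Hub...
CREATE TABLE dim_customer_stripped
(
    customer_key BIGINT      NOT NULL PRIMARY KEY,   -- surrogate sequence
    customer_bk  VARCHAR(30) NOT NULL                -- business key
);

CREATE TABLE dim_product_stripped
(
    product_key  BIGINT      NOT NULL PRIMARY KEY,
    product_bk   VARCHAR(30) NOT NULL
);

-- ...and a factless fact over the two surrogates is a Link.
CREATE TABLE factless_customer_product
(
    customer_key BIGINT NOT NULL REFERENCES dim_customer_stripped (customer_key),
    product_key  BIGINT NOT NULL REFERENCES dim_product_stripped  (product_key)
);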

Now if I am to understand you correctly, your claims (against the Data Vault) are that a) it doesn't perform because of the joins, b) that it can't scale, c) that it is NOT better than one "big denormalized table"...

So to summarize (and put this quite succinctly): if what is under the covers with the Data Vault model is in fact a hybrid approach that uses PARTS of star-schema design, then to knock the Data Vault model is by default to knock the star schema design... Is this correct? So, if it doesn't perform because of the joins, then that is a problem that will carry through to ALL star schemas (given the design parameters set above). Generalizations get us ALL into trouble, and I am sorry if I generalized earlier... again, I defer the performance discussion until we have head-to-head numbers.

thomaskejser wrote: First of all, let me just direct you to the Windows Internals book by Mark Russinovich. There, you will read about the multi-threaded nature of the I/O system in Windows and how it has nothing to do with pagefile.sys. You really shouldn't give advice about Windows and how databases run on top of this platform without having the necessary background, which, I am sorry to say, you display a lack of. The section you might want to read up on is I/O completion ports and how they work. Incidentally, you will also find information about how block access works in Windows (how to get large block scans, for example).

I am not referring to "block access". I am referring to the physical I/O disk fetch that happens at the firmware and CPU levels of the machine. At this level I/O BLOCKING occurs in sequential fashion, causing other "threads" to wait on the CPU until the disk has returned from it's non-pre-emptive interrupt. It has nothing to do with the upper level operating system, and everything to do with the I/O Subsystem, and the firm-ware at these levels. This is also known as the BIOS layers. Can you tell me definitively - is Windows OS pre-emptive or non-pre-emptive multi-tasking?

thomaskejser wrote: Secondly, while I agree that performance is only one aspect of a data model, it does matter, since it determines the eventual success of the installation (no performance = too long a batch window and a poor user experience). And it seems you are interested in making this point too, since you make statements about the science of tuning and the mathematics of databases.

Of course performance matters; it's one of the major reasons for having this discussion - and you misinterpreted me. What I am saying (again) is that it's not worth debating here until and unless hard numbers from an approved head-to-head case study can be produced and published. Otherwise, it's just hot air between us.

thomaskejser wrote: Third: let us define "large" tables as tables that don't fit in memory. As you rightly point out, pure in-memory engines do not form a good basis for comparing models (since they don't incur any of the unique challenges that big data sizes create). So, consider a join of three large tables (let us say two hubs and a link). In order to perform this join on a large dataset (say, when you load a big fact table or respond to a query at low granularity), you have several options. Let us have a look at some common ones.

Ok - again, how does this "case" differ from the star-schema design I suggested above? One way: the case you suggest is "querying a single table for a table scan". What happens when you join a type 2 dimension to a fact table?? From a key perspective the query looks exactly the same as it does against the Hubs and Link structures I suggested above. It just so happens that Hubs and Links are capable of storing more rows per block.

thomaskejser wrote: That leaves the loop and merge strategies. Loops are expensive, since they are random, small block IOPS (expensive, slow, pointer walking, lg n * m). Bitmaps and hashes speed this up, but only to a fraction of the speed achievable with a column store - and you still have to pay the compare cost on the CPU cores and in memory (which you would not have to do if you were de-normalized).

Pointer walking? Small block IOPS? I did not state this was the case... in fact I said quite the opposite (again I am misinterpreted). I stated at the beginning that the block sizes were to be 64 KB... regardless of the data model, this means they are fairly large block IOPS - so I am missing the argument/point you are trying to make here. Now, in regard to a column store... I had no idea that we were even discussing a column store - I thought this was a discussion about Data Vault on SQLServer 2008 R2 (which to my knowledge is not a column store).

thomaskejser wrote: Merges require that all tables are sorted the same way. This either means large memory usage (incurring the thrashing effect at concurrency) or that you denormalize the merge column into all the joined tables to get proper co-location. Unfortunately, this only works for the simple cases.

If this only works in simple cases, then why is Teradata enjoying huge success in the marketplace with very real, very large, and very complex cases? It IS a set (a tuple) of join columns (sequences) that are co-located on specific nodes, allowing the separate nodes to do the work before pulling the data set onto the interconnect bus for the physical join. It does not flood RAM. And if this only worked for simple cases, why are database vendors playing catch-up to NoSQL, Hadoop, and other vendors (including column based stores) where the performance of extreme normalization has taken over?

thomaskejser wrote: Fifth: if you are extrapolating from "wider rows incur more IOPS" into guidance for a data model, I am sorry to say that you are seriously overstretching your assumptions. Column stores are there to address exactly that problem (this, incidentally, has been known since the 70s). And even if you don't have column stores in whatever database engine you target, you can use materialized aggregate views to achieve the same effect of bringing down row width for the queries that need the fast scans (since it seems we have now established that spending even more disk in an already expensive system is not cheating). Again, Data Vault does not address this issue; you simply trade fast disk access for more, slower joins. It does not follow from "wide rows are slower" that de-normalizing is a bad idea.

Wider rows that overflow the optimal row-to-block ratio WILL cause more IOPS; they also cause CHAINED ROWS to occur. When a row becomes too wide to fit within a block, it "straddles" into the next block, causing 2 I/Os instead of one to read the same amount of information. Once the data has become this wide (or keeps getting wider), the I/Os do in fact double, then triple (if the data is widened again), and so on. It also leads to additional fragmentation and, in the case of high volumes, massive performance issues.

Regarding aggregate or materialized views: they (at least in Oracle) form physical wide tables. In SQLServer, I am not exactly sure what they build under the covers; I admit I haven't studied it. However, a materialized view in Oracle IS a single wide table that represents all the columns in the view (Oracle folks, feel free to correct me if I'm wrong). Bringing down the row width only happens when you put fewer columns in the materialized view than are in the table. I never said that "spending more disk in an already expensive system is not cheating". If I haven't said it yet, let me say it now: because of the normalization, I do not store copies of the data sets across multiple tables, thus reducing the total amount of data that actually needs to be stored. With the reduction in data size, less disk is required to store the SAME data set than would be needed for a type 2 dimension.
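
As a rough illustration (Oracle syntax, hypothetical names), a materialized view like the one below is stored as a physical table holding every column in its SELECT list:

CREATE MATERIALIZED VIEW mv_customer_sales
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
AS
SELECT d.customer_bk,
       d.customer_segment,
       t.calendar_month,
       SUM(f.sales_amount) AS sales_amount
FROM   fact_sales   f
JOIN   dim_customer d ON d.customer_key = f.customer_key
JOIN   dim_date     t ON t.date_key     = f.date_key
GROUP BY d.customer_bk, d.customer_segment, t.calendar_month;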

thomaskejser wrote: Column stores are there to address exactly that problem (this, incidentally, has been known since the 70s).

I did not think that our discussion included column stores. Column stores have their own scalability problems; however, that said, column stores don't give a hoot about which physical data model is thrown at them. They "vertically partition" every single column; there is no concept of a "table" in a column store, except in the logical sense for organization and query methods. Last I checked, SQLServer 2008 R2 was not a column store - and I thought we were discussing the merits of the Data Vault on a SQLServer and Windows environment?

Of course column stores change the game, just like Netezza with its flat-wide denormalized table structures. The reason flat-wide works on Netezza is (again) because of the firmware changes, along with the column based compression that is automatically built in. But Netezza is NOT a column store.

One more point: if the discussion needs to be about column stores and performance... then why is it so bad to "normalize" data sets and join data together? Column stores do it - they take physical data models literally to 6th normal form.... The Data Vault takes it one step further than dimensional modeling, but not as far as a column store...

So, if table scans were the answer to massive performance and massive scalability, I ask again: Why haven't ALL the vendors jumped on the Netezza bus? If Table scans are so great and powerful, how come Teradata enjoys so much success in MPP and normalized modeling?

Best wishes,
Dan Linstedt


dlinstedt

Posts : 7
Join date : 2010-03-15


Re: Data Vault v's Dimensional Model

Post  ngalemmo Mon Aug 29, 2011 5:24 pm

dlinstedt wrote: Why haven't ALL the vendors jumped on the Netezza bus?

Netezza's architecture is patented. Their combination of an FPGA and disk is unique and integral to their performance advantage for certain operations (i.e. massive queries) over traditional MPP architectures.

dlinstedt wrote: If table scans are so great and powerful, how come Teradata enjoys so much success in MPP and normalized modeling?

Netezza has been very successful over Teradata in heads-up comparisons.


ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

http://aginity.com


Start from an example

Post  thomaskejser Mon Aug 29, 2011 5:26 pm

Hi Dan

I don't mean to attack you as a person. But please understand that when you make a claim that is simply wrong (as with the Windows OS) and present yourself as an authority on it, then it is not a blow below the belt to point out that you are not doing yourself a favour.

To answer your question about Windows: it can be BOTH a pre-emptive and a voluntary-yield system. The kernel itself is pre-emptive by default, but it is possible to build other scheduling mechanisms on top of it. This happens to be the case for SQL Server, which has a user-mode, voluntary-yield scheduler (called SQLOS). However, note that this really does not matter, as I/O operations are handled by a threading system dedicated to them. Because it takes micro- or milliseconds to issue IOPS (compared to nanoseconds for CPU cycles), it is a good idea to either task-switch or pipeline them for best system throughput. This also means that you will want the most out of every I/O operation - which I think we agree on?

I am not sure I follow your point on disk access fully. Are you claiming that sequential scans are unachievable or somehow not good? Please elaborate here. To give you some more information about disk subsystems (which I work with a lot): there is nothing inherently blocking at the hardware level that makes sequential access a problem, so we can rule that out as an argument. Also, note that you don't need to be fully sequential to optimize disk performance, you just need large block sizes. To give you some hard numbers on that: a 15K spindle will typically deliver within 80-90% of the theoretical max sequential rate (which is roughly 125MB/sec) if the block size is above 512K. Oracle, Netezza and SQL Server are all perfectly capable of generating this pattern, especially in a star schema. Compare this with a 64K block fetch (for example) for loop strategies: even if you squeeze a lot of rows into a block, you are still getting random, low-block-size IOPS. Typically, a 15K full-stroked spindle delivers 220 random IOPS, so those 64K blocks only land you around 14MB/sec. You have to squeeze in a LOT of rows to get even close to the sequential scan.

With regards to the model I am arguing for: I am NOT arguing for a single, super-wide table. I am talking about a star schema. I hope we agree that column stores do NOT require all the data to be pre-joined; they can be applied to the fact table alone, while maintaining the benefits of row-based storage (fast singletons) for the dimensions. Can we agree that in such a model, having fact tables with many hundreds of columns is not really an issue we need to concern ourselves with? If you disagree, then we must assume the Vault would suffer the same problem in the link tables - yes?

With regards to your class and the online evidence: I think it would be fair to say that the burden of evidence rests on your shoulders, in this forum, to show why Data Vault as an EDW model is superior to dimensional models. This is, after all, the question this thread concerns itself with. Since we are both in this forum, I hope we can agree about the immense contribution that Kimball has made to warehousing and that neither of our experiences match those of Kimball - we are both the upstarts here, and until we can show otherwise, the star schema is still one of the most proven DW models out there, even at very high scale (I have worked on a 1PB star schema). Please also note that I am not fully agreeing with the star schema as the full solution, especially when it comes to history tracking (where we to a large extent agree).

The point I am trying to make here is that your assumptions about what is "wrong" with a star schema are not issues with the modeling technique, but with the implementations people do. I am also trying to point out that normalization, as described in your Data Vault, is not the answer to those problems.

With regards to Teradata - they have a wonderful database engine, but it too suffers under the same laws of computing that all the other engines do. The fact that they are successful (as are others) cannot, I think, be used to advance any argument either for or against the Data Vault technique. Take SAN as a parallel example: a successful storage solution that many companies swear by - but that does not make it optimal or even very good for a lot of scenarios. In fact, SAN implementations, and their insistence on "magic solutions", have seriously damaged many database projects by over-promising and under-delivering on storage solutions (at a very high price point).


I think it would be useful to advance this argument by using a concrete modeling example and then going after those hard numbers you ask for. This I think would remove confusion about which specific technologies we are talking about and how to compare models. I would very much welcome such an argument.

thomaskejser

Posts : 12
Join date : 2011-08-14


Re: Data Vault v's Dimensional Model

Post  John Simon Mon Aug 29, 2011 6:48 pm

I am currently leading a team on a large Teradata implementation for a Telco.

As far as performance goes, normalised structures are nowhere near as performant as a dimensional model. Inmon recognises this, which is why he recommends using star schema data marts for the access layer, and Teradata recognises this, which is why I am building summarised fact tables and conformed dimensions in a Teradata database, as directed by Teradata consultants.

Nothing beats dimensional modeling for performance of queries.



John Simon

Posts : 186
Join date : 2009-05-29
Location : Melbourne, Australia

http://jsimonbi.wordpress.com


Again, no direct links - no answers to my questions.

Post  dlinstedt Tue Aug 30, 2011 6:11 am

1) A head-to-head test for performance is necessary, but are you going to give your time to do it? I too would welcome these numbers.
2) the Data Vault model is built based on semi-normalization. As indicated, it is JUST like a star schema dimension type 1, stripped down to the business key, and a factless fact. The star-join optimizer can & does work to the advantage of the Data Vault much the same way it works with star schemas
3) you have not shown definitively that more rows per block = slower performance. You stated that it's higher random I/O, but the only way it could be higher random I/O is if there is "more fragmentation" in the Data Vault, please explain how this is, and provide a case for why this happens, this has NOT been my experience
4) You insist on continuing to compare a front-end star schema delivery system with a back-end data vault EDW storage system. Why? I've said before: the Data Vault is for back-end data warehousing, NOT front end delivery. I have always said: put the star schemas on the front-end for delivery, be it virtual or not.
5) The Data Vault model solves other problems beyond performance, that the star schema has suffered from (when the star schema is used as a data model for enterprise data warehousing).
6) there is always a trade-off between flexibility and performance (normalization/denormalization). The hardware vendors are always pushing the boundaries for performance and capabilities. Before they began working with internal algorithms, firmware, and hardware layouts (like Netezza) - star schema denormalization suffered greatly in performance, I've seen it and lived through it first-hand.
7) you argue the finer points of column based databases which are highly normalized (vertical partitioned down to the individual column level for the physical implementation), yet you also argue that denormalization is the way forward? How are both possible at the same time? I would suggest that if Column based databases are to make it, then they need to find a way to break the scalability barriers they are facing today. Sybase IQ has an upper limit (which a large credit card company had to dump them because the engineering team at Sybase couldn't make the product scale beyond this specific point). The same goes for ParAccel, and Vertica and other column based data stores. My point is this: either Normalization is the "best way forward" or it's "not"... you can't have both and scale in to extremely large petabyte sized environments. If this were true, Teradata and MPP would not be succeeding at such large volumes where many other vendors can't even compete today.
8) I am familiar with Netezza; I know they have patents. But here's the point: when there's money to be made in a specific niche, companies always find a way to compete - they often file competing patents. I ask again: if denormalization is the way forward for all vendors, why haven't the rest of the vendors of the world jumped on the denormalization & extreme scalability & performance bus? Why was Netezza the only one? And yes, I know all about the FPGA. I toured the Netezza engineering facility in Massachusetts many years ago with Foster Hinshaw, co-founder of the company.
9) Netezza has beaten Teradata in head-to-head comparisons. True. But ONLY up to a certain SIZE. There is an upper limit to how much data a denormalized design can handle before it stops scaling. If you know of a comparison between Netezza and Teradata at or above 500 TB in which Netezza beats Teradata, great - I'd love to read about it. I've not seen one yet. If denormalization is the way forward, why aren't we all building star schemas on mainframes?
10) You still have not answered my questions by providing the references I seek; re-read the posts for the questions I have asked over and over again.
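
To make point 2 concrete, here is a minimal sketch of the three core Data Vault structures (the table and column names here are illustrative only): the Hub carries only the business key, the Link is a keyed intersection much like a factless fact, and the Satellite holds the descriptive attributes over time.

-- Hub: business key only, plus standard audit columns
CREATE TABLE hub_customer (
    customer_sid    int         NOT NULL PRIMARY KEY,  -- surrogate key
    customer_number varchar(20) NOT NULL UNIQUE,       -- business key
    load_date       datetime    NOT NULL,
    record_source   varchar(50) NOT NULL
);

-- Link: a relationship between Hubs, much like a factless fact
CREATE TABLE lnk_customer_account (
    link_sid      int         NOT NULL PRIMARY KEY,
    customer_sid  int         NOT NULL REFERENCES hub_customer (customer_sid),
    account_sid   int         NOT NULL,                -- would reference hub_account
    load_date     datetime    NOT NULL,
    record_source varchar(50) NOT NULL
);

-- Satellite: the descriptive attributes, tracked over time
CREATE TABLE sat_customer_address (
    customer_sid  int         NOT NULL REFERENCES hub_customer (customer_sid),
    load_date     datetime    NOT NULL,
    load_end_date datetime    NULL,
    city          varchar(50) NULL,
    postal_code   varchar(10) NULL,
    PRIMARY KEY (customer_sid, load_date)
);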

How on earth did we get to the point where we are debating "vendor implementations" anyhow? This is not the point of this thread. We are and were supposed to be discussing the benefits and drawbacks of modeling techniques.

Please point out my assumptions about star-schema that you say are "wrong." I'd like to get back to the original claims/issues/thoughts which have everything to do with modeling, and nothing to do with vendors or hardware.

Please also note that I am not fully agreeing with the star schema as the full solution, especially when it comes to history tracking (where we to a large extent agree)

?? Then why are we having this discussion in the first place, as this is my original point for using Data Vault ??

In other words: use the Data Vault for back-end historical data tracking; use star schemas, flat-wide tables, cubes, and whatever else is good to deliver the data to the business users... This has always been my original claim.

Cheers,
Dan L

dlinstedt

Posts : 7
Join date : 2010-03-15

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty More on block sizes:

Post  dlinstedt Tue Aug 30, 2011 8:39 am

With regard to block sizing: I am referring specifically to database block sizes (as this is the consistent measure that most RDBMS engines allow you to alter). Note: I am referring to Oracle, Teradata, DB2 UDB (EEE), SQL Server, Sybase, and MySQL. I am NOT referring to file allocation size (which is a different measure, and is not available in all databases). NOTE: I have tried to provide the links, but I am not sure this forum will allow them. Because of this, I have provided the titles and authors of the articles so they can still be referenced.

You and I have had a discussion about block sizes, pagefile.sys, and pre-emptive multi-tasking. I have provided references to each of these points below.

Regarding SQLServer block sizes:
SQLServer 2008 R2:
"SQL Server uses a fixed block size as a page of 8 KB. While it reduces a DBA’s ability to fine-tune I/O operations, it does provide for greater transportability and automation. With a fixed block size, SQL Server can automatically optimize I/O operations and memory utilization since it also knows exactly how much space is utilized with each I/O operation."
Source: Practical SQL Server 2008 for Oracle Professionals, Microsoft Engineering technical document, October 2008, under the header "Multiple Block Sizes" on page 20 (I would provide the link, but we are not allowed to post links here).


This is different from the file allocation unit size, which CAN be 64 KB:

File Allocation Size
An appropriate value for most installations should be 65,536 bytes (that is, 64 KB) for partitions on which SQL Server data or log files reside. In many cases, this is the same size for Analysis Services data or log files, but there are times where 32 KB provides better performance. To determine the right size, you will need to do performance testing with your workload comparing the two different block sizes.
Found in: Disk Partition Alignment Best Practices for SQL Server, written for 2008, found on-line in technical articles, written by: Jimmy May, Denny Lee, in 2009

I found a reference for you proving that PageFile.sys can and DOES have an impact on the performance of SQL Server 2008 (you told me that the page file is not used by SQL Server, and that I should get my facts straight...)
Pagefile configuration for SQL Server 2008 (it can be done, and the file IS used by the OS)
Document: Troubleshooting Performance Problems in SQL Server 2008, MSDN Technical Article
Written by: Writers: Sunil Agarwal, Boris Baryshnikov, Keith Elmore, Juergen Thomas, Kun Cheng, Burzin Patel
Indicates PageFile Settings are utilized to control the performance of "swapping" and memory blocks in Windows OS.

In fact: SQLServer Central posted this article:
Pre Check List for 64 BIT SQL Server 2008 Cluster Installation, Author: Rudy Panigas, 2009/12/31
in which the checklist has the following item:
Create a Local Disk for Page File
Create a dedicated disk on the locally attached storage unit that will contain the operating systems paging files. See System Performance Settings for more information on how to assign the paging disk to this dedicated disk.

Clearly, SQL Server 2008 R2 DOES use PageFile.sys, and PageFile.sys configuration and usage DOES have an impact on the performance of both normalized and denormalized data models. Therefore any argument about pagefile.sys and its utilization is a moot point, UNLESS it becomes a barrier to parallel operations. My claim is that table scans have an upper limit: when that limit is breached (by data width), parallelism cannot be achieved to break down the query and go after the underlying data set, potentially causing locking at the PageFile.sys level.

More on the PageFile.sys components:
Paging
This is an issue that has cropped up with the advent of 64bit OS’s. Windows Server 2003 changed it’s priority for non OS processes. This severely impacted SQL Server and created a heck of a time for SQL Server support. What happens, is when the OS feels it is experiencing external memory pressure, it begins choosing victims to steal their memory. In which cases non system level process will find their working set swapped to the paging file. The most obvious indication of this is the amount of page file usage. It should preferably be under 8%. I have seen some SQL Servers run ok on 10% however this is pushing it.
Written in: Troubleshooting SQL Server Slowness, Author: Derek Dieter, June 2009.
Link: SQLServer Trouble Shooting

Regarding pre-emptive & non-pre-emptive...
Now, SQL Server is in Non-Preemptive mode by default and it works fine. When CLR, extended Stored Procedures and other external components run, they run in Preemptive mode, leading to the creation of these wait types.
Article: SQL SERVER – PREEMPTIVE and Non-PREEMPTIVE – Wait Type – Day 19 of 28, written by: pinaldave

Joe Stack, from Microsoft: PERFORMANCE TUNING WITH WAIT STATISTICS
Wrote the following: on slide: 38
Preemption is act of an OS temporarily interrupting an executing task, involuntarily
•Higher priority tasks can preempt lower priority tasks
•When this happens -can be expensive
•Preemptive mode used in SQL Server for external code calls, CLR with an UNSAFE assemblies, APIs that could block for extended periods (example –extended stored procedures)

What I was referring to was NOT the preemptive/non-preemptive behavior of SQL Server, but the behavior of the Windows operating system. I hereby concede that SQL Server 2008 (in the form of external code calls) can execute preemptively. However, this says nothing about SQL Server's behavior when executing SELECT queries, and whether that SQL task is preemptive or not. I am still looking for references on this point, because it makes a difference to parallel query execution abilities and performance.

Why? Because the SQL Server 2008 engine clearly replaces SOME functionality at the OS level; however, it still relies on Windows OS calls and PageFile.sys to manage swapping, and thus - in a non-preemptive situation - a process can become "blocked" (waiting on disk I/O) and locked in the CPU for execution cycles until the disk returns with answers. Even though this operation takes milliseconds or microseconds, it can and does have an impact on parallelism, performance, and scalability.

As pointed out in yet another Microsoft Technet article:
Cooperative scheduling does not solve all scheduling problems. For example, tasks that do not fairly yield to other tasks can consume all available computing resources and prevent other tasks from making progress. The Concurrency Runtime uses the efficiency benefits of cooperative scheduling to complement the fairness guarantees of preemptive scheduling. By default, the Concurrency Runtime provides a cooperative scheduler that uses a work-stealing algorithm to efficiently distribute work among computing resources. However, the Concurrency Runtime scheduler also relies on the preemptive scheduler of the operating system to fairly distribute resources among applications. You can also create custom schedulers and scheduler policies in your applications to produce fine-grained control over thread execution.
Document: Comparing the Concurrency Runtime to Other Concurrency Models, July 2010

A little further down, the article discusses the scheduling and threading models of SQL Server 2008. The question is: does a table-scan read-ahead end up blocking I/O in order to read what is considered a wide-row table (4 KB or 8 KB row size) when you want all of the data from a SQL SELECT? Or will a parallel query with sufficient engines, and more rows per block, end up with more I/Os in parallel (taking less overall time), fewer I/Os in parallel (also taking less overall time), or the same number of I/Os in parallel (which should also take less time)?

Why less time? Because parallel tasks going after the complete data set split the work into parallel execution units. Do they force the disk to work harder? Maybe. Do they take less time? Usually - this has been the proven case in parallel computing since parallelism was first understood and implemented in computing systems. If this weren't the case, we wouldn't have multi-threaded CPUs today, nor would we ever have implemented parallel query engines or threads at the database level.

Regarding FPGA and Netezza:
Article: FPGA: What’s in it for a Database?, written by: Rene Mueller, Jens Teubner
Parallelism and processing power now shifted away from CPU, closer to data / disk.
FPGA has similar dimensions as a CPU, consumes 5 times less power and clock speed is about 5 times less.
Filtering out unnecessary data.
My point:
FPGAs move the parallelism out of the software and onto the hardware/firmware layer. Parallelism is still parallelism no matter which way you slice it. The argument put forward by Thomas that I have a hard time swallowing is: "parallelism is bad (read: normalization is a form of parallelism); table scans and sequential access are better/faster." I cannot believe this to be the case, especially when hardware vendors like Netezza PUT parallelism at the hardware level to handle the "join" problem against the data sets. Netezza goes one better by offering (forced) column compression, reducing I/O and storage needs and increasing performance. Netezza has yet another trick up its sleeve: it uses HASH-based functions to avoid indexing, but this IS an MPP data distribution (normalization) technique - based on key sets, the data is laid out / distributed to specific disk blocks. What executes next is what makes all the difference: FILTERING at the hardware level. I am suggesting that, with normalization and the way the Data Vault is laid out, anyone can take advantage of this. With parallel query optimizers against the Data Vault, both table and column elimination (filtering) can take place - unfortunately just not at the hardware/firmware level, which is what leads people to believe that "joins are bad", or slower than scanning denormalized tables.
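
As a minimal illustration of that filtering point (continuing the invented customer tables from the earlier sketch): a query that only needs address attributes reads the Hub and the address Satellite, and no other Satellite appears in the plan. With outer joins over unique keys, an optimizer that supports join elimination can do the same pruning behind a covering view.

-- Only the Hub and the address Satellite are scanned; Satellites holding
-- other attribute groups are simply never touched.
SELECT h.customer_number,
       s.city,
       s.postal_code
FROM   hub_customer         AS h
JOIN   sat_customer_address AS s
       ON  s.customer_sid  = h.customer_sid
       AND s.load_end_date IS NULL          -- current version of the row
WHERE  h.customer_number = 'C-1001';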

Ahh, but there are limitations on the FPGA side as well:
Limits to this flexibility are set by the availability of each type of resources. The most apparent limitation is the number of lookup tables of a given FPGA chip, which can limit the complexity of the logic circuit that can be programmed to it. Other potential limits include memory requirements, the availability of additional in-silicon functionality, or the bandwidth of the interconnect fabric.

If normalization were so bad, it would make sense to use mainframe computing power (with its vast ability to handle super-wide file stores) to build and house our data warehousing systems.

I want to make one more point: I agree with everyone that OVER-normalization can lead to a down-swing in performance that has to be compensated for by hardware (hence the column-based appliances - they do EXACTLY that: they over-normalize, but make up for it with column compression and hardware performance). I still maintain that over-DEnormalization (rows that are too wide) leads to a down-swing in performance as well, and again must be compensated for by hardware (hence mainframes and Netezza).

The secret to any and all performance is striking a balance between architecture, flexibility, scalability and purpose.
A balance between parallelism, normalization, hardware capabilities and design must be struck for common sense and the best possible outcome.

The purpose of the Data Vault model is to be a back-end historical data store, NOT a front-end ad-hoc query machine.
The purpose of the Star Schema model (my opinion only) is to be a front-end pivot, semi-historical data store, not a back-end fully accountable historical machine.

One more note: different data model designs will perform differently on different hardware and different infrastructure... hence the separation of logical and physical data models over the years.

My questions to the readers:
So: if they split the work up (divide the processing), and then suggest that horizontal (range/hash) partitioning of data is another scalability and performance measure, why can't vertical (column-based) partitioning improve performance further?
What the Data Vault proposes is exactly that: horizontal and vertical partitioning, working together with parallel query and parallel processing engines, to reach the most even distribution of work across execution units. Does it mean more joins? Yes. Is it slower than a single table scan? The answer depends on: block size, pagefile (or an operating system that doesn't use a pagefile), disk speed, tuning switches at the database level, amount of data (overall in the table), the width of each row, the number of parallel processes (that can be executed), and read-ahead buffers (for table scan purposes).
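
As a hedged sketch of combining the two (SQL Server 2008 partitioning syntax; the boundary dates and names are made up): the Satellite is already a vertical slice of the customer attributes, and it can additionally be range-partitioned horizontally by load date.

CREATE PARTITION FUNCTION pf_load_date (datetime)
AS RANGE RIGHT FOR VALUES ('2010-01-01', '2011-01-01');

CREATE PARTITION SCHEME ps_load_date
AS PARTITION pf_load_date ALL TO ([PRIMARY]);

-- The same kind of Satellite as before, now horizontally partitioned by load_date
CREATE TABLE sat_customer_address_part (
    customer_sid  int         NOT NULL,
    load_date     datetime    NOT NULL,
    load_end_date datetime    NULL,
    city          varchar(50) NULL,
    postal_code   varchar(10) NULL,
    CONSTRAINT pk_sat_customer_address_part
        PRIMARY KEY (customer_sid, load_date)
) ON ps_load_date (load_date);

Partition elimination on load_date then narrows the horizontal slice, while the Satellite split itself narrows the vertical slice - which is the combination described above.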

The only way to truly test the performance is to set up a specific, agreed-upon test case in which the data models are made public and approved by both parties, and the hardware, software, and database versions are well documented and understood. Then try it with different row widths, as well as different numbers of rows (overall data size). Finally, run the same tests on different database engines - realizing, of course, that the numbers from different engines will be apples and oranges (no more than a rough approximation, because there are too many variables to count), but it will still give a good indication of results across the multiple platforms.

One final note: I was unable to locate ANY statement describing the physical, exact operation of PageFile.sys - something that says "a single write to the pagefile, or a single read from the pagefile, is [or is not] a blocking operation." Lacking that evidence, there is no way to determine conclusively whether PageFile.sys access is truly a preemptive operation. I do realize that the OS has preemptive scheduling layers, and that's great - but when it comes down to brass tacks, the answer to this question needs to be found and documented.

dlinstedt

Posts : 7
Join date : 2010-03-15

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty Re: Data Vault v's Dimensional Model

Post  hang Tue Aug 30, 2011 9:12 am

dlinstedt wrote:use the Data Vault for back-end historical data tracking, use star schemas, flat-wide, cubes, and whatever is good to deliver the data to the business users... This has always been my original claim.
Virtual or physical star schema? There is a fundamental difference between them which basically draws the line between Kimball and Inmon.

I would be more than happy to see some standard, disciplined modeling technique like Data Vault that underpins MDM and tracks all the change history as another upstream data layer feeding the dimensional data store (DDS), just like the DDS feeding the cube. In this model, it would be a lot easier to balance the ETL process between two distinctly different focuses. One deals with data integration, data entry through applications, cleansing, data quality and so on. The other is mainly focused on SCDs, surrogate key substitution and the like.

I think we all agree we definitely need a star schema for cubing and trend analysis. However, some of the key elements like fact grain and dimensionality need to be clear at the physical level. When we are debating performance, let's not forget that ease of use is another purpose of dimensional modeling, or data warehousing in general.

Under this structure, the MDM layer could also provide near-real-time reporting if needed. It would also help to rebuild the DDS and cube without any hiccups. I would not be surprised to see many BI vendors come up with similar product lines.

hang

Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty Re: Data Vault v's Dimensional Model

Post  ngalemmo Tue Aug 30, 2011 11:55 am

When we start talking about hardware architectures and block sizes and such, it's a good indication that the discussion has gone off the rails. Frankly, none of that stuff matters, and it changes every 18 months or so.

When discussing dimensional modeling versus Data Vault (or Inmon's CIF), the question is: can a dimensional EDW stand alone as the primary data store and historical record, or is it necessary to create a repository from which data marts are published? Frankly, I believe a dimensional EDW can stand on its own.

So, when comparing a dimensional EDW with a Data Vault, the issue is not about joins, complexity, or performance. When you commit to a store-and-publish architecture, access to the repository (Data Vault or 3NF database) is a background process that publishes into cubes, star schemas, flat files or standard reports. No (or very little) ad-hoc work at all... as that is all done in the published marts.

The issue is: do you need to do it at all? Is there some data structure, relationship or historical record that cannot be represented in a dimensional model? If not, why would one want to build an entire infrastructure simply to store data, and then build another infrastructure so people can use the data?
ngalemmo
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

http://aginity.com

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty Some answers...

Post  thomaskejser Tue Aug 30, 2011 12:15 pm

Hi Dan

I think we are moving toward common ground now. Let me restate a few things I think we agree on (please chime in)

- There are two different problems that must be solved: ad-hoc Reporting and audit/archive
- We agree that for reporting purposes, a star schema is the right model
- We agree that hard numbers are difficult to come by and that vendor to vendor compares are hard
- I hope we also agree that if we position two models against each other, we can calculate backwards to the IOPS, Memory and Network required to perform operations (and get in the right O-notation ballpark on CPU)
- We agree that parallelism is good and that it is a desired characteristic of a database
- We agree that stripping down a star schema to type 1 is an interesting approach (and one that I am also recommending). But I disagree with your idea of making every relationship into a link. Also, if "strip down a star schema to type 1" is what Data Vault is, then call it that. I don't understand why you want to invent all those new terms and make grand claims that it helps solve scale or archival in some magic way that star schemas do not.

I am NOT claiming that parallelism is bad - but I am telling you that normalization does not give you more of it. On the contrary, loop join strategies (which are the ones that scale well in normalized models) are extremely expensive in CPU cycles compared to hash strategies.
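
For what it is worth, one simple way to see this difference for yourself is to force each strategy on the same join and compare the timings (SQL Server hint syntax; the table names are purely illustrative):

SET STATISTICS TIME ON;

-- Force nested loops (the strategy that favors selective, indexed lookups)
SELECT h.customer_number, s.city
FROM   hub_customer         AS h
JOIN   sat_customer_address AS s ON s.customer_sid = h.customer_sid
OPTION (LOOP JOIN);

-- Force a hash join (typically far cheaper in CPU for large scans)
SELECT h.customer_number, s.city
FROM   hub_customer         AS h
JOIN   sat_customer_address AS s ON s.customer_sid = h.customer_sid
OPTION (HASH JOIN);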

With regard to pre-emptive vs. cooperative multitasking - I think that is going off on a tangent. I assume we agree that every modern operating system scales well, even on large-scale nodes?

Can we also agree that super-wide tables with a LOT of columns (thousands) are a non-issue for data models? You can always model "downwards" instead, by unpivoting data.

If I understand you correctly, you claim:

- That column stores and star schemas do not scale to the PB size for queries (we should agree WHICH queries you mean, reporting?)
- That a normalized model a la Data Vault DOES scale to the PB size, even when you need to join two or more huge tables that are not co-located?

On this note: I simply don't follow how you get from "you need more IOPS to scan a wide table if you only need a few columns" to "you should normalize by using Data Vault guidance". If you said that there are cases where vertical partitioning of fact tables matters (i.e., a 1-1 "normalization"), then I think we could be more on common ground.

Could you also elaborate on what you mean by "blocking I/O"? Do you mean that the task blocks while it waits for I/O, or that the I/O system itself gets blocked? To use SQL Server as an example, it will use I/O completion ports to issue the I/O request. They are described here:

http://msdn.microsoft.com/en-us/library/aa365198(v=vs.85).aspx

Highlight this note: "I/O completion ports provide an efficient threading model for processing multiple asynchronous I/O requests on a multiprocessor system" - the details are described in the paper. You will see that neither pagefile.sys, nor blocking of the entire I/O system is involved. Note that EACH port has its own concurrency value, you can open as many as you like (until you run out of memory or threads :))

With regard to the 8K, 64K and 512K questions: the Fast Track guidance gives you some good basic information on this: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/fast-track.aspx. Apart from that, the best source really is Russinovich: http://technet.microsoft.com/en-us/sysinternals/bb963901. If you want to convince yourself that larger-than-64K I/O requests are possible, it is easy to use a tool like SQLIO or IOMeter to set up a test and simply measure the request sizes using Perfmon (the counter you are looking for is Avg. Disk Bytes/Read). You can also use this reference to read about SQL Server I/O scaling and balancing: http://msdn.microsoft.com/en-us/library/ee410782.aspx (incidentally, partially written by yours truly). You might also want to look at the read-ahead described here: http://msdn.microsoft.com/en-us/library/ms191475.aspx - a similar construct exists in all major databases.

With regard to pagefile.sys: I don't understand why you are so focused on that file. It is just the virtual memory backing the RAM. Database engines on the Windows platform do not use this file unless an exceptional condition occurs (they typically pin memory in non-swappable large pages instead), and I/O operations just don't touch it. Perhaps the exceptional conditions you found are why you are wondering about it? It is true that we will sometimes dedicate a disk to pagefile.sys, but this is done to handle the rare cases where the SQL Server process fights with another process for memory, for example when Analysis Services is running at the same time (the large pages do not protect the exe itself from being swapped out - so you want the exe back in main memory ASAP). This is NOT what is done on a large-scale SQL system. On a poorly configured cluster this can cause the heartbeat to fail, which is why a dedicated drive is recommended "just in case". So please read this in the right context.

With regard to fragmented reads and random/sequential I/O: let us just assume that all pages are full. This is achievable with proper loading techniques (described here: http://msdn.microsoft.com/en-us/library/dd425070(v=sql.100).aspx). In that case, a high read-ahead, 512K block still beats random seeks. Also note that SQL Server has an optimization called "unordered scan" where the order of index pages does not matter with regard to sequential IOPS (you can read about unordered scans here: http://www.sqlmag.com/article/sql-server/quaere-verum-clustered-index-scans-part-i).

I will post more, but trying to catch up on all our open questions.

thomaskejser

Posts : 12
Join date : 2011-08-14

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty Dimensional repository standing on it's own

Post  thomaskejser Tue Aug 30, 2011 12:38 pm

Hi ngalemmo

Thanks for chiming in. I agree that there is no need for any other repository than the dimensional model. However, I think Dan makes a very strong point in: http://www.tdan.com/view-articles/5054/

"The problem is scalability and flexibility. If an additional parent table is added, the change is forced to cascade down through all subordinate table structures. Also, when a new row is inserted with an existing parent key (the only field to change is the date-time stamp) all child rows must be reassigned to the new parent key."

This is really just a fancy way of saying that if you change your mind about type 2 history tracking, you are faced with a problem on your fact table keys, because you now have to update the keys in all affected facts. I think this is the big one that has hurt some dimensional modelers when they try to scale their data model - especially if they have trouble writing a fast update (and there are clearly engines that are better at this than others) or believe that there is some law that makes updates hard for databases to do at scale.

I think there is some point in "carving off" the type 2 attributes into a separate table because of this issue (similar to Dan's Satellite). The issue with doing this is that you now have to do a nasty BETWEEN join (on from/to dates) in order to access fact data. Such joins are hard for query optimizers to handle, because it is difficult to get a good cardinality estimate. Of course, the increased price of the BETWEEN also bubbles up to either the user or the data mart loader - something that has to be accounted for.


Last edited by thomaskejser on Tue Aug 30, 2011 12:39 pm; edited 1 time in total (Reason for editing : spelling errors)

thomaskejser

Posts : 12
Join date : 2011-08-14

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty Re: Data Vault v's Dimensional Model

Post  ngalemmo Tue Aug 30, 2011 2:41 pm

If an additional parent table is added, the change is forced to cascade down through all subordinate table structures.

I guess it refers to adding a new dimension to an existing fact table. It can happen, but if the model was designed right to begin with (an enterprise view rather than building to specific reports) it would be rare. And if you have to apply the change retroactively, it is a little bit of work (maybe 10-20 hours).

Also, when a new row is inserted with an existing parent key (the only field to change is the date-time stamp) all child rows must be reassigned to the new parent key.
I have no clue what this is referring to. If it is a type 2, you never go back and change fact keys when new rows are added. Even if you change the dimension from a type 2 to a type 1, there is never any reason to rekey the facts... you simply adjust the dimension.


The issue with doing this is that you now have to do a nasty BETWEEN join (on from/to dates) in order to access fact data.

Why would you need to do this? It's a simple join between a fact and a type 2 dimension. The only time you would filter a join against a type 2 is if you want to adjust the dimension to a point in time other than the point in time of the fact - in other words, getting the current dimension values. Simply implementing both a type 1 and a type 2 version of the dimension clears that up.
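
One hedged way to implement that (column names are illustrative, assuming the type 2 table carries a durable customer_id and an open-ended to_date on the current row): expose a type 1 "current values" view on top of the type 2 table, so every historical surrogate key resolves to the current attribute values while the type 2 table itself stays untouched.

CREATE VIEW dim_customer_current AS
SELECT d.customer_sk,           -- every historical SK is kept...
       cur.name,                -- ...but shows the current attribute values
       cur.city
FROM   dim_customer AS d
JOIN   dim_customer AS cur
       ON  cur.customer_id = d.customer_id
       AND cur.to_date = '9999-12-31';      -- the current row

Facts keyed to any historical SK then report current values through this view, while the type 2 table remains available for as-was analysis.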

The thing is, none of these are issues that cannot be resolved or handled in a dimensional model. There may be some simplicities that the Data Vault provides, but at what cost? It involves creating a separate and distinct environment purely to store data, with no guarantee that you won't still need to do some rework. It's a lot of effort to maybe save a few man-weeks over a period of years because of unknown or unanticipated model changes.

There are a few other choice tidbits in the article:


One of the most difficult issues of a conformed data mart (or conformed fact tables) is getting the grain right. That means understanding the data as it is aggregated for each fact table and assuring that the aggregation will stay consistent for all time (during the life of the relationship) and the structure of each fact table will not change (i.e., no new dimensions will be added to either fact table). This limits design, scalability and flexibility of the data model.

This is kind of alluding to my earlier comment about building it right. If for a moment you think that data is aggregated for a fact table, you don't understand dimensional data warehousing. Fact tables should be built at the lowest level of detail available from the source data. Period. Aggregation is purely optional, stored in separate fact tables, and usually done for performance reasons.

Another issue is the “helper table.” This table is defined to be a dimension-to-dimension relationship Link.

A helper or bridge table sits between a fact and a dimension. It is not a dimension-to-dimension link. From a relational point of view, it is an associative entity used to resolve many-to-many relationships. Since a many-to-many relationship is a function of the relationships in the data, and not of the modeling methodology, I fail to see how this is an issue with dimensional design.

Here's the thing: the Data Vault introduces a level of abstraction over what we would normally view as a data structure (a row and its columns). Fine, nothing wrong with that. And if I were to implement a store-and-publish architecture for a data warehouse, it is a methodology well worth considering. But are the data structures used in business really so dynamic that relatively minor maintenance issues matter? I haven't seen that. What I have usually seen is large organizations spending years and countless millions of dollars updating their ERP. With timeframes like that, I wouldn't sweat the occasional schema change.
ngalemmo
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

http://aginity.com

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty The Type2 driven update statement

Post  thomaskejser Tue Aug 30, 2011 2:58 pm

The thing is, none of these are issues that cannot be resolved or handled in a dimensional model. There may be some simplicities that the data vault provides, but at what cost? It involves creating a separate and distinct environment purely to store data, with no guarantee that you may need to do some rework. Its a lot of effort to maybe save a few man weeks over a period of years because of unknown or unanticipated model changes.

...We are in agreement, ask Dan :-)

With regards to the Type2 changes, here is the scenario:

Let us say you have some fact tables keyed to the customer dimension. You decide to do type 2 change tracking on the customer's city and type 1 on the name.

At T(0) your world looks like this:

SK,CustomerID,Name,City,From, To
1, A, Thomas,London,2000, 2001
2, A, Thomas,Berlin,2002, 9999

Now, at T(1), you decide that you also want to do type 2 tracking on customer education. Education changes faster than city, so you now get a larger customer dimension with more rows and more surrogate keys. For example (using floats for the SKs, just for readability):

SK,Customer,Name,City,Education,From, To
1,A,Thomas,London,None,2000, 2001
1.1,A,Thomas,London,Master,2001, 2002
2,A,Thomas,Berlin,Master,2002, 9999

Here is the issue: the large fact tables that are keyed to the London row (SK=1) now need to have their SK updated to 1.1 if the transaction date is in 2001 and the education has changed to "Master". This could potentially be a rather large update statement on several fact tables (travel the dependency tree here). If you are a PB-sized installation and not careful about writing that update correctly, you could be looking at days, or even weeks, until the change has taken effect (DBAs get this type of thing wrong all the time).
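
For illustration, here is roughly what that re-key looks like for a single fact table (a sketch in SQL Server UPDATE...FROM syntax; fact_sales, transaction_date and the from_date/to_date columns are invented names, with From/To renamed because FROM is a reserved word). Every affected fact table needs the same treatment:

UPDATE f
SET    customer_sk = d_new.sk
FROM   fact_sales   AS f
JOIN   dim_customer AS d_old
       ON  d_old.sk = f.customer_sk                 -- the key the fact holds today
JOIN   dim_customer AS d_new
       ON  d_new.customer_id = d_old.customer_id
       AND f.transaction_date >= d_new.from_date    -- the row that now covers
       AND f.transaction_date <  d_new.to_date      -- the fact's date
WHERE  f.customer_sk <> d_new.sk;                   -- only rows that actually move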

The alternative is to "fork off" the history, so you get this two-table "dimensional" structure:

StableSK, Customer, Name
1,A, Thomas

HistorySK, StableSK, City, Education, From, To
42, 1, London, None, 2000, 2001
43, 1, London, Master, 2001, 2002
44, 1, Berlin, Master, 2002, 9999

If you key the fact on StableSK, you can change your mind as much as you like about the type 2 changes without incurring the big update. Of course, this means that the join between the history table and the fact table is much more expensive (since it is now a non-equi join - the BETWEEN I was referring to). I think that tradeoff is generally worth it, but the obvious optimizer troubles of course rear their ugly heads here.
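
For completeness, a minimal sketch of the query pattern against that forked structure (again with illustrative names): the fact carries the StableSK, and the type 2 attributes are picked up with the non-equi join.

SELECT c.name,
       h.city,
       h.education,
       SUM(f.amount) AS total_amount
FROM   fact_sales        AS f
JOIN   dim_customer      AS c
       ON  c.stable_sk = f.customer_stable_sk
JOIN   dim_customer_hist AS h
       ON  h.stable_sk = c.stable_sk
       AND f.transaction_date BETWEEN h.from_date AND h.to_date   -- the "nasty" part
GROUP BY c.name, h.city, h.education;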


Last edited by thomaskejser on Tue Aug 30, 2011 3:02 pm; edited 2 times in total (Reason for editing : Minor spelling errors)

thomaskejser

Posts : 12
Join date : 2011-08-14

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty Re: Data Vault v's Dimensional Model

Post  ngalemmo Tue Aug 30, 2011 3:29 pm

The issue is retroactively adjusting history, not changing the conditions of a type 2. You can always change a type 2 and let things fall into place moving forward without a lot of effort.

And yeah, if you had to retroactively change things, it would be a real pain in the neck. But, as yet, I haven't had to. And the idea of a stable surrogate key has been discussed before, but as an alternate key on the type 2, not as the PK (in conjunction with a date) in one table rather than two.

I don't like the forking idea (no pun intended), as it impacts query performance for all time to cover the rare occurrence of a one-time rekeying of the fact table for retroactive history adjustments - assuming old history is even available from the source system.
ngalemmo
ngalemmo

Posts : 3000
Join date : 2009-05-15
Location : Los Angeles

http://aginity.com

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty To fork or not to fork

Post  thomaskejser Tue Aug 30, 2011 3:50 pm

ngalemmo wrote:
I don't like the forking idea (no pun intended) of the static key as it impacts query performance for all time over a rare occurance of a one-time rekeying of the fact table to do retroactive history adjustments, assuming old history is even available from the source system.

Still, a good pun :-)

I think it is a valid question to ask: "How large is that performance impact?" The re-keying is a real problem that I have experienced at scale. With badly written update statements, we are literally talking weeks of running, even on a "largish" 100 TB system. You are amortizing a potentially VERY high pain over a lot of smaller user queries - and this may be a good idea, depending on the impact. But a well-written UPDATE statement on an MPP should get you into the tens of GB/sec to the disks (especially if you are laid out well for sequential access :-)

So the question is: is it worth giving up hash joins and trading them for loops (painful)? Also, most column stores struggle with this particular query pattern.

Note that by doing the forking trick, you only affect queries that need the type 2 history; type 1 queries can still go to the "StableSK" table.

I am also curious to know whether any database vendors have a nifty way to perform this BETWEEN from/to join efficiently (there are clearly ways to get some of the hash performance here). I hope a DBA will chime in.

thomaskejser

Posts : 12
Join date : 2011-08-14

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty Re: Data Vault v's Dimensional Model

Post  hang Tue Aug 30, 2011 5:08 pm

ngalemmo wrote:Can a dimensional EDW stand alone as the primary data store and historical record or is it necessary to create a repository from which data marts are published.
Hi Thomas, I suspect this is the intention of Master Data Services (MDS) in Denali (SQL Server). I have not got my hands on the new feature, but I have seen a powerful demo at PASS where you can incrementally add (publish) new attributes to any dimension without a developer doing any DDL or updates on the table.

I have also seen the session by Joy Mundy explaining the role MDS can play in ensuring data quality, proper metadata management, etc., to streamline the ETL process.

hang

Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty MDS and metadata

Post  thomaskejser Tue Aug 30, 2011 5:20 pm

Hang

I think that metadata and MDS are orthogonal to the "store and publish, or just use a star schema" architectural question. No matter which architecture you build, you will need both metadata and MDS to make it manageable at scale. Also, both Dan's Vault model and the star schema can be the recipients of semi-automated ETL flows auto-generated from metadata. I have built meta frameworks that load both of them - and they are equally simple to load, though the star schema lends itself better to parallelism because it has fewer key constraints to worry about. Of course, you can turn off key validation in the Data Vault, but that kind of defeats the purpose (and also leads optimizers astray).

All that being said, the notion that users can build their own hierarchies and add their own attributes to dimensions is a rather tempting idea, and one you can see in Denali. There are cases where it makes sense for a user to have their own copy of a dimension table and then populate the attributes as they see fit (while maintaining the surrogate key). This of course creates some interesting indexing problems - but I think they are surmountable, even at scale.

thomaskejser

Posts : 12
Join date : 2011-08-14

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty Re: Data Vault v's Dimensional Model

Post  hang Tue Aug 30, 2011 6:20 pm

Thomas

Thanks for the comments. However, MDS does have a big implication for the ETL process, which plays a big part in, and largely determines, the success of a dimensional modeling architecture. I guess with MDS in place the ETL process would become more manageable, or somewhat automated, as you put it.

Could you please also clarify whether the underlying data structure for MDS is more in line with Dan's Data Vault model, and quite different from the dimensional model or any other traditional 3NF model per se?

hang

Posts : 528
Join date : 2010-05-07
Location : Brisbane, Australia

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty MDM looks more like Vault - or does it?

Post  thomaskejser Tue Aug 30, 2011 6:45 pm

Hi Hang

I agree that a master data repository looks a lot more like a Vault at first glance. However, just because a schema has tables with start and end dates does not make it a Data Vault - it just makes it a system that captures changes to tables (OLTP designers may have designed such "log tables" using triggers before).

If you are designing the schema for a master data management system, you generally want that schema to be extensible from the client, to allow business users to model the world as they think it looks. This means that you are often building something that is a form of name/value pair database - but one that should also contain a materialized version of the same data in something that looks a lot like type 2 dimensions, so it is easy to query. Is that a Data Vault, or just sound 3NF design?...
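
Purely for illustration - this is not the actual MDS schema - a minimal sketch of the name/value-pair side of such a design, where user-defined attributes land as rows and a separate step materializes them into type-2-style tables for querying:

-- Extensible attribute storage: one row per member, attribute and validity interval
CREATE TABLE member_attribute (
    member_id       int          NOT NULL,
    attribute_name  varchar(100) NOT NULL,
    attribute_value varchar(400) NULL,
    valid_from      datetime     NOT NULL,
    valid_to        datetime     NOT NULL,
    PRIMARY KEY (member_id, attribute_name, valid_from)
);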

When I think about MDM, I think of it as one (maybe the only, if done right) source of dimension data - not facts. Because of that, it is a small database system that can take some liberties about the schema it uses (performance is less critical, though still important). But even small databases like that still have a lot of value - especially if the MDM knows how to handle keys properly.

Be aware that the MDM can actually make your life harder as an EDW designer: cleaning data can take a long time if it is not done with good tuning. This means that the MDM can be "lagging behind" the sources from the perspective of the EDW (which may be connected directly to message buses and source extracts). This can lead to all sorts of interesting late/early-arriving key problems. It is also not unusual to see the EDW solve data quality problems (because it needs that solution NOW in its reports) before the rest of the organization is done with vendor selection for an MDM system. You may have experienced the interesting political situations and discussions about who does the better job of handling keys and data cleaning - the EDW or the MDM?

With regard to key management: on my blog at http://blog.kejser.org I have described structures that I call "map tables", which are an important part of the ETL flow I argue for. They look somewhat like Dan's Satellites, but as you can see in some of the entries, they serve a different purpose and have some interesting characteristics that help you manage key errors in the sources. Ideally, an MDM repository would supply tables with that structure to the EDW - along with the cleaned source data for dimensions.


Last edited by thomaskejser on Tue Aug 30, 2011 6:52 pm; edited 1 time in total (Reason for editing : Clarifications)

thomaskejser

Posts : 12
Join date : 2011-08-14

Back to top Go down

Data Vault v's Dimensional Model - Page 2 Empty Re: Data Vault v's Dimensional Model

Post  dm_unseen Wed Aug 31, 2011 4:13 am

Thomas,

Your "map tables" appear (in slightly different form) in a Data Vault, in the Business (Rule) Data Vault section. Your alternate keys can be satisfied with key Satellites ("keysats"). Basically the same look and feel, but slightly more flexible and, because of auditability, also usable for data quality feedback loops to the OLTP source systems.

Another problem I have with all of your stuff is that most of it should handle bitemporality as well, or else you will never be able to fix business rules without destroying your auditability. This goes for most of this discussion, BTW.

Also, when only the present is important in a DV, you can always partition off the past; logically, the past and present of the same information still form one entity (except for model changes). Physically splitting them and saying they are different is just physical optimization at the database schema level.

However, I do seriously doubt that your architecture will allow for a (small) incremental and flexible DWH implementation without serious rework. IMO it still looks too much like a "big bang", which I think is one of the biggest issues with Kimball and "old style" Inmon.

One of the interesting philosophical differences between Data Vault and Kimball is that Data Vault tries to reconcile and align the sources and the DWH as much as possible, while Kimball tries to have "improved" or even "the best" data, quite distinct from the sources. Or to put it another way, Kimball polarizes your data between sources and DWH, while DV tries to unify them. From a more holistic data/information architecture standpoint (not just DWH), I think unification is the better approach.

dm_unseen
dm_unseen

Posts : 4
Join date : 2009-02-03
Location : Netherlands

Back to top Go down

