In this article we’ll identify how normalization and denormalization differ between OLTP and Data Warehouse models.
We’ve all heard about normalization, based on CODDS theory and how it’s supposed to be applied in the design phases of application development. First let’s describe the requirements that get us to the level of intuitive (3NF) normal form:
1NF – no repeating groups – ensures model flexibility. Remove all repeating attributes or groupings of fields and place them in their own entity.
2NF – all non key attributes depend on the Primary Key (PK) – ensures model consistency. Ensure all non key attributes are dependent on all PK attributes.
3NF – no transitive dependencies. Ensure that no non-key attribute depends on any other non-key attribute.
Each of these “tests” ensure the flexibility, consistency and integrity of the OLTP model, and are applied in series (it is said that you can’t be ‘in’ 3NF until you are in 2NF, but I can find exceptions to this rule). The further you go through normalization (4NF, BOYCE-CODD NF, 5NF), the less the risk that a processing inconsistency, however caused, will compromise the data integrity. Once the model is normalized, you can effectively denormalize the design for performance purposes (you can only denormalize a normalized model).
But let’s reflect on OLTP normalization practices in the data warehouse. The worry over integrity and consistency is minimized since the ETL programmers spend all weekend fixing the source system data errors in the ETL layer. So our target data warehouse is as close to 100% accurate as possible, and we will market this aspect back to the business community, noting the data warehouse is the definitive source for answers to business questions. Further to this the data warehouse flexibility will be supported in the model through structural (intentional) redundancy in the Dimensional model designs, which is acceptable given that the ETL folks maximized integrity and consistency on the incoming side to the data warehouse.
How then can we normalize the DATA WAREHOUSE model, without the primary focus being data integrity and consistency, so that it leads to effective denormalization? Let’s look at some alternative approaches.
First approach – DATA WAREHOUSE’s are Subject oriented – so we can normalize the DATA WAREHOUSE data by grouping it by business subjects. So from all of the source applications, you can model a single CUSTOMER Subject, that comprises all of the required analytical data pertaining to CUSTOMER (there may be many entities in the CUSTOMER SUJBECT).
Second approach – DATA WAREHOUSE’s are time variant, that is most analysis involves forecasting the future based upon previous events. The challenge arises when the data’s nature or values changes to the extent that historical analysis can no longer be applied to the current state of the data. An approach that will yield much more flexibility is to group (normalize) the data into classes of data that change frequently, and separate them from classes of data that never change. This will lead to a more effective application of denormalization tactics, of which Ralph Kimball’s Type 1,2, and 3 slowly changing dimensions are the most common examples.
Third approach – We can also normalize based on the nature by which we analyse the data – that is group together data that is analysed at specific times (State based analysis) vs data that is analysed as it happens (Event based analysis). State based analysis requires a balanced periodicity (e.g. every Sunday we publish a new state for the given data), whereas Event based analysis can be published as it happens in the source system (and is sometimes the focus of operational analytics).
Within denormalization, we strive for performance and availability, very typical goals with both OLTP and data warehouse systems. With an OLTP model the performance requirements are dictated within the Service Level Agreement obligations (maximum transactions per minute, speed of queries less than x seconds, transaction consistency and data consistency a must).
In the DATA WAREHOUSE we manage performance by firstly managing user expectations (providing realistic turn around times, such as ‘no you can’t get a 500 million row table to join a 100 M row table to join a 10M row table and run in less than 3 seconds!’), and by utilizing scalable hardware and DBMS objects (MQT’s, Indexable VIEWS, Materialized Views). The point then is that denormalization in the DATA WAREHOUSE has similar considerations and approaches, with the recognition that we can focus on approaches that are not so centric to maintaining data integrity.
In this article we’ve provided the characteristics of 3NF within OLTP models, and identified why we need not be constrained by these requirements within the data warehouse. We’ve provided 3 alternative approaches to Data Warehouse normalization and discussed the goals of denormalization in each system.
Next time you hear the phrase “that data warehouse model is in 3NF”, look closely at the structures and see if there are alternatives that may yield more flexibility for the data warehouse analyst.