The Data Administration Newsletter (TDAN.com)
THE DENORMALIZATION SURVIVAL GUIDE - PART 1
Steve Hoberman - Steve Hoberman & Associates, LLC
This is the first of two articles on the Denormalization Survival Guide, adapted from Chapter 8 of Steve Hoberman's book, the Data Modeler's Workbench (you can order this book through Amazon.com). This first article focuses on the dangers of denormalization and introduces the Denormalization Survival Guide, a question-and-answer approach to applying denormalization to our data models. The second article will discuss the questions and answers that comprise this guide when modeling a data mart.
In her book Handbook of Relational
Database Design, Barbara von Halle offers a great definition of
denormalization: “Denormalization is the process whereby, after defining a
stable, fully normalized data structure, you selectively introduce duplicate
data to facilitate specific performance requirements.” Denormalization is
the process of combining data elements from different entities. By doing so, we
lose the benefits of normalization and, therefore, reintroduce redundancy into
the design. This extra redundancy can help improve retrieval time performance.
Reducing retrieval time is the primary reason for denormalizing.
My favorite word in von Halle's definition is “selectively”.
We have to be very careful and selective where we introduce denormalization
because it can come with a huge price even though it can decrease retrieval
time. The price for denormalizing can take the form of these bleak situations:
Update, delete, and insert performance can suffer.
When we repeat a data element in two or more tables, we can usually retrieve
the values within this data element much more quickly. However, if we have to
change the value in this data element, we need to change it in every table
where it resides. If Bob Jones appears in five different tables, and Bob would
prefer to be called “Robert”, we will need to change “Bob Jones” to “Robert
Jones” in all five tables, which takes longer than making this change to just
one table.
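As a rough illustration of this update cost, here is a minimal Python sketch (the table and column names are hypothetical, not taken from any model in this article) that counts the writes needed to rename a customer whose name is repeated across five denormalized tables:

# Illustrative sketch only: five hypothetical denormalized "tables", each
# repeating the customer's first name.
denormalized_tables = {
    "ORDER":    [{"customer_first_name": "Bob", "customer_last_name": "Jones"}],
    "SHIPMENT": [{"customer_first_name": "Bob", "customer_last_name": "Jones"}],
    "INVOICE":  [{"customer_first_name": "Bob", "customer_last_name": "Jones"}],
    "PAYMENT":  [{"customer_first_name": "Bob", "customer_last_name": "Jones"}],
    "RETURN":   [{"customer_first_name": "Bob", "customer_last_name": "Jones"}],
}

def rename_customer(tables, old_name, new_name):
    # Every copy of the repeated data element must be updated, one write
    # per table in which it appears.
    writes = 0
    for rows in tables.values():
        for row in rows:
            if row["customer_first_name"] == old_name:
                row["customer_first_name"] = new_name
                writes += 1
    return writes

print(rename_customer(denormalized_tables, "Bob", "Robert"))  # prints 5
# In a fully normalized design the first name lives in one CUSTOMER table,
# so the same change is a single write.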
Sometimes even read
performance can suffer.
We denormalize to increase read or retrieval performance. Yet if too many data
elements are denormalized into a single entity, each record length can get
very large and there is the potential that a single record might span a
database block size, which is the unit of contiguous storage defined within
the database. If a record is longer than a block size, it could mean that
retrieval time will take much longer because now some of the information the
user requests will be in one block, and the rest of the information could be
in a different part of the disk, taking significantly more time to retrieve.
A Shipment entity I encountered recently suffered from this problem.
You may end up with too much redundant
data. Let's say the CUSTOMER
LAST NAME data element takes up 30 characters. Repeating this data element
three times means we are now using 90 instead of 30 characters. In a table
with a small number of records, or where the duplicated data elements are fairly
short, this extra storage space will not be substantial. However, in
tables with millions of rows, every character could require megabytes of
additional space.
It may mask a lack of understanding.
The performance and storage implications of denormalizing are very database-
and technology-specific. Not fully understanding the data elements within a
design, however, is more of a functional and business concern, with
potentially much worse consequences. We should never denormalize without first
normalizing. When we normalize, we increase our understanding of the
relationships between the data elements. We need this understanding in order
to know where to denormalize. If we just go straight to a denormalized design,
we could make very poor design decisions that could require complete system
rewrites soon after going into production. I once reviewed the design for an
online phone directory, where all of the data elements for the entire design
were denormalized into a single table. On the surface, the table looked like
it was properly analyzed and contained a fairly accurate primary key. However,
I started grilling the designer with specific questions about his online phone
directory design:
“What if an employee has two home phone numbers?”
“How can we store more than one email address for the same employee?”
“Can two employees share the same work phone number?”
After receiving
a blank stare from the designer, I realized that denormalization was applied
before fully normalizing, and therefore, there was a significant lack of
understanding of the relationships between the data elements.
It might introduce data quality problems.
By having the same data element multiple times in our design, we substantially
increase the opportunities for data quality issues. If we update Bob's first name
from Bob to Robert in four of the five places it occurs, we have
created a data quality issue.
Being aware of these potential dangers of
denormalization encourages us to make denormalization decisions very selectively.
We need to have a full understanding of the pros and cons of each opportunity we
have to denormalize. This is where the Denormalization Survival Guide becomes a
very important tool. The Denormalization Survival Guide will help us make the
right denormalization decisions, so that our designs can survive the test of
time and minimize the chances of these bleak situations occurring.
The Denormalization Survival
Guide is a question-and-answer approach to determining where to denormalize your
logical data model. The Survival Guide contains a series of questions in several
different categories that need to be asked for each relationship on our model.
There are point values associated with answers to each question. By adding up
these points, we can determine whether to denormalize each specific relationship.
If our score is 10 or more after summing the individual scores, we will
denormalize the relationship. If our score is less than 10, we will keep the
relationship normalized and intact.
When you are done asking
these questions for each relationship, you will have a physical data model at
the appropriate level of denormalization.
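As a minimal sketch of that scoring rule (my own illustration, not code from the book), the point values chosen for a relationship are simply summed and compared against the threshold of 10; the example answers below are hypothetical but use point values from the options discussed in Part 2:

DENORMALIZATION_THRESHOLD = 10

def decide(answer_points):
    # answer_points: the point value chosen for each Survival Guide question.
    total = sum(answer_points)
    return "denormalize" if total >= DENORMALIZATION_THRESHOLD else "keep normalized"

# Hypothetical relationship whose six answers scored 20, -10, 20, 30, 20, 20:
print(decide([20, -10, 20, 30, 20, 20]))     # total 100 -> "denormalize"
print(decide([-20, -10, 20, -30, 20, -20]))  # total -40 -> "keep normalized"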
There are two main purposes to using the
Denormalization Survival Guide:
Maximizing the benefits of
denormalization and minimizing the costs.
Whenever we denormalize, we potentially gain something but also potentially
lose something. The purpose of the Denormalization Survival Guide is to make
sure we maximize our gains while minimizing our losses. Therefore, if the
guide dictates that we should denormalize Account into the Customer entity, it
will be because the benefits of denormalization outweigh the costs.
Providing consistent criteria for
denormalization. We are
applying the same set of questions and point values to each relationship.
Therefore, there is a consistency that we will use in deciding when to
denormalize. If two relationships have similar answers, either both will be
denormalized or both will remain normalized. This consistency leads to greater
reuse and understanding. In addition, because we will start to see the same
structures in multiple places, we will more easily be able to learn and
understand new structures. For example, if Customer and Account are
denormalized into a single entity, we will be able to understand this
denormalized Customer Account entity in other places it occurs.
This first article focused on the dangers of denormalization and introduced the Denormalization Survival Guide. The next article in this series will discuss the questions and answers that comprise the guide, including these questions when determining whether to denormalize a relationship in a data mart:
What is the type of relationship?
What is the participation ratio?
How many data elements are in the parent entity?
What is the usage ratio?
Is the parent entity a placeholder?
Is there a similar rate of change?
THE DENORMALIZATION SURVIVAL GUIDE - PART 2
Steve Hoberman - Steve Hoberman & Associates, LLC
This is the second of two
articles on the Denormalization Survival Guide, adapted from Chapter 8 of Steve
Hoberman's book, the Data Modeler's Workbench (You
can order this book through Amazon.com). The
previous article focused on the
dangers of denormalization and introduced the Denormalization Survival Guide,
which is a question-and-answer approach to applying denormalization to our
logical data models. This article will discuss the questions and answers that
comprise this guide when modeling a data mart.
We will go through each of the
questions in the Denormalization Survival Guide and introduce a handy template
for recording the scores for each question. The questions for a data mart are:
What is the type of relationship?
What is the participation ratio?
How many data elements are in the parent entity?
What is the usage ratio?
Is the parent entity a placeholder?
Is there a similar rate of change?
An important concept to fully
understand is that there is a parent and a child in every relationship on our
design. The parent is on the "one" side of the relationship, and the child is on
the "many" side of the relationship. For example, if a Customer can place many
Orders, Customer would be the parent and Order would be the child. Remember that
the entity containing the foreign key for the relationship is the child. Order
would contain the foreign key back to Customer, and therefore, Order is the
child. The terms parent and child appear quite often in our survival guide
questions.
In addition, the higher the
scores, the greater the chance we will be denormalizing the relationship. If our
score is 10 or more after summing the scores on each question, we will
denormalize the relationship. If our score is less than 10, we will keep the
relationship normalized and intact. I have found that the number 10 works for me
in distinguishing the normalization/denormalization border. Ten is really an
arbitrary number; after answering the questions, the total
scores are usually much higher or much lower than 10, so I could just as easily
have chosen 15 or 5. When you are
done asking these questions for each relationship, you will have a physical data
model at the appropriate level of denormalization. Let’s briefly discuss each
question.
What is the type of relationship?
This question addresses the type of
relationship we are analyzing. What relationship does the parent entity have to
the child entity? We choose one of three options: Hierarchy (worth 20 points),
Peer (worth -10 points), or Definition (worth -20 points). A hierarchy
relationship is when one entity contains, includes, or encompasses another
entity. For example, Order Header contains Order Lines. Peer is a relationship
where both child and parent entities have an independent existence. For example,
Sales Person services Customers. Definition is when the parent entity determines
the meaning of the child entity. Definition relationships are either
relationships from transaction to reference data, or associative entities. For
example, Customer Account is the associative entity between Customer and
Account. Hierarchy is worth the most points, meaning there is the greatest chance
for the entities participating in this relationship to be denormalized. This is
because in data marts the users often
select information from multiple levels of a hierarchy in the same query.
What is the participation ratio?
For a given parent entity value, roughly
how many child entity values would we have? The closer to a one-to-one
relationship between parent and child, the greater the chance we will
denormalize. This is because a relationship with a one-to-10 ratio will have, on
average, 20 percent of the redundancy of a relationship with a one-to-50
participation ratio. We need to choose one of three options for this question:
Up to a one-to-five ratio (worth 20 points), up to a one-to-100 ratio (worth –10
points), and over a one-to-100 ratio (worth –20 points).
How many data elements are in the parent entity?
This question addresses the redundancy in
terms of extra space from parent entity data elements we will need in the child
entity if we denormalize the parent entity into the child entity. If there are
fewer than 10 parent data elements, there will usually be a minimal
amount of redundancy on each child record. Thus, we get 20 points. If there are
between 10 and 20 data elements, we get –10 points. If there are more
than 20 data elements from the parent, we get –20 points. For this
question, also keep in mind the length of each of the data elements. For example,
adding 20 one-character data elements introduces less redundancy than adding
five 50-character data elements.
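A quick back-of-the-envelope calculation (an assumed illustration, not from the book) shows why the length of the parent data elements matters as much as their count:

def added_width(parent_element_lengths):
    # Characters added to every child record if these parent data elements
    # are denormalized into the child entity.
    return sum(parent_element_lengths)

print(added_width([1] * 20))   # twenty one-character elements -> 20 characters
print(added_width([50] * 5))   # five fifty-character elements -> 250 characters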
What is the usage ratio?
When users need information from the child,
do they often include information from the parent? Likewise, if users need
information from the parent, do they often include information from the child?
In other words, how tightly coupled or correlated are these two entities? If the
data elements from two entities will appear together on many user queries and
reports, the information will be retrieved more quickly if it is within the same
table instead of joining multiple tables. Because of the importance of this
question, I have assigned it more weight than the other questions. After all, if
very little correlation between two entities exists, why would we want to
denormalize them? Assign 30 points for a strong correlation with each
other and -30 points for a weak or no
correlation with each other.
Is the parent entity a placeholder?
Are we going to add more data elements or
relationships to the parent entity in the near future? If the answer is no (worth
20 points), then there is a stronger tendency to denormalize. This is because we
would not need to worry about the extra space and redundancy new parent data
elements would cause. Also, we do not have to worry about integrity and the
enforcement of new business rules against the parent. It also means that we will
not impact the child entity with parent entity changes, thereby minimizing
database maintenance problems. If the answer is yes (worth -20 points), then we
are going to add data elements in the near future and thus have a tendency to
keep the two entities separate.
Is there a similar rate of change?
This question addresses whether the two
entities have a similar number of inserts and updates within the same time
period. If one of the two entities changes very seldom, whereas the other entity
changes very often, there is a strong tendency to keep them normalized in
separate tables (worth -20 points). Separate tables can avoid performance and
data synchronization issues. If the two entities have roughly the same rate of
change, it would make more sense to denormalize them into the same entity;
therefore, we receive 20 points.
A very good way to keep the scores for each relationship is through a spreadsheet. See the following table. The first column contains each of the six questions. The second column contains the answers and their point values. The third column contains the actual points, and their sum appears at the end.
Denormalization Survival Guide Questions Template

Question | Options | Points
What is the type of relationship? | Hierarchy (20), Peer (–10), Definition (–20) |
What is the participation ratio? | 1-1 to 1-5 (20), 1-5 to 1-100 (–10), 1-100 and up (–20) |
How many data elements are in the parent entity? | Less than 10 (20), Between 10 and 20 (–10), More than 20 (–20) |
What is the usage ratio? | Strong correlation with each other (30), Weak or no correlation with each other (–30) |
Is the parent entity a placeholder? | Yes (–20), No (20) |
Is there a similar rate of change? | Same (20), Different (–20) |
TOTAL POINTS | |
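As an illustration of how a completed template totals up, here is a small Python sketch using the point values from the table above; the Customer-to-Order answers are hypothetical and chosen only to show the arithmetic, not a recommendation from the book.

# Point values from the Denormalization Survival Guide Questions Template.
GUIDE_POINTS = {
    "type of relationship":   {"Hierarchy": 20, "Peer": -10, "Definition": -20},
    "participation ratio":    {"1-1 to 1-5": 20, "1-5 to 1-100": -10, "1-100 and up": -20},
    "parent data elements":   {"Less than 10": 20, "Between 10 and 20": -10, "More than 20": -20},
    "usage ratio":            {"Strong correlation": 30, "Weak or no correlation": -30},
    "parent is placeholder":  {"Yes": -20, "No": 20},
    "similar rate of change": {"Same": 20, "Different": -20},
}

# Hypothetical answers for a Customer (parent) to Order (child) relationship.
customer_order_answers = {
    "type of relationship": "Peer",
    "participation ratio": "1-5 to 1-100",
    "parent data elements": "Between 10 and 20",
    "usage ratio": "Strong correlation",
    "parent is placeholder": "No",
    "similar rate of change": "Different",
}

total = sum(GUIDE_POINTS[q][a] for q, a in customer_order_answers.items())
print(total, "-> denormalize" if total >= 10 else "-> keep normalized")
# -10 + -10 + -10 + 30 + 20 + -20 = 0, which is below 10, so Customer and
# Order would stay in separate, normalized tables.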
This article discussed the
questions and answers that comprise the Denormalization Survival Guide when
designing a data mart:
What is the type of relationship?
What is the participation ratio?
How many data elements are in the parent entity?
What is the usage ratio?
Is the parent entity a placeholder?
Is there a similar rate of change?
Although you can use the
questions and point values discussed in this article exactly as they appear, you
might want to consider some customization. You might have different questions
you would like to ask or might assign different point values, depending on how
important each question is in relation to the others.
Steve Hoberman is an expert in the fields of data modeling and data warehousing. He is currently the Lead Data Warehouse Developer for Mars, Inc. He has been data modeling since 1990 across industries as diverse as telecommunications, finance, and manufacturing. Steve speaks regularly for the Data Warehousing Institute. He is the author of the Data Modeler's Workbench: Tools and Techniques for Analysis and Design. Steve specializes in data modeling training, design strategy, and in creating techniques to improve the data modeling process and deliverables. He enjoys reviewing data models and is the founder of Design Challenges, a discussion group which tackles complex data modeling scenarios. To learn more about his data model reviews and to add your email address to the Design Challenge distribution list, please visit his web site at http://www.stevehoberman.com/. He can be reached at me@stevehoberman.com or 973-586-8765.