The Data Administration Newsletter (TDAN.com)
THE DENORMALIZATION SURVIVAL GUIDE - PART 1
Steve Hoberman - Steve Hoberman & Associates, LLC
This is the first of two articles on the Denormalization Survival Guide, adapted from Chapter 8 of Steve Hoberman's book, the Data Modeler's Workbench (you can order this book through Amazon.com). This first article focuses on the dangers of denormalization and introduces the Denormalization Survival Guide, a question-and-answer approach to applying denormalization to our data models. The second article will discuss the questions and answers that comprise this guide when modeling a data mart.
In her book Handbook of Relational
Database Design, Barbara von Halle offers a great definition of
denormalization: “Denormalization is the process whereby, after defining a
stable, fully normalized data structure, you selectively introduce duplicate
data to facilitate specific performance requirements.” Denormalization is
the process of combining data elements from different entities. By doing so, we
lose the benefits of normalization and, therefore, reintroduce redundancy into
the design. This extra redundancy can help improve retrieval time performance.
Reducing retrieval time is the primary reason for denormalizing.
My favorite word in von Halle's definition is “selectively”.
We have to be very careful and selective where we introduce denormalization
because it can come with a huge price even though it can decrease retrieval
time. The price for denormalizing can take the form of these bleak situations:
Update, delete, and insert performance can suffer.
When we repeat a data element in two or more tables, we can usually retrieve
the values within this data element much more quickly. However, if we have to
change the value in this data element, we need to change it in every table
where it resides. If Bob Jones appears in five different tables, and Bob would
prefer to be called “Robert”, we will need to change “Bob Jones” to “Robert
Jones” in all five tables, which takes longer than making this change to just
one table.
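As a rough illustration of this update cost, here is a minimal Python sketch (the table and column names are hypothetical, not taken from any model in this article) that counts the writes needed to rename a customer whose name is repeated across five denormalized tables:

# Illustrative sketch only: five hypothetical denormalized "tables", each
# repeating the customer's first name.
denormalized_tables = {
    "ORDER":    [{"customer_first_name": "Bob", "customer_last_name": "Jones"}],
    "SHIPMENT": [{"customer_first_name": "Bob", "customer_last_name": "Jones"}],
    "INVOICE":  [{"customer_first_name": "Bob", "customer_last_name": "Jones"}],
    "PAYMENT":  [{"customer_first_name": "Bob", "customer_last_name": "Jones"}],
    "RETURN":   [{"customer_first_name": "Bob", "customer_last_name": "Jones"}],
}

def rename_customer(tables, old_name, new_name):
    # Every copy of the repeated data element must be updated, one write
    # per table in which it appears.
    writes = 0
    for rows in tables.values():
        for row in rows:
            if row["customer_first_name"] == old_name:
                row["customer_first_name"] = new_name
                writes += 1
    return writes

print(rename_customer(denormalized_tables, "Bob", "Robert"))  # prints 5
# In a fully normalized design the first name lives in one CUSTOMER table,
# so the same change is a single write.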
Sometimes even read
performance can suffer.
We denormalize to increase read or retrieval performance. Yet if too many data
elements are denormalized into a single entity, each record length can get
very large and there is the potential that a single record might span a
database block size, which is the unit of contiguous storage defined within
the database. If a record is longer than a block size, it could mean that
retrieval time will take much longer because now some of the information the
user requests will be in one block, and the rest of the information could be
in a different part of the disk, taking significantly more time to retrieve.
A Shipment entity I encountered recently suffered from this problem.
You may end up with too much redundant
data. Let's say the CUSTOMER
LAST NAME data element takes up 30 characters. Repeating this data element
three times means we are now using 90 instead of 30 characters. In a table
with a small number of records, or where the duplicated data elements are fairly
short, this extra storage space will not be substantial. However, in
tables with millions of rows, every character could require megabytes of
additional space.
It may mask a lack of understanding.
The performance and storage implications of denormalizing are very database-
and technology-specific. Not fully understanding the data elements within a
design, however, is more of a functional and business concern, with
potentially much worse consequences. We should never denormalize without first
normalizing. When we normalize, we increase our understanding of the
relationships between the data elements. We need this understanding in order
to know where to denormalize. If we just go straight to a denormalized design,
we could make very poor design decisions that could require complete system
rewrites soon after going into production. I once reviewed the design for an
online phone directory, where all of the data elements for the entire design
were denormalized into a single table. On the surface, the table looked like
it was properly analyzed and contained a fairly accurate primary key. However,
I started grilling the designer with specific questions about his online phone
directory design:
“What if an employee has two home phone numbers?”
“How can we store more than one email address for the same employee?”
“Can two employees share the same work phone number?”
After receiving
a blank stare from the designer, I realized that denormalization was applied
before fully normalizing, and therefore, there was a significant lack of
understanding of the relationships between the data elements.
It might introduce data quality problems.
By having the same data element multiple times in our design, we substantially
increase the opportunities for data quality issues. If we update Bob's first name
from Bob to Robert in four of the five places it occurs, we have
created a data quality issue.
Being aware of these potential dangers of
denormalization encourages us to make denormalization decisions very selectively.
We need to have a full understanding of the pros and cons of each opportunity we
have to denormalize. This is where the Denormalization Survival Guide becomes a
very important tool. The Denormalization Survival Guide will help us make the
right denormalization decisions, so that our designs can survive the test of
time and minimize the chances of these bleak situations occurring.
The Denormalization Survival
Guide is a question-and-answer approach to determining where to denormalize your
logical data model. The Survival Guide contains a series of questions in several
different categories that need to be asked for each relationship on our model.
There are point values associated with answers to each question. By adding up
these points, we can determine whether to denormalize each specific relationship.
If our score is 10 or more after summing the individual scores, we will
denormalize the relationship. If our score is less than 10, we will keep the
relationship normalized and intact.
When you are done asking
these questions for each relationship, you will have a physical data model at
the appropriate level of denormalization.
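As a minimal sketch of that scoring rule (my own illustration, not code from the book), the point values chosen for a relationship are simply summed and compared against the threshold of 10; the example answers below are hypothetical but use point values from the options discussed in Part 2:

DENORMALIZATION_THRESHOLD = 10

def decide(answer_points):
    # answer_points: the point value chosen for each Survival Guide question.
    total = sum(answer_points)
    return "denormalize" if total >= DENORMALIZATION_THRESHOLD else "keep normalized"

# Hypothetical relationship whose six answers scored 20, -10, 20, 30, 20, 20:
print(decide([20, -10, 20, 30, 20, 20]))     # total 100 -> "denormalize"
print(decide([-20, -10, 20, -30, 20, -20]))  # total -40 -> "keep normalized"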
There are two main purposes to using the
Denormalization Survival Guide:
Maximizing the benefits of
denormalization and minimizing the costs.
Whenever we denormalize, we potentially gain something but also potentially
lose something. The purpose of the Denormalization Survival Guide is to make
sure we maximize our gains while minimizing our losses. Therefore, if the
guide dictates that we should denormalize Account into the Customer entity, it
will be because the benefits of denormalization outweigh the costs.
Providing consistent criteria for
denormalization. We are
applying the same set of questions and point values to each relationship.
Therefore, there is a consistency that we will use in deciding when to
denormalize. If two relationships have similar answers, either both will be
denormalized or both will remain normalized. This consistency leads to greater
reuse and understanding. In addition, because we will start to see the same
structures in multiple places, we will more easily be able to learn and
understand new structures. For example, if Customer and Account are
denormalized into a single entity, we will be able to understand this
denormalized Customer Account entity in other places it occurs.
This first article focused on the dangers of denormalization and introduced the Denormalization Survival Guide. The next article in this series will discuss the questions and answers that comprise the guide, including these questions when determining whether to denormalize a relationship in a data mart:
What is the type of relationship?
What is the participation ratio?
How many data elements are in the parent entity?
What is the usage ratio?
Is the parent entity a placeholder?
Is there a similar rate of change?
THE DENORMALIZATION SURVIVAL GUIDE - PART 2
Steve Hoberman - Steve Hoberman & Associates, LLC
This is the second of two
articles on the Denormalization Survival Guide, adapted from Chapter 8 of Steve
Hoberman's book, the Data Modeler's Workbench (You
can order this book through Amazon.com). The
previous article focused on the
dangers of denormalization and introduced the Denormalization Survival Guide,
which is a question-and-answer approach to applying denormalization to our
logical data models. This article will discuss the questions and answers that
comprise this guide when modeling a data mart.
We will go through each of the
questions in the Denormalization Survival Guide and introduce a handy template
for recording the scores for each question. The questions for a data mart are:
What is the type of relationship?
What is the participation ratio?
How many data elements are in the parent entity?
What is the usage ratio?
Is the parent entity a placeholder?
Is there a similar rate of change?
An important concept to fully
understand is that there is a parent and a child in every relationship on our
design. The parent is on the "one" side of the relationship, and the child is on
the "many" side of the relationship. For example, if a Customer can place many
Orders, Customer would be the parent and Order would be the child. Remember that
the entity containing the foreign key for the relationship is the child. Order
would contain the foreign key back to Customer, and therefore, Order is the
child. The terms parent and child appear quite often in our survival guide
questions.
In addition, the higher the
scores, the greater the chance we will be denormalizing the relationship. If our
score is 10 or more after summing the scores on each question, we will
denormalize the relationship. If our score is less than 10, we will keep the
relationship normalized and intact. I have found that the number 10 works for me
in distinguishing the normalization/denormalization border. Ten is really an
arbitrary number; after answering the questions, the total
scores are usually much higher or much lower than 10, so I could just as easily
have chosen 15 or 5. When you are
done asking these questions for each relationship, you will have a physical data
model at the appropriate level of denormalization. Let’s briefly discuss each
question.
What is the type of relationship?
This question addresses the type of
relationship we are analyzing. What relationship does the parent entity have to
the child entity? We choose one of three options: Hierarchy (worth 20 points),
Peer (worth -10 points), or Definition (worth -20 points). A hierarchy
relationship is when one entity contains, includes, or encompasses another
entity. For example, Order Header contains Order Lines. Peer is a relationship
where both child and parent entities have an independent existence. For example,
Sales Person services Customers. Definition is when the parent entity determines
the meaning of the child entity. Definition relationships are either
relationships from transaction to reference data, or associative entities. For
example, Customer Account is the associative entity between Customer and
Account. Hierarchy is worth the most points, meaning there is the greatest chance
for the entities participating in this relationship to be denormalized. This is
because in data marts the users often
select information from multiple levels of a hierarchy in the same query.
What is the participation ratio?
For a given parent entity value, roughly
how many child entity values would we have? The closer to a one-to-one
relationship between parent and child, the greater the chance we will
denormalize. This is because a relationship with a one-to-10 ratio will have, on
average, 20 percent of the redundancy of a relationship with a one-to-50
participation ratio. We need to choose one of three options for this question:
Up to a one-to-five ratio (worth 20 points), up to a one-to-100 ratio (worth –10
points), and over a one-to-100 ratio (worth –20 points).
How many data elements are in the parent entity?
This question addresses the redundancy in
terms of extra space from parent entity data elements we will need in the child
entity if we denormalize the parent entity into the child entity. If there are
fewer than 10 parent data elements, there will usually be a minimal
amount of redundancy on each child record. Thus, we get 20 points. If there are
between 10 and 20 data elements, we get –10 points. If there are more
than 20 data elements from the parent, we get –20 points. For this
question, also keep in mind the length of each of the data elements. For example,
adding 20 one-character data elements introduces less redundancy than adding
five 50-character data elements.
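A quick back-of-the-envelope calculation (an assumed illustration, not from the book) shows why the length of the parent data elements matters as much as their count:

def added_width(parent_element_lengths):
    # Characters added to every child record if these parent data elements
    # are denormalized into the child entity.
    return sum(parent_element_lengths)

print(added_width([1] * 20))   # twenty one-character elements -> 20 characters
print(added_width([50] * 5))   # five fifty-character elements -> 250 characters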
What is the usage ratio?
When users need information from the child,
do they often include information from the parent? Likewise, if users need
information from the parent, do they often include information from the child?
In other words, how tightly coupled or correlated are these two entities? If the
data elements from two entities will appear together on many user queries and
reports, the information will be retrieved more quickly if it is within the same
table instead of joining multiple tables. Because of the importance of this
question, I have assigned it more weight than the other questions. After all, if
very little correlation between two entities exists, why would we want to
denormalize them? Assign 30 points for a strong correlation with each
other and -30 points for a weak or no
correlation with each other.
Is the parent entity a placeholder?
Are we going to add more data elements or
relationships to the parent entity in the near future? If the answer is no (worth
20 points), then there is a stronger tendency to denormalize. This is because we
would not need to worry about the extra space and redundancy new parent data
elements would cause. Also, we do not have to worry about integrity and the
enforcement of new business rules against the parent. It also means that we will
not impact the child entity with parent entity changes, thereby minimizing
database maintenance problems. If the answer is yes (worth -20 points), then we
are going to add data elements in the near future and thus have a tendency to
keep the two entities separate.
Is there a similar rate of change?
This question addresses whether the two
entities have a similar number of inserts and updates within the same time
period. If one of the two entities changes very seldom, whereas the other entity
changes very often, there is a strong tendency to keep them normalized in
separate tables (worth -20 points). Separate tables can avoid performance and
data synchronization issues. If the two entities have roughly the same rate of
change, it would make more sense to denormalize them into the same entity;
therefore, we receive 20 points.
A very good way to keep the scores for each relationship is through a spreadsheet. See the following table. The first column contains each of the six questions. The second column contains the answers and their point values. The third column contains the actual points, and their sum appears at the end.
Denormalization Survival Guide Questions Template

Question | Options | Points
What is the type of relationship? | Hierarchy (20), Peer (–10), Definition (–20) |
What is the participation ratio? | 1-1 to 1-5 (20), 1-5 to 1-100 (–10), 1-100 and up (–20) |
How many data elements are in the parent entity? | Less than 10 (20), Between 10 and 20 (–10), More than 20 (–20) |
What is the usage ratio? | Strong correlation with each other (30), Weak or no correlation with each other (–30) |
Is the parent entity a placeholder? | Yes (–20), No (20) |
Is there a similar rate of change? | Same (20), Different (–20) |
TOTAL POINTS | |
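As an illustration of how a completed template totals up, here is a small Python sketch using the point values from the table above; the Customer-to-Order answers are hypothetical and chosen only to show the arithmetic, not a recommendation from the book.

# Point values from the Denormalization Survival Guide Questions Template.
GUIDE_POINTS = {
    "type of relationship":   {"Hierarchy": 20, "Peer": -10, "Definition": -20},
    "participation ratio":    {"1-1 to 1-5": 20, "1-5 to 1-100": -10, "1-100 and up": -20},
    "parent data elements":   {"Less than 10": 20, "Between 10 and 20": -10, "More than 20": -20},
    "usage ratio":            {"Strong correlation": 30, "Weak or no correlation": -30},
    "parent is placeholder":  {"Yes": -20, "No": 20},
    "similar rate of change": {"Same": 20, "Different": -20},
}

# Hypothetical answers for a Customer (parent) to Order (child) relationship.
customer_order_answers = {
    "type of relationship": "Peer",
    "participation ratio": "1-5 to 1-100",
    "parent data elements": "Between 10 and 20",
    "usage ratio": "Strong correlation",
    "parent is placeholder": "No",
    "similar rate of change": "Different",
}

total = sum(GUIDE_POINTS[q][a] for q, a in customer_order_answers.items())
print(total, "-> denormalize" if total >= 10 else "-> keep normalized")
# -10 + -10 + -10 + 30 + 20 + -20 = 0, which is below 10, so Customer and
# Order would stay in separate, normalized tables.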
This article discussed the
questions and answers that comprise the Denormalization Survival Guide when
designing a data mart:
What is the type of relationship?
What is the participation ratio?
How many data elements are in the parent entity?
What is the usage ratio?
Is the parent entity a placeholder?
Is there a similar rate of change?
Although you can use the
questions and point values discussed in this article exactly as they appear, you
might want to consider some customization. You might have different questions
you would like to ask or might assign different point values, depending on how
important each question is in relation to the others.
Steve Hoberman is an expert in the fields of data modeling and data warehousing. He is currently the Lead Data Warehouse Developer for Mars, Inc. He has been data modeling since 1990 across industries as diverse as telecommunications, finance, and manufacturing. Steve speaks regularly for the Data Warehousing Institute. He is the author of the Data Modeler's Workbench: Tools and Techniques for Analysis and Design. Steve specializes in data modeling training, design strategy, and in creating techniques to improve the data modeling process and deliverables. He enjoys reviewing data models and is the founder of Design Challenges, a discussion group which tackles complex data modeling scenarios. To learn more about his data model reviews and to add your email address to the Design Challenge distribution list, please visit his web site at http://www.stevehoberman.com/. He can be reached at me@stevehoberman.com or 973-586-8765.