Tim Peterson
Session #1 – The Structure of OLAP Data
The goal of this seminar is to show you how to rapidly develop an OLAP pilot
project that brings data from a relational database into cubes. The goal of this
first session is to describe the structure of OLAP data.
In the second session we will be describing how to find the OLAP structures in a
relational database. But before we can do that we have to describe the OLAP
structures themselves – what are we trying to create in an OLAP database?
If you haven’t ever browsed an OLAP cube, you need to experience what it’s like.
Try one of the following:
1. If you have access to the Analysis Manager, open it up, and select the
following in the Tree structure on the left – Analysis Services – the name of
your Analysis Server – the sample database called FoodMart 2000 – the Cubes
folder. Right-click the Sales cube and select Browse Data.
2. If you don’t have access to an Analysis Server through the Analysis Manager,
but you do have the Office Web Components from Office 2000 or Office XP, you can
get the tools for trying cube browsing at http://www.localcubetask.com/ocmsfile.htm
OLAP presents data in a different way. You need to be familiar with the look and
feel of OLAP data. You also need to be familiar with some OLAP concepts. The
main content of this session is a definition/description of 25 OLAP terms. Many
of these concepts are needed by everybody who uses OLAP. Others are only
important for people who are developing an OLAP system (like us).
If you’re just starting out with OLAP, try browsing a cube as you’re reading
these definitions. You need to see how dimensions, levels, measures, etc.
actually look to an OLAP end-user.
Here is my Top 25 List of OLAP Concepts. For each one I give a short-hand
definition (which is highlighted) and a lengthier description.
OLAP
Browsing
Data Mining
Cube
Dimension
Level
Drilling Up
Drilling Down
Hierarchy
Member
Set
Member Property (Attribute)
Child Members
Slicing
Dicing
Tuple
Measure (Fact)
Cell
Current Member
Dimension Table
Fact Table
Star Schema
Snowflaking
MDX
Local Cube
OLAP (The Acronym) – OnLine Analytical Processing
– The logical meaning of OLAP would seem to include any computer application
that is used to analyze data, but nobody really uses it like that. OLAP is one
of the worst acronyms I have ever seen – when you tell someone what the letters
stand for, you still don’t have the slightest idea what the software actually
looks like.
OLAP (Popular Definition) - A Million Spreadsheets in A Box – This
is what you can tell people about OLAP who have never seen it before. OLAP is
basically a spreadsheet tool – pretty powerful and flexible – but its basic
purpose is to show spreadsheets. The key for OLAP is the ability to navigate to
different views of the data. You don’t have to ask your technical people every
time you want to see your data in a new way. Your OLAP tool allows you to move
quickly and easily from one perspective to another. Do you see some information
that interests you? With OLAP you can look at that data from a more detailed (or
a more general) perspective.
OLAP (Technical Definition) - Fast, Interactive Browsing of
Multidimensional, Multi-Level Data –
OLAP always involves multiple dimensions, which should have multiple levels. (If
you don’t have levels, your OLAP browsing doesn’t have much power.) OLAP has to
be fast.
OLAP is interactive. OLAP browsing is something done by a human analyst. As the
analysts view the data, they can ask new questions of the data and receive
immediate answers.
If you have a data analysis application that doesn’t return results of new
queries (almost) immediately, you don’t have OLAP. How fast is fast enough? Less
than 5 seconds is OK, but less than 1 second is a lot better. The value of
Analysis Services is that it makes this fast query response possible. The data
is organized in multidimensional structures, with some of the summary (aggregated)
views in the data calculated ahead of time.
If I had my way, we would throw the term OLAP away and replace it with something
like Interactive Spreadsheeting. Or maybe we should replace the term “OLAP
Browsing” with “Data Browsing”.
Data Mining – Computer Analysis of Data that is Not Interactive
– You can use Analysis Services for both OLAP and for data mining. Both are
methods of analyzing data. The big difference is that OLAP is used interactively.
A person browses OLAP data to find the significant information. Data mining is a
process where the computer analyzes data and then reports the significant
results back to the analyst.
(NOTE: There is at least one Analysis Services client tool that allows a user to
switch back and forth between OLAP and data mining. The user can start out with
data mining and then analyze the significant findings with OLAP. Insights gained
from OLAP browsing can then be used for additional data mining.)
Cube – A Multidimensional Way to Look at Data – The cube is the
primary OLAP structure used to view data. It is analogous to a table in the
relational database world. The term cube implies three dimensions, but OLAP folk
have stretched the concept a little bit. An Analysis Services cube can have 128
dimensions (though it’s probably not a good idea to have that many).
Dimensions (Definition #1) - The Perspectives Used for Looking
at the Data – Dimensions are the answer to the question “How do you want to
see your data?” Here are some examples of dimensions –
Product
Time
Store
Customer Age
Customer Income
Employee
Dimensions (Definition #2) - The Categories You Use for the
Columns and Rows in the Spreadsheet - Do you want to see how different
products have done in different time periods? Put the different products on the
columns and the different time periods on the rows (or the other way around).
How about looking at the age of customers who are shopping at different stores?
Use a different row for each age group and a separate column for each store.
Dimensions Have Levels Which Are Used for Drilling Down and
Drilling Up – If you just had dimensions, OLAP browsing would be kind of
dull. The value of OLAP comes in having good levels for your dimensions. Levels
let you see the general view of things and the detailed view of things. If you
notice that sales are higher in a particular month, you will maybe want to drill
down to see if they were higher in a particular part of the month. You also
might want to drill up to a higher level, to see if the data patterns are valid
on a wider scale.
Levels are Organized Into Hierarchies – We should really say that
dimensions have hierarchies and it’s the hierarchies that have levels. But many
dimensions have only one hierarchy, and when they do, that hierarchy can be
ignored, for the most part. But there are times when a single dimension has more
than one hierarchy – such as when the levels of a Time dimension financial are
organized by Calendar Year and by Fiscal Year.
Note: Some OLAP tools don’t handle hierarchies very well. You might not see them
in the one you’re using, or they might just appear as if they were separate
dimensions.
Levels Have Members – the Labels for the Columns and Rows in the
Spreadsheet – If you have a dimension called Time, one of the Levels might be
called Month, and the members of the level Month would be the members January,
February, March, etc. If you have a dimension called Product, you could have a
level called Department, and the members of the level Department could be
Hardware, Plumbing, Lumber, etc.
Drilling Down to Child Members – Drilling down is what you do when
you’re looking at data at a particular level and you want to see more detailed
data. When you drill down to see the details at the next lower level, the
members you see are called child members. For example, if you drill down on
Quarter 1, you will see the child members January, February, and March.
A whole variety of family terms are used to describe the relationship between
members within a dimension – parent, sibling, cousin, descendants, ancestors,
etc.
A Collection of Members is Called a Set - You nearly always
want to see more than one column and more than one row in a spreadsheet. The
group of members that is shown on the columns and the group of members that is
shown on the rows are called sets. Sets can be defined in OLAP in a large number
of different ways. Here are a few examples
1. By listing each member – {January, February, March}
2. By a family relationship – the children of Quarter 1 – [Quarter 1].Children
3. All the members of a level – Time.Month.Members
4. All the members of a dimension – Time.Members
Members Have Extra Information Called Member Properties (or
Attributes) – You can display extra information about members of any
level by creating member properties. An Employee level could have member
properties such as Phone Number, Birth Date, and Hire Date. A City level could
have a member property Population, which could be used to calculate per capita
sales.
Note: Sometimes you have to make a choice between using some information to make
a level or to make a member property. In FoodMart 2000, for example, there’s a
dimension called Store. There is a member property for the Store Name level
called Store Type. This member property could have been used to create a level,
which could have been used in a second hierarchy, allowing store sales to be
viewed by Store Type. When you are making decisions about creating OLAP
structures (as in the assignment for the next session), you will have to make
decisions like this.
OLAP Filtering Is Called Slicing – You can use a dimension for
slicing when it isn’t being displayed on the columns or rows of the spreadsheet.
You can slice on any member from any level of the dimension. If you slice on the
member January in the Month level of the Time dimension, you will only see data
from January.
Dimensions Can Be Combined For Dicing – You can put more than one
dimension on the rows or on the columns. You will see one row for every
combination of the members from each of the dimensions. If you put the children
of Quarter 1 from the Time dimension and the members of the Product Family level
of the Product dimension on the rows, you would have nine rows, as follows:
January Food
January Drink
January Non-Consumable
February Food
February Drink
February Non-Consumable
March Food
March Drink
March Non-Consumable
When you are cutting the data with one member it’s called slicing. When you cut
your data with sets of members from two or more dimensions, it’s called dicing,
like you’re cutting up vegetables from every direction.
The Members Defining a Row (or a Column) Are Called a Tuple – In
geometry, a point is defined by its x-y coordinates (x, y). You can think of x
and y as being members of the x-dimension and the y-dimension. In OLAP, each row
or column is defined by the members that are used for that row. The first row in
the previous example is defined by the tuple (January, Food).
A Group of Tuples Are Called a Set – The 9 tuples in the example
are also called a set. Before I said that a group of members make a set. That’s
true, sort of, but it’s really a group of tuples that make a set. In the
simplest case, those tuples only have a single member.
To summarize members, tuples, and sets:
The rows and the columns of an OLAP spreadsheet are always defined by a set of
tuples.
Each row or column is defined by a single tuple.
Each tuple is defined by a single member from one or more dimensions. In a
simple case, when only a single dimension is used for a row or column, the tuple
has only a single member.
(NOTE: You don’t really have to understand the concepts of tuples and sets when
you’re first starting with OLAP, but they’re very important when creating
calculations, so you might as well start thinking about them. If you don’t see
the point of these concepts at first, don’t worry. But if you start hearing
about them now, you’ll have an easier time when creating calculations a little
later.)
The Numbers Are Called Measures (or Facts) – The numbers in
the OLAP spreadsheet are called measures. When setting up OLAP cubes, these
values are also often called facts. Typical measures or facts would be:
Sales Dollars
Sales Count
Profit
Hours of Work
The Numbers are Displayed in the Cells of a Cube – The spaces in the OLAP
spreadsheet are called cells (like the cells in an Excel spreadsheet).
Each Cell Has One (And Only One!) Current Member From Every Dimension
– Every cell in a cube is defined by one member from every dimension in the cube.
That one member is called the current member for that dimension. The current
dimension for each dimension is determined as follows:
1. If the dimension is used on the columns or on the rows, the current member is
the member for that dimension used in the tuple defining the column or row.
2. If the dimension is used for slicing, the current member is the member used
in the slice.
3. If the dimension is not used for columns, rows, or slicing, the current
member is the default member for the dimension. The default member is often the
All level member. When the All level member is used, the practical effect is to
ignore that particular dimension in the display of the data.
Dimension Information Is Stored In Dimension Tables – The
information used for OLAP dimensions is contained in relational database tables
which are often called dimension tables. Dimension tables have the following
types of fields:
1. Key fields, used to join the dimension tables to the fact table (and to other
dimension tables when snowflaking).
2. Level name fields, used to store the member names for the levels. For example,
the Time dimension table could have a field called Month, which would have
values such as January, February, March, etc.
3. Level Order Key fields, used to store integer values used to order the
members of the levels (if necessary). For example, the Time dimension table
could have a field called Month Order Key, which could have a value of 1 for
January, 2 for February, 3 for March, etc.
4. Member Property fields, used to store the member property information. A Time
dimension could have a field called Day Count, which would store the number of
days for each month.
Information for the Measures Is Stored in the Fact Table – The
information used for the measures is contained in a relational database table
which is called the fact table. Whenever you make a new cube in the Analysis
Manager, the first question you’re asked is to identify the fact table. Fact
tables have the following types of fields:
1. Key fields connecting the fact table to a dimension table for each of the
dimensions. Every fact must be connected to one particular member from the
lowest level of every one of the dimensions in the cube.
2. Measure fields, containing the numeric values used for the measures.
The Star Schema – A Multidimensional Data Structure in a Relational
Database – The combination of a fact table with its dimension tables is
called a star schema. The fact table is at the center of the star, while each
dimension table represents a point of the star.
Snowflaking the Star Schema (The Snowflake Schema) – The
basic star schema has a simple structure where all the information for each
dimension is contained in a single table. Snowflaking is the process of dividing
information for one dimension among two or more tables. The resulting data
structure is often called a snowflake schema. (I actually prefer to call this
structure a star schema that has some snowflaking, but I think my preference for
this terminology is a minority opinion.)
Analysis Services comes with a sample database called FoodMart 2000. This OLAP
database is based on a Microsoft Access database, which is located, by default,
at C:\Program Files\Microsoft Analysis Services\Samples\foodmart 2000.mdb. Take
a look at this database to see dimension tables and fact tables.
Right click on the Sales cube from the FoodMart 2000 database in the Analysis
Manager and select Edit. Switch to the Schema tab. You will see a star schema
that has snowflaking in the Product dimension.
MDX is the Querying Language for OLAP – A special language has
been developed for OLAP. This language looks similar to SQL, but it has many
unique features. MDX is used for several purposes:
1. To create calculated members – members which are not in the data, but can be
derived from the existing members and measures.
2. To retrieve data from a cube.
3. To create local cube files.
4. To create calculated sets.
5. To define security rules for accessing data.
6. To create actions
Local Cubes Are Files That Can Be Used For Disconnected Access to Cube
Data – OLAP cubes are stored in an Analysis Server. Applications can browse
these cubes by connecting to the Analysis Server. But cube data can also be put
into local cube files, which can be used when users do not have a connection to
the Analysis Server. Browsing a local cube file appears to the users to be the
same thing as browsing a cube stored in an Analysis Server. I think local cubes
are particularly useful in a pilot project, when everyone is not yet hooked up
to an Analysis Server.
The OLAP 25
Those are my top 25 OLAP concepts. If you understand those terms, you’re a long
way toward understanding how OLAP data is structured and used.
Finding the Measures, Dimensions, Levels, and Member Properties
The next step is finding measures, dimensions, and levels in a relational
database. It’s easy to build cubes if you already have your data organized into
a star schema, as is the case with the FoodMart 2000 database. If you have your
data in a normal relational database, it’s not so obvious what should be used
for measures, dimensions, levels, and member properties.
There are two ways of deciding on the structures that you are going to include
in your cubes:
1. Deciding on OLAP structures by looking at the spreadsheets that are currently
being used.
2. Deciding on OLAP structures by looking at the elements that are available in
the source data.
Both strategies are important. If the current spreadsheets are being used, the
information in them should probably be included in the OLAP cubes. But the
source data may have additional information which has been ignored in existing
reports. When you’re building an OLAP system, you have the opportunity to give
your users more of the data than they’ve seen before.
Session #2 of this seminar is going to teach you how to find OLAP structures by
looking in a relational database. The
Preparation Questions
for Session #2 are going to get you started on this process.
Feedback
So what do you think?
Do you have questions that are raised by what I have written?
Do you disagree with anything I have written?
Are there some points that should receive more emphasis?
Can you think of a more effective way to define or explain any of the concepts I
have defined?
Are there some important concepts that I should have mentioned, but didn’t?