Tim Peterson

Session #1 – The Structure of OLAP Data


The goal of this seminar is to show you how to rapidly develop an OLAP pilot project that brings data from a relational database into cubes. The goal of this first session is to describe the structure of OLAP data.

In the second session we will be describing how to find the OLAP structures in a relational database. But before we can do that we have to describe the OLAP structures themselves – what are we trying to create in an OLAP database?

If you haven’t ever browsed an OLAP cube, you need to experience what it’s like. Try one of the following:

1. If you have access to the Analysis Manager, open it up, and select the following in the Tree structure on the left – Analysis Services – the name of your Analysis Server – the sample database called FoodMart 2000 – the Cubes folder. Right-click the Sales cube and select Browse Data.

2. If you don’t have access to an Analysis Server through the Analysis Manager, but you do have the Office Web Components from Office 2000 or Office XP, you can get the tools for trying cube browsing at http://www.localcubetask.com/ocmsfile.htm

OLAP presents data in a different way. You need to be familiar with the look and feel of OLAP data. You also need to be familiar with some OLAP concepts. The main content of this session is a definition/description of 25 OLAP terms. Many of these concepts are needed by everybody who uses OLAP. Others are only important for people who are developing an OLAP system (like us).

If you’re just starting out with OLAP, try browsing a cube as you’re reading these definitions. You need to see how dimensions, levels, measures, etc. actually look to an OLAP end-user.

Here is my Top 25 List of OLAP Concepts. For each one I give a short-hand definition (which is highlighted) and a lengthier description.

OLAP
Browsing
Data Mining
Cube
Dimension
Level
Drilling Up
Drilling Down
Hierarchy
Member
Set
Member Property (Attribute)
Child Members
Slicing
Dicing
Tuple
Measure (Fact)
Cell
Current Member
Dimension Table
Fact Table
Star Schema
Snowflaking
MDX
Local Cube

OLAP (The Acronym) – OnLine Analytical Processing – The logical meaning of OLAP would seem to include any computer application that is used to analyze data, but nobody really uses it like that. OLAP is one of the worst acronyms I have ever seen – when you tell someone what the letters stand for, you still don’t have the slightest idea what the software actually looks like.

OLAP (Popular Definition) - A Million Spreadsheets in A Box – This is what you can tell people about OLAP who have never seen it before. OLAP is basically a spreadsheet tool – pretty powerful and flexible – but its basic purpose is to show spreadsheets. The key for OLAP is the ability to navigate to different views of the data. You don’t have to ask your technical people every time you want to see your data in a new way. Your OLAP tool allows you to move quickly and easily from one perspective to another. Do you see some information that interests you? With OLAP you can look at that data from a more detailed (or a more general) perspective.

OLAP (Technical Definition) - Fast, Interactive Browsing of Multidimensional, Multi-Level Data – 

OLAP always involves multiple dimensions, which should have multiple levels. (If you don’t have levels, your OLAP browsing doesn’t have much power.) OLAP has to be fast. 

OLAP is interactive. OLAP browsing is something done by a human analyst. As the analysts view the data, they can ask new questions of the data and receive immediate answers.

If you have a data analysis application that doesn’t return results of new queries (almost) immediately, you don’t have OLAP. How fast is fast enough? Less than 5 seconds is OK, but less than 1 second is a lot better. The value of Analysis Services is that it makes this fast query response possible. The data is organized in multidimensional structures, with some of the summary (aggregated) views in the data calculated ahead of time.

If I had my way, we would throw the term OLAP away and replace it with something like Interactive Spreadsheeting. Or maybe we should replace the term “OLAP Browsing” with “Data Browsing”. 

Data Mining – Computer Analysis of Data that is Not Interactive – You can use Analysis Services for both OLAP and for data mining. Both are methods of analyzing data. The big difference is that OLAP is used interactively. A person browses OLAP data to find the significant information. Data mining is a process where the computer analyzes data and then reports the significant results back to the analyst.

(NOTE: There is at least one Analysis Services client tool that allows a user to switch back and forth between OLAP and data mining. The user can start out with data mining and then analyze the significant findings with OLAP. Insights gained from OLAP browsing can then be used for additional data mining.)

Cube – A Multidimensional Way to Look at Data – The cube is the primary OLAP structure used to view data. It is analogous to a table in the relational database world. The term cube implies three dimensions, but OLAP folk have stretched the concept a little bit. An Analysis Services cube can have 128 dimensions (though it’s probably not a good idea to have that many).

Dimensions (Definition #1) - The Perspectives Used for Looking at the Data – Dimensions are the answer to the question “How do you want to see your data?” Here are some examples of dimensions – 

Product
Time
Store
Customer Age
Customer Income
Employee

Dimensions (Definition #2) - The Categories You Use for the Columns and Rows in the Spreadsheet - Do you want to see how different products have done in different time periods? Put the different products on the columns and the different time periods on the rows (or the other way around). How about looking at the age of customers who are shopping at different stores? Use a different row for each age group and a separate column for each store.

Dimensions Have Levels Which Are Used for Drilling Down and Drilling Up – If you just had dimensions, OLAP browsing would be kind of dull. The value of OLAP comes in having good levels for your dimensions. Levels let you see the general view of things and the detailed view of things. If you notice that sales are higher in a particular month, you will maybe want to drill down to see if they were higher in a particular part of the month. You also might want to drill up to a higher level, to see if the data patterns are valid on a wider scale.

Levels are Organized Into Hierarchies – We should really say that dimensions have hierarchies and it’s the hierarchies that have levels. But many dimensions have only one hierarchy, and when they do, that hierarchy can be ignored, for the most part. But there are times when a single dimension has more than one hierarchy – such as when the levels of a Time dimension financial are organized by Calendar Year and by Fiscal Year.

Note: Some OLAP tools don’t handle hierarchies very well. You might not see them in the one you’re using, or they might just appear as if they were separate dimensions.

Levels Have Members – the Labels for the Columns and Rows in the Spreadsheet – If you have a dimension called Time, one of the Levels might be called Month, and the members of the level Month would be the members January, February, March, etc. If you have a dimension called Product, you could have a level called Department, and the members of the level Department could be Hardware, Plumbing, Lumber, etc.

Drilling Down to Child Members – Drilling down is what you do when you’re looking at data at a particular level and you want to see more detailed data. When you drill down to see the details at the next lower level, the members you see are called child members. For example, if you drill down on Quarter 1, you will see the child members January, February, and March.

A whole variety of family terms are used to describe the relationship between members within a dimension – parent, sibling, cousin, descendants, ancestors, etc.

A Collection of Members is Called a Set - You nearly always want to see more than one column and more than one row in a spreadsheet. The group of members that is shown on the columns and the group of members that is shown on the rows are called sets. Sets can be defined in OLAP in a large number of different ways. Here are a few examples

1. By listing each member – {January, February, March}
2. By a family relationship – the children of Quarter 1 – [Quarter 1].Children
3. All the members of a level – Time.Month.Members
4. All the members of a dimension – Time.Members

Members Have Extra Information Called Member Properties (or Attributes) – You can display extra information about members of any level by creating member properties. An Employee level could have member properties such as Phone Number, Birth Date, and Hire Date. A City level could have a member property Population, which could be used to calculate per capita sales.

Note: Sometimes you have to make a choice between using some information to make a level or to make a member property. In FoodMart 2000, for example, there’s a dimension called Store. There is a member property for the Store Name level called Store Type. This member property could have been used to create a level, which could have been used in a second hierarchy, allowing store sales to be viewed by Store Type. When you are making decisions about creating OLAP structures (as in the assignment for the next session), you will have to make decisions like this.

OLAP Filtering Is Called Slicing – You can use a dimension for slicing when it isn’t being displayed on the columns or rows of the spreadsheet. You can slice on any member from any level of the dimension. If you slice on the member January in the Month level of the Time dimension, you will only see data from January.

Dimensions Can Be Combined For Dicing – You can put more than one dimension on the rows or on the columns. You will see one row for every combination of the members from each of the dimensions. If you put the children of Quarter 1 from the Time dimension and the members of the Product Family level of the Product dimension on the rows, you would have nine rows, as follows:

January Food
January Drink
January Non-Consumable
February Food
February Drink
February Non-Consumable
March Food
March Drink
March Non-Consumable

When you are cutting the data with one member it’s called slicing. When you cut your data with sets of members from two or more dimensions, it’s called dicing, like you’re cutting up vegetables from every direction. 

The Members Defining a Row (or a Column) Are Called a Tuple – In geometry, a point is defined by its x-y coordinates (x, y). You can think of x and y as being members of the x-dimension and the y-dimension. In OLAP, each row or column is defined by the members that are used for that row. The first row in the previous example is defined by the tuple (January, Food).

A Group of Tuples Are Called a Set – The 9 tuples in the example are also called a set. Before I said that a group of members make a set. That’s true, sort of, but it’s really a group of tuples that make a set. In the simplest case, those tuples only have a single member.

To summarize members, tuples, and sets:

The rows and the columns of an OLAP spreadsheet are always defined by a set of tuples.
Each row or column is defined by a single tuple.
Each tuple is defined by a single member from one or more dimensions. In a simple case, when only a single dimension is used for a row or column, the tuple has only a single member.

(NOTE: You don’t really have to understand the concepts of tuples and sets when you’re first starting with OLAP, but they’re very important when creating calculations, so you might as well start thinking about them. If you don’t see the point of these concepts at first, don’t worry. But if you start hearing about them now, you’ll have an easier time when creating calculations a little later.)

The Numbers Are Called Measures (or Facts) – The numbers in the OLAP spreadsheet are called measures. When setting up OLAP cubes, these values are also often called facts. Typical measures or facts would be:

Sales Dollars
Sales Count
Profit
Hours of Work

The Numbers are Displayed in the Cells of a Cube – The spaces in the OLAP spreadsheet are called cells (like the cells in an Excel spreadsheet).

Each Cell Has One (And Only One!) Current Member From Every Dimension – Every cell in a cube is defined by one member from every dimension in the cube. That one member is called the current member for that dimension. The current dimension for each dimension is determined as follows:

1. If the dimension is used on the columns or on the rows, the current member is the member for that dimension used in the tuple defining the column or row.
2. If the dimension is used for slicing, the current member is the member used in the slice.
3. If the dimension is not used for columns, rows, or slicing, the current member is the default member for the dimension. The default member is often the All level member. When the All level member is used, the practical effect is to ignore that particular dimension in the display of the data.

Dimension Information Is Stored In Dimension Tables – The information used for OLAP dimensions is contained in relational database tables which are often called dimension tables. Dimension tables have the following types of fields:

1. Key fields, used to join the dimension tables to the fact table (and to other dimension tables when snowflaking).
2. Level name fields, used to store the member names for the levels. For example, the Time dimension table could have a field called Month, which would have values such as January, February, March, etc.
3. Level Order Key fields, used to store integer values used to order the members of the levels (if necessary). For example, the Time dimension table could have a field called Month Order Key, which could have a value of 1 for January, 2 for February, 3 for March, etc.
4. Member Property fields, used to store the member property information. A Time dimension could have a field called Day Count, which would store the number of days for each month.

Information for the Measures Is Stored in the Fact Table – The information used for the measures is contained in a relational database table which is called the fact table. Whenever you make a new cube in the Analysis Manager, the first question you’re asked is to identify the fact table. Fact tables have the following types of fields:

1. Key fields connecting the fact table to a dimension table for each of the dimensions. Every fact must be connected to one particular member from the lowest level of every one of the dimensions in the cube.
2. Measure fields, containing the numeric values used for the measures.

The Star Schema – A Multidimensional Data Structure in a Relational Database – The combination of a fact table with its dimension tables is called a star schema. The fact table is at the center of the star, while each dimension table represents a point of the star.

Snowflaking the Star Schema (The Snowflake Schema) – The basic star schema has a simple structure where all the information for each dimension is contained in a single table. Snowflaking is the process of dividing information for one dimension among two or more tables. The resulting data structure is often called a snowflake schema. (I actually prefer to call this structure a star schema that has some snowflaking, but I think my preference for this terminology is a minority opinion.) 

Analysis Services comes with a sample database called FoodMart 2000. This OLAP database is based on a Microsoft Access database, which is located, by default, at C:\Program Files\Microsoft Analysis Services\Samples\foodmart 2000.mdb. Take a look at this database to see dimension tables and fact tables. 

Right click on the Sales cube from the FoodMart 2000 database in the Analysis Manager and select Edit. Switch to the Schema tab. You will see a star schema that has snowflaking in the Product dimension.

MDX is the Querying Language for OLAP – A special language has been developed for OLAP. This language looks similar to SQL, but it has many unique features. MDX is used for several purposes:

1. To create calculated members – members which are not in the data, but can be derived from the existing members and measures.
2. To retrieve data from a cube.
3. To create local cube files.
4. To create calculated sets.
5. To define security rules for accessing data.
6. To create actions

Local Cubes Are Files That Can Be Used For Disconnected Access to Cube Data – OLAP cubes are stored in an Analysis Server. Applications can browse these cubes by connecting to the Analysis Server. But cube data can also be put into local cube files, which can be used when users do not have a connection to the Analysis Server. Browsing a local cube file appears to the users to be the same thing as browsing a cube stored in an Analysis Server. I think local cubes are particularly useful in a pilot project, when everyone is not yet hooked up to an Analysis Server.

The OLAP 25 

Those are my top 25 OLAP concepts. If you understand those terms, you’re a long way toward understanding how OLAP data is structured and used. 

Finding the Measures, Dimensions, Levels, and Member Properties

The next step is finding measures, dimensions, and levels in a relational database. It’s easy to build cubes if you already have your data organized into a star schema, as is the case with the FoodMart 2000 database. If you have your data in a normal relational database, it’s not so obvious what should be used for measures, dimensions, levels, and member properties.

There are two ways of deciding on the structures that you are going to include in your cubes:

1. Deciding on OLAP structures by looking at the spreadsheets that are currently being used.
2. Deciding on OLAP structures by looking at the elements that are available in the source data.

Both strategies are important. If the current spreadsheets are being used, the information in them should probably be included in the OLAP cubes. But the source data may have additional information which has been ignored in existing reports. When you’re building an OLAP system, you have the opportunity to give your users more of the data than they’ve seen before.

Session #2 of this seminar is going to teach you how to find OLAP structures by looking in a relational database. The Preparation Questions for Session #2 are going to get you started on this process. 

Feedback 

So what do you think?

Do you have questions that are raised by what I have written?

Do you disagree with anything I have written?

Are there some points that should receive more emphasis?

Can you think of a more effective way to define or explain any of the concepts I have defined?

Are there some important concepts that I should have mentioned, but didn’t?