Acton Burnell White Paper: Meta Data Management

NOTE: This paper was prepared for an audience of data warehouse developers and users. It addresses meta data management in the context of a data warehouse environment.
Meta Data Management

Acton Burnell has specialized in meta data management for years. This paper presents our insights into the subject, plus some of the assignments that taught us those lessons.
1. The Meta Data Territory

Ten years ago, meta data was a terribly obscure term. If you used it, you first had to define it; then you had to demonstrate that you were nonetheless a practical person with something useful to say to your audience. All that has changed. A glance at trade magazines or seminar offerings shows that the term meta data appears everywhere. Nevertheless, there are two separate conversations going on: one about business meta data and one about technical meta data. It’s a useful distinction to understand.
Business Meta Data for Business Users

Business users are now asking for meta data, thanks to data warehousing. With data warehouses, we are publishing data far beyond the narrow stovepipe systems of the past. When only the sales department used the sales tracking system, no one had to be told what Monthly Sales by Customer included or how it was calculated. Everyone knew. Publish that data in a warehouse, though, and suddenly a lot of people see the figures who have no idea what they mean. For example, does Sales refer to orders booked or to goods shipped? And who is a Customer, perhaps the knottiest piece of meta data in any organization? These are meta data questions. Data warehouse users know they need meta data, and they’re asking for it by that name.
Technical Meta Data for System Developers

For system builders, the meta data story has more threads.

At one level are the data warehouse builders who must assemble data from diverse sources. After handling the technical issues of connectivity and format conversion, they are left with the apples and oranges problem: a "sale" in source A turns out to be a different kind of thing from a "sale" in source B. How can this data be consolidated? Notice that this is not an IT question. The business user cares about the answer and will want to know what solution was chosen. The solution to the apples and oranges problem becomes part of the business meta data discussed above.
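To make the apples and oranges problem concrete, here is a minimal sketch of one way a warehouse team might reconcile two source definitions of a "sale". The field names and the business rule are invented for illustration; they are not drawn from any client engagement.

```python
# Hypothetical sketch: reconciling two source systems' notions of a "sale".
# Source A records a sale when an order is booked; source B records it when
# goods are shipped. The warehouse team must choose one rule and document it
# as business meta data.

def to_warehouse_sale(record: dict, source: str):
    """Map a source record to the warehouse definition of a sale.

    Warehouse rule (the documented business meta data): a sale is counted
    only when goods have shipped.
    """
    if source == "A":
        # Source A holds booked orders; keep only those already shipped.
        if record.get("ship_date") is None:
            return None
        return {"customer_id": record["cust_no"],
                "amount": record["order_value"],
                "sale_date": record["ship_date"]}
    if source == "B":
        # Source B holds shipments, which are already sales under the rule.
        return {"customer_id": record["customer"],
                "amount": record["net_amount"],
                "sale_date": record["shipped_on"]}
    raise ValueError(f"unknown source: {source}")
```

Whichever rule is chosen, it belongs in the business meta data so that warehouse users can see how the figures were consolidated.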
At the next level are the data warehouse builders
and the technical meta data they create in order to build the
system. Strictly speaking, this is just traditional system documentation.
The designer lays out the programs and data stores through which data
flows from source to user. He then documents this somehow (or he doesn’t).
What’s new about that? What’s new is the spotlight that data warehouse
applications throw on this meta data. We are finding unusual business
value in having this meta data both exist and be retrievable. Here’s
why:
- Business meta data queries. Remember, the business user is now looking for meta data about unfamiliar information. When the business user frowns, "This number can’t be right. Where did it come from?", he triggers a technical inquiry into source data tables, extract programs, cleanup algorithms, and so on. It’s no longer garbage in, gospel out. It’s garbage in, technical meta data inquiry out. Of course, the business user doesn’t do that inquiry himself. IT does.
- Expansion and reuse. Data warehouses are extremely popular. If you are successful, you will soon be expanding the data warehouse. Or, if you build a parallel warehouse instead, you will want to reuse the extraction logic that takes data from the personnel system. Either way, there’s value in having the technical meta data readily available.
Lastly, and having nothing in particular to do with data warehouses, component reuse is coming to the industry. Distributed objects, CORBA objects, DCOM, ActiveX controls, Java Beans: all of these are aiming at a world of plug-in software development. Meta data about those components is absolutely vital to the success of these efforts. The components must be described and cataloged. When a component is included in a system design, that dependency must be recorded. With such extremely complex relationships, meta data management becomes essential.
2. Managing Business Meta Data

The management of business meta data comes into play when the organization looks beyond its collection of stovepipe systems.

The Stovepipe Approach

Any given standalone system doesn’t have much requirement for meta data, much less for the management of it. We usually find that:
- The users know the definitions of the data elements, allowable codes, and so on.
- Local work-arounds are implemented and taught by word of mouth (such as that state code "XX" means the work is being billed to that special project in Germany).
- The users don’t know how certain calculations were done, but the output seems OK, so life goes on.
- If the system needs inputs from another system, a point-to-point bridge program is built. The transformation logic is built into the code. The design documents are mislaid. No attempt is made to create a generalizable solution to the data transformations.
This is not bad. It is a perfectly workable state of affairs in
a standalone world. It has the great virtue of being cheap to build. It
is, however, expensive to maintain and to integrate.
The Central Planning Model

The late 1980s saw many efforts to remedy the stovepipe problem by swinging all the way to the opposite extreme: central planning. The enterprise’s systems will be redesigned from the top down. If Customer is modeled and defined centrally, then all systems will use the same definition (and perhaps the same database). Integration will become easy. The apples and oranges problem will go away, along with the complex and numerous bridge programs that consume so many maintenance dollars. IBM’s AD/Cycle and MVS Repository were children of this idea, as were Information Engineering, Texas Instruments’ IEF CASE/development product, and many others.
Acton Burnell worked on several such projects in those days. Coming in at the working level, we quickly learned that the core idea is elegant but impossible to execute. More precisely, we learned that:

- Standardization costs.
- Not everything needs to be shared.
- Where there’s no business requirement for sharing, standardization has no payoff and will not get done.
However, along the way, we learned the basic skills of data
integration, skills now much in demand for data warehousing. These skills
include data analysis, semantic reconciliation, data administration, and
repositories. And of course, we learned to gauge what’s possible and
appropriate in a particular organization.
What Works

Choosing How Much to Bite Off

Regarding meta data, every organization continues to face the same fundamental tension that we saw in the stovepipe model vs. the central-planning model: how much to bite off. We see this tension in articles advocating the corporate data warehouse as opposed to departmental data marts. The advantage of the big warehouse is that it allows comparisons and roll-ups across a large part of the organization. But it’s also much harder to build. (It’s bigger, and it gets into a lot more people’s business.) By contrast, the departmental data mart has less value, but also less cost and risk.

There is no right answer to this choice, but the choice should be made candidly before deciding what meta data to manage and how to manage it. A clear view of the business value, costs, risks, and politics of the program is essential.
The U.S. Department of
State was implementing a data administration program to improve data
sharing across the department. They were having trouble implementing a
top-down program. We helped them re-orient their program toward one that
was a better fit with the highly autonomous bureaus within the
department.
We are helping an international snack food company
implement a corporate inter-system data transfer facility. Using it,
existing computer systems that participate can exchange data using
standard definitions. This will greatly reduce the inventory of
bi-lateral bridge programs that must be maintained. Our contribution was
to know how much to standardize, as well as the nuts and bolts of
how to do it.
Publish, Publish, Publish

We believe that publishing is the overlooked part of meta data management. By publishing definitions to the business users, you deliver value to them. This builds support for the program. Too often, the meta data stops with the software development team. They write down the business meaning of the data they need. They work out the extract algorithms and the clean-up logic. They build the system (data warehouse or other). Then the meta data disappears from view.
For a commercial
client, we built desktop software that allows business users to look up
the definitions of commonly used business terms and product names.
Although a side product of our meta data management work, this proved
very popular, and a key to retaining funding until the larger effort
showed results.
For the IT shop of USF&G Insurance Company,
we fleshed out the vision for a "System Component Center". The heart of
this was a meta data catalog of reusable IT components at every level. We
helped pick the repository tool that held the catalog. We also developed
materials for a publicity campaign to the IT staff to promote use of the
catalog. All of our work came from our belief in the importance of
publishing meta data.
3. Tools for Business Meta Data

Tools are essential for collecting, storing, and publishing business meta data.

Collection Tools

For collection, we use data models and modeling tools. Even the smallest data warehouse project has more business meta data than people can easily remember. And the nuances are important.

We usually begin with a CASE tool (e.g., ERwin, from Logicworks). Although the CASE tools do not force the modeler to enter definitions, it is essential that she do so. Information, expertise, and motivation all come together at this point, so the definitions of things (and their allowable values and appropriate names) must be captured now.
Storage Tools

CASE tools have historically made poor catalogs. That is, if you want to look up the definition of Gross Sales Amount, the CASE tool does not present a list of items from which you choose. Instead, you must first find the item in the model diagram.

Furthermore, the CASE tool does a poor job of storing information about data transformations. For example, if Gross Sales Amount in the data warehouse derives from a flat file input calculated against a table of foreign exchange rates, that logic is hard to store in the data model. Indeed, the flat file may not be representable at all.
This is where a repository can be useful. These products are designed specifically to store meta data, especially all of the relationships between items, in ways that are queryable.
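As a rough sketch of what "queryable" means here, a minimal repository might hold items, their definitions, and typed relationships between them. The item names and relationship type below are invented for illustration; they are not taken from any particular repository product.

```python
# Minimal illustrative meta data repository: items with definitions, plus
# typed relationships between them, queryable for lineage.
items = {
    "Gross Sales Amount": "Monthly sales in US dollars, converted at month-end rates.",
    "SALES_EXTRACT.DAT":  "Flat file extracted nightly from the order system.",
    "FX_RATES":           "Table of month-end foreign exchange rates.",
}

# (warehouse item, relationship, source item)
relationships = [
    ("Gross Sales Amount", "derived from", "SALES_EXTRACT.DAT"),
    ("Gross Sales Amount", "derived from", "FX_RATES"),
]

def lineage(item: str) -> list:
    """Return the items that a given warehouse item is derived from."""
    return [src for (tgt, rel, src) in relationships
            if tgt == item and rel == "derived from"]

if __name__ == "__main__":
    print(items["Gross Sales Amount"])
    print("Derived from:", lineage("Gross Sales Amount"))
```

The point is not the storage mechanism but the query: a user or maintainer can ask for a definition or a derivation by name, without opening a model diagram.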
Acton Burnell has deep repository expertise. We have built and sold our own low-end product (PC Dictionary). We have set up, programmed, and used all of the major repository products (Rochade, Platinum, DATAMANAGER, etc.). We have also created limited-function repositories in Microsoft Access where that’s been sufficient.
Publishing Tools

Today, vendors of some warehouse query tools advertise that users can "click on a report column and get the business definition". This is excellent. However, the movement of the business meta data from its origination point (the CASE tool) into the query tool is less than smooth. Likewise, the management of that meta data over time is not smooth. The result is often duplicate stores of meta data.
For a large insurance
company’s Claims data mart project, we developed a Microsoft Access
application which published all relevant business meta data in the data
mart. This repository of meta data was essential to the success of the
data mart. Without it, the business users had no faith in the integrity
of the data in the mart. With it, they could get not only business definitions of the data items but also detailed explanations of source-to-mart data transformations.
For one small data
warehouse, we published meta data as a Windows Help file embedded inside
the client/server warehouse viewer. The meta data came directly from the
ERwin data model into Word, and then into a Help
file.
For our PC Dictionary
product, we sell a client/server repository browser. Designed as an
easy-to-use manager’s tool, this allows anyone to look up meta data in
the repository with no training.
We look forward to seeing better integration between the tools that
store business meta data.
4. Managing Technical Meta Data

The bad news is that technical meta data is even more fragmented than business meta data (and more voluminous). The good news is that vendors in this arena acknowledge the role of technical meta data and have partial solutions.

Because of our extensive experience with large repository products, we do not see these as suitable for managing technical meta data. Although these tools can hold and report the meta data, loading the meta data is difficult. Repository import programs are expensive, quirky, and limited. Worse, if items are loaded separately, their relationships must be set manually in the repository. This is a workload that no IT shop will sustain on any scale.
Extract

Here, we have been impressed with a product called DataStage from Ardent (formerly V-Mark). It acts as a shell around the entire extract, clean, transform, and load process. It automatically loads meta data regarding source and target structures. It allows the developer to express the overall flow graphically, with drill-downs into the detailed transformations that must happen. It includes its own coding language for simple logic, and the ability to call external routines for more complicated transforms. Furthermore, it is repository-based, so design meta data about the entire supply side of the warehouse is stored in one place. However, the repository is proprietary and useless outside of DataStage itself.
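The underlying idea of capturing design meta data as a by-product of defining the flow can be sketched generically. The code below is an invented illustration only; it is not DataStage syntax or its API.

```python
# Illustrative sketch: each pipeline step declares its source and target
# structures, so source-to-target design meta data is captured as the flow
# itself is defined, rather than documented separately afterwards.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    source: str          # e.g. a flat file name
    target: str          # e.g. a warehouse table
    transform: Callable  # row-level transformation logic

@dataclass
class Pipeline:
    steps: list = field(default_factory=list)

    def add(self, step: Step) -> None:
        self.steps.append(step)

    def design_meta_data(self) -> list:
        """Source-to-target meta data, harvested from the flow definition."""
        return [(s.source, s.name, s.target) for s in self.steps]

pipeline = Pipeline()
pipeline.add(Step("convert_to_usd", "SALES_EXTRACT.DAT", "warehouse.fact_sales",
                  lambda row: {**row, "amount_usd": row["amount"] * row["fx_rate"]}))
print(pipeline.design_meta_data())
```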
Clean

Data being fed into data warehouses is notoriously dirty. Popular estimates are that 60% of a warehouse development project will go into handling data quality problems. That matches our experience.

In all cases, we begin the project with a visual scan of the source data to assess its quality. Where the data is relational, we use Microsoft Access.
More thorough examination of data quality requires
automation.
We have done data
quality audits for many clients. We often use a product from Prism
called PQM (Prism Quality Manager). We have packaged these data quality
services into an offering called PerAudit.
For runtime operation, we do not use PQM. We manually code the
necessary quality checks and actions.
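The hand-coded checks we mean are typically simple rules applied record by record. The following sketch is hypothetical; the rules and field names are invented rather than taken from PQM or from any client system.

```python
# Hypothetical runtime data quality checks, applied as records are loaded.
from datetime import date

VALID_STATE_CODES = {"VA", "MD", "DC", "XX"}   # "XX" is a documented work-around

def check_record(row: dict) -> list:
    """Return a list of data quality problems found in one source record."""
    problems = []
    if row.get("amount") is None or row["amount"] < 0:
        problems.append("amount missing or negative")
    if row.get("state_code") not in VALID_STATE_CODES:
        problems.append(f"unknown state code: {row.get('state_code')!r}")
    if row.get("sale_date") and row["sale_date"] > date.today():
        problems.append("sale date is in the future")
    return problems

# Records with problems are typically routed to a reject file for review
# rather than loaded into the warehouse.
rejects = [(row, probs) for row in [{"amount": -5, "state_code": "QQ"}]
           if (probs := check_record(row))]
```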
Ideally, this runtime code could salvage the meta data that PQM uses. Unfortunately, PQM (like others) stores this quality-check meta data in its own proprietary repository. It also expresses the meta data using a proprietary language. Thus, the meta data used to evaluate quality up front (via these tools) cannot be moved directly into the production system.
Transform

This is DataStage’s strength. Other products in this area come from Prism, ETI, and Carleton. Prism stresses that it is repository-based. Again, though, the repository is a proprietary one.
Load

The fastest database loads use loaders supplied by the DBMS vendors. To operate, these require meta data instructions (take this string and put it in field "X"). We have not found a product that can generate these loader instructions (although tools such as DataStage bypass the loaders).
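For illustration only, here is a hypothetical sketch of generating such "take this string and put it in field X" instructions from stored column meta data. The control-file format shown is made up; it is not the syntax of any particular DBMS loader.

```python
# Hypothetical: generate field-mapping instructions for a bulk loader from
# column meta data (name, start position, length) held in a repository.
columns = [
    ("customer_id",  1, 10),
    ("sale_date",   11,  8),
    ("amount_usd",  19, 12),
]

def loader_instructions(table: str, cols: list) -> str:
    """Render a simple, invented control-file format from column meta data."""
    lines = [f"LOAD INTO {table} ("]
    lines += [f"  {name} POSITION({start}:{start + length - 1}),"
              for name, start, length in cols]
    lines[-1] = lines[-1].rstrip(",")   # no trailing comma on the last column
    lines.append(")")
    return "\n".join(lines)

print(loader_instructions("fact_sales", columns))
```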
5. The Microsoft Repository

In the spring of 1997, Microsoft released the Microsoft Repository, version 1.0, bundling it with the high-end version of its Visual Basic 5.0 product.
Our hope, widely
shared, is that Microsoft will have enough clout to bring standardization
to the fragmented world of tool repositories. Already, dozens of tool
vendors have announced plans to make their tools compatible in some
fashion with the Microsoft Repository. At the same time, Microsoft is
working with industry groups to extend the tool’s metamodel. Among the
enhancements will be better support for data items. (The initial release
is heavily weighted towards program artifacts.)
Today, the Microsoft Repository is not a useful tool for meta data management of the kind discussed here. However, in four years it might be. If that happens, business meta data created in a CASE tool might move cleanly to the user’s desktop. Technical meta data created by the system designer could be easily retrievable by the maintenance programmer tracking down a suspect derivation. And the repository itself could hold the official meta data. Or, perhaps, none of this will happen, any more than it did with the IBM Repository. We are paying close attention to developments on this front.
