Acton Burnell White Paper
Meta Data Management


NOTE: This paper was prepared for an audience of data warehouse developers and users. It addresses meta data management in the context of a data warehouse environment.

 


Acton Burnell has specialized in meta data management for years. This paper presents the lessons we have learned, along with some of the assignments that taught them to us.

1. The Meta Data Territory
Ten years ago, meta data was a terribly obscure term. If you used it, you first had to define it; then, you had to demonstrate that you were nonetheless a practical person with something useful to say to your audience. All that has changed. A glance at trade magazines or seminar offerings shows that the term meta data appears everywhere. Nevertheless, there are two separate conversations going on – one about business meta data and one about technical meta data. It’s a useful distinction to understand.

Business Meta Data for Business Users
Business users are now asking for meta data, thanks to data warehousing. With data warehouses, we are publishing data far beyond the narrow stovepipe systems of the past. When only the sales department used the sales tracking system, no one had to be told what Monthly Sales by Customer included or how it was calculated. Everyone knew. Publish that data in a warehouse, though, and suddenly a lot of people see the figures who have no idea what they mean. For example, does Sales refer to orders booked or to goods shipped? And who counts as a Customer – perhaps the knottiest piece of meta data in any organization? These are meta data questions. Data warehouse users know they need meta data, and they’re asking for it by that name.

Technical Meta Data for System Developers
For system builders, the meta data story has more threads.

At one level are the data warehouse builders who must assemble data from diverse sources. After handling the technical issues of connectivity and format conversion, they are left with the apples and oranges problem: a "sale" in source A turns out to be a different kind of thing from a "sale" in source B. How can this data be consolidated? Notice that this is not merely an IT question. The business user cares about the answer and will want to know what solution was chosen. The solution to the apples and oranges problem becomes part of the business meta data discussed above.
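As an illustration, here is a minimal sketch in Python of the kind of reconciliation such a team writes. The source layouts and the rule chosen (recognize a sale when goods ship) are invented for this example; the point is that whatever rule is chosen becomes business meta data that must be published.

```python
# Illustrative only: two hypothetical source systems disagree about
# what a "sale" is. Source A records a sale at order booking;
# source B records a sale at shipment.
source_a = [{"order_id": 1, "booked_amt": 500.0, "shipped": False},
            {"order_id": 2, "booked_amt": 750.0, "shipped": True}]
source_b = [{"ship_id": 9, "shipped_amt": 320.0}]

# The reconciliation rule chosen here is an assumption for the
# example -- and exactly the kind of decision that must be
# recorded and published as business meta data.
RULE = "A 'sale' is recognized when goods ship, not when booked."

def consolidate(a_rows, b_rows):
    sales = [r["booked_amt"] for r in a_rows if r["shipped"]]
    sales += [r["shipped_amt"] for r in b_rows]
    return sum(sales)

print(RULE)
print("Consolidated sales:", consolidate(source_a, source_b))
```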

At the next level are the data warehouse builders and the technical meta data they create in order to build the system. Strictly speaking, this is just traditional system documentation. The designer lays out the programs and data stores through which data flows from source to user. He then documents this somehow (or he doesn’t). What’s new about that? What’s new is the spotlight that data warehouse applications throw on this meta data. We are finding unusual business value in having this meta data both exist and be retrievable. Here’s why:

• Business meta data queries. Remember, the business user is now looking for meta data about unfamiliar information. When the business user frowns, "This number can’t be right. Where did it come from?", he triggers a technical inquiry into source data tables, extract programs, cleanup algorithms, and so on. It’s no longer garbage in, gospel out. It’s garbage in, technical meta data inquiry out. Of course, the business user doesn’t do that inquiry himself; IT does. (A sketch of such a lineage inquiry follows this list.)

• Expansion and reuse. Data warehouses are extremely popular. If you are successful, you will soon be expanding the data warehouse. Or, if you build a parallel warehouse instead, you will want to reuse the extraction logic that takes data from the personnel system. Either way, there’s value in having the technical meta data readily available.
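To make the lineage inquiry concrete, here is a minimal Python sketch of walking recorded technical meta data from a warehouse item back to its sources. The item names, program names, and structure are all invented for illustration.

```python
# Hypothetical lineage records: each warehouse item maps to the
# step that produced it and the inputs that step consumed.
LINEAGE = {
    "warehouse.monthly_sales": ("summarize.sql", ["staging.sales_clean"]),
    "staging.sales_clean":     ("cleanup.py",    ["extract.sales_raw"]),
    "extract.sales_raw":       ("extract.ctl",   ["ORDERS (source system)"]),
}

def trace(item, depth=0):
    """Print the chain of programs and data stores behind an item."""
    print("  " * depth + item)
    step = LINEAGE.get(item)
    if step is None:        # reached an original source
        return
    program, inputs = step
    print("  " * depth + f"  produced by: {program}")
    for src in inputs:
        trace(src, depth + 1)

# "Where did this number come from?" becomes a mechanical walk.
trace("warehouse.monthly_sales")
```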

Lastly – having nothing particular to do with data warehouses – component reuse is coming to the industry. Distributed objects, CORBA objects, DCOM, ActiveX controls, Java Beans – all of these aim at a world of plug-in software development. Meta data about those components is absolutely vital to the success of these efforts. The components must be described and cataloged, and when a component is included in a system design, that dependency must be recorded. With so many complex relationships to track, meta data management becomes essential.
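A toy sketch of that cataloging idea follows. The component names and descriptions are invented, and a real catalog would live in a repository rather than in memory; the point is simply that both the descriptions and the usage dependencies are recorded and queryable.

```python
# A toy component catalog: each reusable component is described,
# and each use of a component in a system design is recorded as
# a dependency. All names here are invented for illustration.
catalog = {
    "TaxCalc-1.2":  "ActiveX control: computes sales tax by state",
    "CustLookup-3": "CORBA object: resolves customer identifiers",
}
dependencies = []  # (system, component) pairs

def register_use(system, component):
    if component not in catalog:
        raise KeyError(f"{component} is not cataloged")
    dependencies.append((system, component))

register_use("Order Entry redesign", "TaxCalc-1.2")

# Impact analysis: which systems break if TaxCalc-1.2 changes?
affected = [s for s, c in dependencies if c == "TaxCalc-1.2"]
print("Systems depending on TaxCalc-1.2:", affected)
```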

2. Managing Business Meta Data
The management of business meta data comes into play when the organization looks beyond its collection of stovepipe systems.

The Stovepipe Approach
Any given standalone system has little need for meta data, much less for meta data management. We usually find that:

• The users know the definitions of the data elements, allowable codes, and so on.

• Local work-arounds are implemented and taught by word of mouth (for example, state code "XX" means the work is being billed to that special project in Germany).

• The users don’t know how certain calculations were done, but the output seems OK, so life goes on.

• If the system needs inputs from another system, a point-to-point bridge program is built (sketched below). The transformation logic is buried in the code. The design documents are mislaid. No attempt is made to create a generalizable solution to the data transformations.
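Here is what such a bridge typically looks like, reduced to a Python sketch. The field names and the "XX" work-around are invented; note how the transformation rule lives only in the code, invisible to anyone who did not write it.

```python
# A typical point-to-point bridge: the transformation logic is
# hard-coded for exactly these two systems. Field names and the
# "XX" work-around are invented examples.
def bridge_payroll_to_billing(payroll_row):
    """One-off transform from the payroll layout to the billing layout."""
    state = payroll_row["state"]
    # Local work-around, taught by word of mouth:
    project = "GERMANY-SPECIAL" if state == "XX" else payroll_row["proj"]
    return {"emp": payroll_row["emp_id"], "bill_to": project}

print(bridge_payroll_to_billing({"emp_id": 7, "state": "XX", "proj": "P1"}))
```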

This is not bad. It is a perfectly workable state of affairs in a standalone world. It has the great virtue of being cheap to build. It is, however, expensive to maintain and to integrate.

The Central Planning Model
The late 1980’s saw many efforts to remedy the stovepipe problem by swinging all the way to the opposite extreme: central planning. The enterprise’s systems will be redesigned from the top down. If Customer is modeled and defined centrally, then all systems will use the same definition (and perhaps the same database). Integration will become easy. The apples and oranges problem will go away, along with the complex and numerous bridge programs that consume so many maintenance dollars. IBM’s AD/Cycle and MVS Repository were children of this idea, as were Information Engineering, Texas Instruments’ IEF CASE/development product, and many others.

Acton Burnell worked on several such projects in those days. Coming in at the working level, we quickly learned that the core idea is elegant, but impossible to execute. More precisely, we learned that:

• Standardization costs.

• Not everything needs to be shared.

• Where there’s no business requirement for sharing, standardization has no payoff and will not get done.

However, along the way, we learned the basic skills of data integration, skills now much in demand for data warehousing. These skills include data analysis, semantic reconciliation, data administration, and repositories. And of course, we learned to gauge what’s possible and appropriate in a particular organization.

What Works

Choosing How Much to Bite Off
Regarding meta data, every organization continues to face the same fundamental tension that we saw in the stovepipe model vs. the central-planning model: how much to bite off. We see this tension in articles advocating the corporate data warehouse as opposed to departmental data marts. The advantage of the big warehouse is that it allows comparisons and roll-ups across a large part of the organization. But it’s also much harder to build. (It’s bigger, and it gets into a lot more people’s business.) By contrast, the departmental data mart has less value, but also less cost and risk.

There is no one right answer to this choice, but the choice should be made candidly before deciding what meta data to manage and how to manage it. A clear view of the business value, costs, risks, and politics of the program is essential.

The U.S. Department of State was implementing a data administration program to improve data sharing across the department. Its top-down approach was proving difficult to execute, and we helped re-orient the program toward one better suited to the highly autonomous bureaus within the department.

We are helping an international snack food company implement a corporate inter-system data transfer facility. Using it, participating computer systems can exchange data using standard definitions. This will greatly reduce the inventory of bilateral bridge programs that must be maintained. Our contribution was knowing how much to standardize, as well as the nuts and bolts of how to do it.
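We cannot publish the client’s design, but the underlying idea can be sketched as follows: each participating system supplies one adapter to and one adapter from a shared canonical record, so the number of programs grows with the number of systems rather than with the number of system pairs. The system names and layouts below are invented for illustration.

```python
# Sketch of the standard-definitions idea: instead of one bridge
# per pair of systems, each system owns one adapter to and from a
# shared canonical record. Layouts are invented for illustration.
def sales_to_canonical(row):        # adapter owned by the sales system
    return {"customer": row["cust"], "amount": row["amt_usd"]}

def canonical_to_finance(rec):      # adapter owned by the finance system
    return {"acct_name": rec["customer"], "value": rec["amount"]}

sales_row = {"cust": "ACME", "amt_usd": 120.0}
print(canonical_to_finance(sales_to_canonical(sales_row)))
```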


Publish, Publish, Publish
We believe that publishing is the overlooked part of meta data management. By publishing definitions to the business users, you deliver value to them. This builds support for the program. Too often, the meta data stops with the software development team. They write down the business meaning of the data they need. They work out the extract algorithms and the clean-up logic. They build the system (data warehouse or other). Then the meta data disappears from view.

For a commercial client, we built desktop software that allows business users to look up the definitions of commonly used business terms and product names. Although a side product of our meta data management work, this proved very popular and was a key to retaining funding until the larger effort showed results.

For the IT shop of USF&G Insurance Company, we fleshed out the vision for a "System Component Center". The heart of this was a meta data catalog of reusable IT components at every level. We helped pick the repository tool that held the catalog. We also developed materials for a publicity campaign to promote use of the catalog among the IT staff. All of this work grew from our belief in the importance of publishing meta data.

3. Tools for Business Meta Data
Tools are essential for collecting, storing, and publishing business meta data.

Collection Tools
For collection, we use data models and modeling tools. Even the smallest data warehouse project has more business meta data than people can easily remember. And the nuances are important.

We usually begin with a CASE tool (e.g., ERwin, from Logicworks). Although the CASE tools do not force the modeler to enter definitions, it is essential that she do so. Information, expertise, and motivation all come together at this point. The definitions of things (and their allowable values and appropriate names) must be captured then and there.

Storage Tools
CASE tools have historically made poor catalogs. That is, if you want to look up the definition of Gross Sales Amount, the CASE tool does not present a list of items from which you choose. Instead, you must first find the item in the model diagram.

Furthermore, the CASE tool does a poor job of storing information about data transformations. For example, if Gross Sales Amount in the data warehouse derives from a flat file input calculated against a table of foreign exchange rates, that logic is hard to store in the data model. Indeed, the flat file may not be representable at all.

This is where a repository can be useful. These products are designed specifically to store meta data – especially all of the relationships between items – in ways which are queryable.
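As a minimal illustration of what "queryable" means here, consider the following sketch: items and typed relationships that can be looked up by name rather than hunted down in a diagram. The schema and contents are invented for the example.

```python
# Minimal repository sketch: items plus typed relationships, both
# queryable by name -- unlike a model diagram you must search
# visually. Schema and contents are invented for illustration.
items = {
    "Gross Sales Amount": "Total invoiced sales before returns.",
    "FX Rate Table":      "Daily foreign exchange rates.",
}
relationships = [
    ("Gross Sales Amount", "derived-from", "FX Rate Table"),
    ("Gross Sales Amount", "derived-from", "SALESFEED flat file"),
]

def definition(name):
    return items.get(name, "(no definition recorded)")

def related(name):
    return [(rel, tgt) for src, rel, tgt in relationships if src == name]

print(definition("Gross Sales Amount"))
print(related("Gross Sales Amount"))
```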

Acton Burnell has deep repository expertise. We have built and sold our own low-end product (PC Dictionary). We have set up, programmed, and used all of the major repository products (Rochade, Platinum, DATAMANAGER, etc.). We have also created limited-function repositories in Microsoft Access where that has been sufficient.

Publishing Tools
Today, vendors of some warehouse query tools advertise that users can "click on a report column and get the business definition". This is excellent. However, moving the business meta data from its origination point (the CASE tool) into the query tool is less than smooth, and so is managing that meta data over time. The result is often duplicate stores of meta data.

For a large insurance company’s Claims data mart project, we developed a Microsoft Access application which published all relevant business meta data in the data mart. This repository of meta data was essential to the success of the data mart. Without it, the business users had no faith in the integrity of the data in the mart. With it, they could get not only business definitions of the data items but also detailed explanations of source-to-mart data transformations.

For one small data warehouse, we published meta data as a Windows Help file embedded inside the client/server warehouse viewer. The meta data came directly from the ERwin data model into Word, and then into a Help file.

For our PC Dictionary product, we sell a client/server repository browser. Designed as an easy-to-use manager’s tool, it allows anyone to look up meta data in the repository with no training.

We look forward to seeing better integration between the tools that store business meta data.

4. Managing Technical Meta Data
The bad news is that technical meta data is even more fragmented than business meta data (and more voluminous). The good news is that vendors in this arena acknowledge the role of technical meta data and have partial solutions.

Our extensive experience with large repository products leads us to conclude that they are not suitable for managing technical meta data. Although these tools can hold and report the meta data, loading it is difficult. Repository import programs are expensive, quirky, and limited. Worse, if items are loaded separately, their relationships must be set manually in the repository. That is a workload no IT shop will sustain at any scale.

Extract
Here, we have been impressed with a product called DataStage from Ardent (formerly V-Mark). It acts as a shell around the entire extract, clean, transform, and load process. It automatically loads meta data regarding source and target structures. It allows the developer to express the overall flow graphically, with drill-downs into the detailed transformations that must happen. It includes its own coding language for simple logic, and the ability to call external routines for more complicated transforms. Furthermore, it is repository-based, so design meta data about the entire supply side of the warehouse is stored in one place. However, the repository is proprietary and useless outside of DataStage itself.
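The shell idea can be sketched generically (this is not DataStage’s actual API, just an illustration of the principle): each stage is registered along with descriptive meta data, so the design of the flow is stored and queryable alongside the runnable code.

```python
# Generic illustration of the "shell" idea: every stage of the
# extract/clean/transform/load flow is registered with its own
# meta data, so the whole design lives in one queryable place.
pipeline = []

def stage(name, description):
    def wrap(fn):
        pipeline.append({"name": name, "describes": description, "fn": fn})
        return fn
    return wrap

@stage("extract", "Pull raw sales rows from the source feed")
def extract():
    return [{"amt": "100"}, {"amt": "250"}]

@stage("transform", "Convert amounts from strings to floats")
def transform(rows):
    return [{"amt": float(r["amt"])} for r in rows]

# The design meta data is queryable alongside the runnable flow.
for s in pipeline:
    print(s["name"], "-", s["describes"])

print(transform(extract()))
```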

Clean
Data being fed into data warehouses is notoriously dirty. Popular estimates are that 60% of a warehouse development project will go into handling data quality problems. That matches our experience.

In all cases, we begin with a visual scan of the source data. Where the data is relational, we use Microsoft Access.

More thorough examination of data quality requires automation.

We have done data quality audits for many clients. We often use a product from Prism called PQM (Prism Quality Manager). We have packaged these data quality services into an offering called PerAudit.

For runtime operation, we do not use PQM. We manually code the necessary quality checks and actions.
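Those hand-coded checks usually look something like the following sketch. The rules and field names are invented examples of a reject-or-flag policy; a real implementation would log the rejects for later analysis.

```python
# The kind of hand-coded runtime quality check described above.
# Rules and field names are invented examples.
def check_row(row, errors):
    if not row.get("cust_id"):
        errors.append(("missing cust_id", row))
        return None                      # reject the row outright
    if row.get("state") not in {"VA", "MD", "DC"}:
        row = dict(row, state="??")      # flag, but keep the row
    return row

errors, clean = [], []
for row in [{"cust_id": "C1", "state": "VA"},
            {"cust_id": "",   "state": "VA"}]:
    out = check_row(row, errors)
    if out:
        clean.append(out)

print(f"{len(clean)} rows passed, {len(errors)} rejected")
```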

Ideally, this runtime code could reuse the meta data already captured by PQM. Unfortunately, PQM (like others) stores its quality-check meta data in a proprietary repository and expresses it in a proprietary language. Thus, the meta data used to evaluate quality up front (via these tools) cannot be moved directly into the production system.

Transform
This is DataStage’s strength. Other products in this area come from Prism, ETI, and Carleton. Prism stresses that it is repository-based. Again, though, the repository is a proprietary one.

Load
The fastest database loads use loaders supplied by the DBMS vendors. To operate, these require meta data instructions (take this string and put it in field "X"). We have not found a product which can generate these loader instructions (although tools such as DataStage bypass the loaders).
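To show what generating loader instructions from meta data might look like, here is an illustrative sketch. The output format is invented for the example, since real loader control syntax varies by DBMS vendor; the point is that the field positions come from a meta data table rather than being hand-typed.

```python
# What generating loader instructions from meta data might look
# like. The output syntax is invented; real control-file formats
# differ by DBMS vendor.
field_meta = [  # (field name, start column, length)
    ("CUST_ID",   1, 8),
    ("SALES_AMT", 9, 12),
]

def loader_instructions(table, fields):
    lines = [f"LOAD INTO {table} ("]
    lines += [f"  {name} POSITION({start}:{start + length - 1}),"
              for name, start, length in fields]
    lines[-1] = lines[-1].rstrip(",")    # drop the trailing comma
    lines.append(")")
    return "\n".join(lines)

print(loader_instructions("SALES_FACT", field_meta))
```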

5. The Microsoft Repository
In the spring of 1997, Microsoft released the Microsoft Repository, version 1.0. Microsoft bundles the repository with the high-end version of its Visual Basic 5.0 product.

Our hope, widely shared, is that Microsoft will have enough clout to bring standardization to the fragmented world of tool repositories. Already, dozens of tool vendors have announced plans to make their tools compatible in some fashion with the Microsoft Repository. At the same time, Microsoft is working with industry groups to extend the tool’s metamodel. Among the enhancements will be better support for data items. (The initial release is heavily weighted towards program artifacts.)

Today, the Microsoft Repository is not a useful tool for meta data management of the kind discussed here. In four years, however, it might be. If the planned enhancements arrive, business meta data created in a CASE tool might move cleanly to the user’s desktop; technical meta data created by the system designer could be easily retrievable by the maintenance programmer tracking down a suspect derivation; and the repository itself could hold the official meta data. Or, perhaps, none of this will happen, any more than it did with the IBM Repository. We are watching developments on this front closely.



Copyright © 2000 Acton Burnell. All Rights Reserved.

1500 N. Beauregard Street, Suite 210 • Alexandria, VA 22311-1715
Phone: (703) 671-0700 • Fax: (703) 671-8938
Email: webmaster@actonb.com