By Colin White
Most techniques used by organizations to build a data warehousing system employ either a top-down or bottom-up development approach. In the top-down approach, an enterprise data warehouse (EDW) is built in an iterative manner, business area by business area, and underlying dependent data marts are created as required from the EDW contents. In the bottom-up approach, independent data marts are created with the view to integrating them into an enterprise data warehouse at some time in the future. There is much debate in the industry about the pros and cons of the two approaches. In reality, there is a steady trend toward the use of independent data marts; and the move toward the use of turnkey analytic application packages will accelerate this trend.
What organizations require is a solution that offers the low cost and rapid ROI advantages of the independent data mart approach without the problems of data integration in the future. To achieve this, the design and development of independent data marts must be managed and must be based on a shared business model of an organization's business intelligence requirements. This hybrid solution is called a federated data warehouse (see Figure 1). Two key components of a federated data warehouse are the common business model and shared information staging areas.
The federated data warehouse approach supports the iterative development of a data warehousing system containing independent data marts. The key to data integration in a federated data warehouse is a common business model of the business information stored and managed by the warehousing system. The creation of a common business model ensures consistency in the use of data names and business definitions across warehousing projects.
The common business model is updated as new data marts are built. When data mart design is dictated by the data that exists in operational systems, the common business model is used, and updated as appropriate, in parallel with the development of underlying data mart data models. In an environment driven directly from business-user business intelligence requirements, the common business model is first developed and then used to create one or more underlying data mart data models. Several vendor products support a business model-driven approach to data mart development. If packaged analytic applications are used, it is important that the business models used by these applications can be customized and integrated with the organization's common business model.
Figure 1: Federated Data Warehouse
The use of vendor- supplied customizable business area templates can significantly reduce the effort involved in developing the common business model and its associated data models. A business area template documents the business metrics and underlying business rules that are typically required by users when analyzing and modeling a specific business process. The business area template often comes with a data model that provides a customizable data warehouse design.
Another use of the common business model is to aid in the deployment of business intelligence tools and analytic applications. Many of the more sophisticated products use a set of business views, or a business meta data layer, to simplify access to warehouse information by business users. The creation and maintenance of this meta data layer is often complex and time- consuming. The common business model can help in the creation of this layer since it documents both business intelligence requirements and also how those requirements are satisfied by information in the data warehousing system.
When a new independent data mart is built, developers typically create a new suite of data extraction and transformation applications that are rarely integrated with applications used for building other independent data marts. The net result is that as the use of independent data marts increases so do the numbers of extraction and transformation routines.
Maintaining extract and transformation routines is resource-intensive, and coordinating their execution at runtime is an operational nightmare. The solution to this problem is to break the processing into multiple steps. This involves developing a set of routines that extract the source data and load it into shared information staging areas. The detailed data in the staging areas is then used to feed data into independent data marts. As data flows out of the staging areas, it is enhanced and mapped by ETL tools into the format required by the target warehouse information store.
As new data marts are added to the data warehousing system, existing extraction routines and staging area data can be reused or enhanced as required. This technique works very well in a federated data warehouse environment where the common business model can be used in the design of the staging areas and data extraction routines.
The use of staging areas can also help in the task of fixing source data quality problems. Many organizations are finding that ETL tools cannot always deal with many of the quality problems that often occur in source data. To solve this issue, data profiling tools can be used to analyze and identify problem areas in source data prior to ETL tool processing. If data profiling identifies quality problems in the source data, then data reengineering tools can be used to fix source data after it has been extracted into the staging areas. Clean and consistent staging area data is then used to populate the various information stores that make up the data warehousing system.
Data profiling and warehouse design tools can be used in a series of iterative steps to develop the design of the staging area and to identify the rules for mapping, cleansing and transforming the source data into the format required by the staging area. The basic design elements of the staging area are business components, which provide a business view of the detailed data in a source system. An example of a business component is a purchase order that represents data stored in multiple source system data files.
Using data profiling and data reengineering tools ahead of ETL tool processing reduces the likelihood of data quality problems occurring in the data warehouse, and can dramatically improve project success rates and the return on investment of data warehousing projects. This can be thought of as analysis-led, rather than ETL- driven data warehouse development.1 It should be realized, however, that although the analysis-led approach is a significant step forward from current techniques, it is not a panacea and is not a substitute for addressing data quality problems in the source systems.
1 An Analysis-Led Approach to Data Warehouse Design and Development. DataBase Associates. January 2000
Colin White is the president of DataBase Associates and the founder of Intelligent Business Strategies. White specializes in business intelligence, corporate portals and intelligent e-business. With more than 30 years of IT experience, he has consulted for dozens of companies throughout the world and is a frequent speaker at leading IT events. White has coauthored several books and written numerous articles on business intelligence, and database and Web technology for leading IT trade journals. White may be reached at cwhite@databaseassociates.com.
© Copyright 2002, Thomson Media