Enterprise-Scalable Data Marts:
A New Strategy for Building and Deploying Fast,
Scalable Data Warehousing Systems
Data Warehousing Today: The Expectation Gap
Data warehousing has clearly captured the imagination, and focused the attention, of many of the world’s largest organizations. Under ever-increasing pressure to reinvigorate their competitiveness, seize new business opportunities and wring more from existing lines of business, "Global 2000" companies have shown an almost zealous eagerness to incorporate some form of data warehousing into their current IT strategies. According to the market research firm META Group, 90% of the Global 2000—a term used to describe the world’s largest 2000 companies—are today attempting some form of data warehousing project. And from the data being reported by those who have studied the experiences of these "early adopters," most are having a rather rough go of it. According to Earl Hadden & Associates, 85% of current data warehouse projects fail to meet organizational objectives, while 40% fail completely.
Does this mean that data warehousing has been oversold? Far from it. The need to harness disparate stores of corporate data and give managers ready access to the information required for effective decision-making is greater today than at any time in recent memory. In a recent industry survey, CEOs were asked to rate their satisfaction with the information their managers have to make decisions. On a scale of one to ten, with ten being the highest score, the average response was two.
The need is clearly there. And data warehousing appears to show great promise as a solution, but there is a disconnect today between what companies expect to achieve from their warehousing projects and the actual results of their efforts. Here is what they should expect: the average centralized data warehouse takes about three years to build and costs $3-5 million. With numbers like these, it stands to reason that only companies with deep pockets have been able to undertake the risk.
But what about smaller companies? Or departments within larger organizations? With access to timely data such a critical requirement for sound decision making, why shouldn’t data warehousing be within the reach of every company and every workgroup? And why should line-of-business management—so pressured to produce results—continue to be beholden to technical staff to get at the information that could mean the difference between success and unemployment? Clearly, there has got to be a simpler, faster and less expensive way to implement data warehousing.
A New Approach: Enterprise-Scalable Data Marts
There is. It’s called the enterprise-scalable data mart—a highly focused, line-of-business data warehouse that puts targeted information into the hands of decision makers. Data marts are data warehouses, but unlike the highly centralized, multiple line-of-business warehouses developed in the past, data marts are single subject (e.g. sales, finance), fast and easy to implement. They offer organizations greater flexibility, without sacrificing the all-important requirement of enterprise scalability. In contrast to the COBOL code-generation approach required by many of the traditional data warehousing solutions, today’s "second-generation" data mart technology propagates data from either centralized data stores or operational transaction databases using a real-time extract/transform/load engine, and features highly graphical client/server tools, integrated metadata (the critical "card catalog" that provides a universal overview of how and where data is stored) and innovative, centralized management capabilities.
The results are nothing short of astounding—data warehousing systems built and managed at a fraction of the cost of centralized implementations; completed in months, instead of years; with little or no consulting services required.
This white paper examines in depth the enterprise-scalable data mart architecture, a technology poised to drastically reshape the economics and application of data warehousing. Included is a comparison of the features and benefits of the data mart architecture with those of a legacy data warehouse, as well as a brief overview of the PowerMart suite, the integrated set of software tools from Informatica Corp. for building, deploying and managing enterprise-scalable data mart solutions.
The Corporate Data Warehouse: Promise Versus Reality
In the past, corporate data warehouse implementations have been mainframe-centric, enterprise-wide in scope, and top-down driven by IT management. Under this architecture, everything is built at the enterprise level—one common data warehouse holding all of the decision support information for all user requirements. A separate database is maintained for the company’s operational systems.
These data warehouse implementations must deal with enormously complex issues mandated by their enterprise-wide nature. Everything must go through multiple levels of agreement because the data warehouse is designed for everyone’s use. The impact of a single change is enormous, because changing one thing changes everything. Companies must develop complex structures of committees and intertwining responsibilities to manage changes; too often, the net result is that it becomes impossible to make changes efficiently and backlogs develop.
It is not surprising, then, that centralized, corporate-wide data warehousing initiatives have experienced a high rate of failure. A recent report by Forrester Research’s Software Strategy Service offers a damning perspective:
"At many companies, the MIS backlog has ballooned to an infuriating 30-45 months. Not surprisingly, as much as 70% of the undone projects is stuff that MIS hates doing: building queries and reports against operational data. MIS had hoped that centrally managed global warehouses would solve the problem—but they don’t. The centralized data warehouses often become ‘databasements,’ unused and ignored by end users. They provide only big, expensive piles of homogenized, untargeted data that are tough to navigate, quickly get out of sync with production systems, and fail to address many end users’ needs." Source: Forrester Research
The poor record of success in data warehousing to date is attributable not only to the inherent complexity of corporate data warehouse architectures, but also to the lack of appropriate development tools. Until recently, these tools have been largely mainframe derived, and not suited to development in client/server environments. In addition, mainframe-centric data warehousing tools have been expensive, inflexible, slow, highly complex to learn and use, and lacking in key data management capabilities such as data transformation and metadata generation and handling.
Second Generation Data Warehousing Tools
Fortunately, data warehousing technologies are rapidly changing and developing. Companies like Informatica have emerged, offering next-generation products and technologies that make data warehousing faster and more economical to design, deploy and manage. These technologies are designed to more fully exploit the benefits of client/server technology, taking advantage of newer distributed platforms such as Windows NT, and UNIX-based client/server platforms. The vendors offering these new technologies are fueling the rising interest in data marts, a more focused and cost-effective approach to data warehousing.
The market for data mart software has rapidly grown to roughly $153 million today, and is expected to expand some 350 percent, to $690 million, in 1998. This dramatic rate of growth reflects an awareness by more and more corporations that have tried to tackle large-scale, enterprise data warehousing projects, that a shift to more manageable data mart implementations is warranted.
Enterprise-Scalable Data Marts: A More Productive Model
The principal distinctions between a corporate data warehouse and a data mart are architectural. A data mart architecture has the same essential characteristics as a corporate data warehouse, but instead of taking an enterprise-wide, multiple line-of-business approach, a data mart focuses on a single subject area, often an individual department such as sales or finance. Data marts also include enterprise management capabilities and integrated tools for viewing and organizing metadata.
Because they contain much smaller volumes of data than the corporate data warehouse, individual data marts are also easier to set up and manage. And building a data mart is a faster, more interactive and iterative process, because the data mart is driven by solving user problems.
It is important to note that data marts need not be restricted in size. Again, the differences lie in the architecture, not the size of the project. Data marts are easily scalable in size, from personal to workgroup to departmental to corporate. They are also inherently enterprise-scalable; i.e., multiple data marts, scalable to the requirements of each application, may be rapidly deployed across the entire enterprise to provide a total data warehousing solution.
Because data mart solutions are inherently distributed and more efficient to implement, they are considerably less expensive to deploy. Whereas a centralized data warehouse for a Global 2000 company might cost anywhere from $3-5 million, a departmental data mart can typically be built for $100-150K, and a workgroup-level data mart for around $40K. The data mart handles less data and runs on more cost-effective platforms. In addition, ongoing costs for software, training, maintenance and support are substantially lower. Even ten data marts cost less than a single legacy data warehousing implementation, and they can be implemented more quickly.
For example, a company might concurrently build a series of data marts—say for human resources, sales, marketing, and finance—each taking three to six months. So, in 9-12 months, for around $1.5 million, that company has improved its processes across the board, and experienced a much faster return on investment.
Data Mart Benefits
The data mart architecture offers several benefits due to its scalability and incremental approach. For example, an ad hoc data mart can be quickly built to address a pressing business need. At the extreme, a "quick-strike data mart" can be built to answer a strategic question. This flexibility and simplicity stand in sharp contrast to the complexity inherent in attempting to address a wide range of often conflicting user needs in building a corporate data warehouse. Business problems and needs change faster than corporate data warehouses can react to them. Two years of development is too late in fast-moving competitive situations.
End users benefit from data that is customized and focused toward their needs. And data marts, because they enable development and administration at a departmental level, reduce the support burden on the corporate IS staff.
Who is Buying Data Marts?
The ideal candidate for a data mart solution is a workgroup or department within a large corporation. The data mart is likely to serve fewer than 50 users initially, but is likely to grow to more than 100 users in the future. Data mart size is typically less than 50 GB, but may well grow to exceed 100 GB. The data normally comes from a restricted set of sources.
Individual data marts are usually line-of-business driven, targeted to solve particular business problems. Since most data marts reflect strong business urgency, they must be implemented quickly and at low cost; typically the data currency requirements call for frequent, often daily, refreshment.
Many data mart implementations are done by global companies; these companies often grow by acquisition and have decentralized or conglomerate structures that lend themselves naturally to the incremental enterprise-scalable data mart architecture.
Finding The Right Tools
Until recently, companies trying to build data warehousing solutions have had few viable options. Many of the initial data warehousing projects—the ones failing to meet users’ objectives 85 percent of the time—have relied on 3GL and 4GL development tools. Essentially, they manually developed a one-off piece of software that would build and manage a data mart. Although they are powerful in other application development contexts, these tools are not data warehousing-specific, and programmers must climb a steep learning curve to adapt them to data warehousing functionality.
Importantly, this manual approach cannot generate metadata. The metadata essential to managing the data warehouse must be laboriously added after the fact, adding substantially to the time and cost of deployment and increasing the possibility of creating flawed metadata. And because 3GL/4GL code is not naturally suited to the data warehousing process, personnel turnovers can be disastrous when knowledgeable and experienced developers leave the company.
The first generation of data warehousing tools, code generators from companies such as Prism, Carlton and ETI, represent a step forward from manual 3GL/4GL approaches, but they are still incomplete. These tool sets do not really match up to the complete process of building a data warehouse, although they are more "aware" of data warehousing issues. While they are somewhat data warehousing-capable, they tend to be very complex, difficult to use and expensive. In addition, they are not fully integrated, and as is the case with 3GL/4GL tools, they require that the metadata be retrofitted. For data warehouses built with those code generators, ongoing management and maintenance are major cost elements.
Informatica Corp. is credited with leading the second generation of data warehousing tools. "They have radically changed the data warehouse market. The approach prior to theirs was IT driven, high priced, and mainframe centric. It was a '70s or '80s approach. Informatica has shattered the price myth and broken the mold." (META Group) Informatica has devised a four-phase approach to building data mart solutions called the Data Mart Lifecycle, which applies equally well to legacy-based data warehouses.
Until recently, there have been no integrated tools for building client/server based data marts. What has been lacking is an integrated product suite with a repository—a tool set that automates the programming needed to match the production cycle and that generates metadata automatically. These data mart development tools must also address the critical areas of ongoing management and enterprise scalability. With its PowerMart suite, Informatica is leading the way to provide that needed data mart solution.
The PowerMart Suite: An Overview
With the PowerMart suite, a user can create a data mart prototype in days. Then, by using an iterative design methodology to provide enhancements, a production data mart can be created in weeks. This is accomplished through the product’s inherent ease of use and its client/server architecture. Windows drag-and-drop capabilities and wizards are utilized throughout the various end user tools, easing the learning curve and increasing productivity. There is no proprietary fourth generation language to script the mappings and transformations. Instead, the PowerMart suite relies on a visual interface combined with SQL-enhanced transformation expressions. In addition, users can also leverage native database scripting languages, such as Oracle’s PL/SQL and Sybase’s Transact-SQL.
Users interact with the PowerMart Server through the Informatica PowerMart suite’s client software tools running on Microsoft Windows: PowerMart Designer, PowerMart Server Manager, and the PowerMart Repository Manager/Metadata Browser. With these tools, users have at their fingertips all the resources required to create and manage an enterprise-scalable data mart. The typical steps followed to create and administer the data mart include setting up the PowerMart Repository, creating the data mart design, initiating the loads, managing the PowerMart Repository and administering the warehouse.
The Informatica PowerMart suite applies an automated approach to arrive at a prototype data mart very quickly. This is then refined in the light of real data and user experiences to build an optimal design. By using integrated PowerMart components which share metadata through a common repository, the extraction, transformation and population of the data mart are greatly simplified. Rapid deployment enables a quick and cost-effective decision support solution.
The Informatica PowerMart suite comprises four components, which facilitate the design and management of enterprise-scalable data marts. They include:
Informatica PowerMart Designer, the tool for visually defining mappings and transformations;
Informatica Repository, the open metadata store for definitions about the mappings, transformations, and data marts;
Informatica Server Manager, the tool for configuring and monitoring the warehousing process;
Informatica PowerMart Server, the multithreaded server-based engine that maps and transforms data from the operational system, and then loads it into target data marts.
An optional, complementary product is Informatica PowerCapture, which captures changes incrementally as they occur in the operational systems, and feeds those changes into the PowerMart Server engine.
PowerMart Designer is made up of three integrated modules: Source Analyzer, Warehouse Designer, and Transformation Designer.
Source Analyzer--The Source Analyzer reads schema information from the operational systems in order to determine the formats and structure of the operational database. It analyzes RDBMS catalogs, as well as COBOL source code and database definition scripts (e.g. IMS DBDGENs), to extract data structure information. The schema information extracted from the operational database is stored in the Informatica Repository and is used by other PowerMart components.
Warehouse Designer--The Warehouse Designer is used to design the data mart model. A major feature of the Warehouse Designer is the Star Schema Wizard™, which allows designs to be made rapidly by leveraging the operational schema design and stepping the user through the creation of the warehouse design. Often, a star schema is employed in order to provide high performance for queries running against the data mart. A simple star schema consists of one fact table and several dimension tables.
The schema is presented using a visual interface. If necessary, the created data mart database design can be modified further. PowerMart provides dialog boxes and drag-and-drop physical table design capabilities to make these modifications. Indexes and additional table constraints can also be specified. Once the modifications are completed, the Warehouse Designer generates the necessary SQL Data Definition Language native to the target database and creates the database tables, indexes, and constraints. The Warehouse Designer also allows versioning of the warehouse designs, thus enabling several versions of the warehouse design to reside in the PowerMart Repository. For example, a production and prototype design can be simultaneously maintained.
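To make the fact/dimension structure concrete, here is a minimal sketch of a star schema and a typical query against it, using Python's built-in sqlite3 as a stand-in database. The table and column names are illustrative assumptions, not schemas generated by PowerMart.

```python
import sqlite3

# One fact table keyed to two dimension tables -- a simple star schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT
);
CREATE TABLE dim_date (
    date_key       INTEGER PRIMARY KEY,
    calendar_date  TEXT
);
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    units_sold   INTEGER,
    revenue      REAL
);
-- An index on the foreign key speeds up star-join queries.
CREATE INDEX idx_fact_product ON fact_sales(product_key);
""")

cur.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
cur.execute("INSERT INTO dim_date VALUES (19970301, '1997-03-01')")
cur.execute("INSERT INTO fact_sales VALUES (1, 19970301, 40, 1000.0)")

# A typical decision-support query: join the fact table to a dimension
# and aggregate the measures.
cur.execute("""
    SELECT p.product_name, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p USING (product_key)
    GROUP BY p.product_name
""")
print(cur.fetchall())  # [('Widget', 1000.0)]
```

Queries filter and group by the small dimension tables while the measures live in the single large fact table, which is what makes the star shape fast to scan and easy for end users to navigate.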
Transformation Designer--The Transformation Designer visually displays the operational database-to-data mart mapping that is automatically created by PowerMart. The mappings can be modified or deleted using drag-and-drop techniques. Each structure and field in the operational database that contributes data to the data mart is shown here, as are the target table(s) and field(s) in the data mart. The flow of data can be conditional and modified during the translation. This is especially important when creating aggregate and summary tables, as well as computed tables that contain data from one or more operational tables.
PowerMart Server Manager
The Informatica Server Manager is responsible for configuring, scheduling and monitoring the Informatica Server. It operates in one of three modes: Session Configuration, Session Monitoring and PowerCapture Configuration.
The PowerMart Repository is the metadata integration hub of the Informatica PowerMart suite. The metadata information stored in the PowerMart Repository is utilized and viewed by the Repository Manager and the Metadata Browser.
Repository Manager--The Repository Manager is used to create and maintain the PowerMart Repository and the metadata that is contained within it. Within the Repository there are different levels of detail:
Subject Areas--A high level logical grouping of data covering various subject areas within a company. Examples are customer, vendor, product, and human resources.
Mappings--Within each subject area there are one or more mappings. Mappings are the business rules that the source data must go through before being populated in the data warehouse. Examples are SUM(:tbl1:fieldA), and (:tbl2:fieldA * :tbl3:fieldB) / 2.
Objects--Within each subject area there are one or more objects. Objects are the actual detail that the typical user will interact with. Examples are tables, fields, and mappings.
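The mapping examples above can be sketched in executable form. The following is an illustrative Python sketch of applying repository-style mappings to source rows; the expressions are written in plain Python rather than PowerMart's :tbl:field notation, and all field names are hypothetical.

```python
# Source rows as they might be read from an operational table.
source_rows = [
    {"fieldA": 10.0, "fieldB": 4.0},
    {"fieldA": 20.0, "fieldB": 6.0},
]

# Field-level mappings: target column -> business rule applied per row,
# analogous to (:tbl2:fieldA * :tbl3:fieldB) / 2 in the text.
field_mappings = {
    "avg_ab": lambda r: (r["fieldA"] + r["fieldB"]) / 2,
}

target_rows = [
    {col: rule(row) for col, rule in field_mappings.items()}
    for row in source_rows
]

# A table-level aggregate mapping, analogous to SUM(:tbl1:fieldA).
total_a = sum(r["fieldA"] for r in source_rows)

print(target_rows)  # [{'avg_ab': 7.0}, {'avg_ab': 13.0}]
print(total_a)      # 30.0
```

Keeping the rules in one mapping table, rather than scattered through hand-written 3GL code, is what lets the repository describe and regenerate the transformation logic as metadata.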
Metadata Browser--The Metadata Browser is used to browse and view the metadata in the Repository. The objects in the Repository are displayed in a graphical tree structure.
Informatica PowerMart Server
The workhorse component of PowerMart is Informatica PowerMart Server, which runs on a UNIX or Windows NT platform. It consists of an Extractor, Transformation Engine and Loader. Each of these three processes operates independently, taking full advantage of hardware server parallelism for high data throughput and scalability.
Extractor--The Extractor consists of three sub-components. The File Reader is used to read sequential files which have been offloaded from IMS and VSAM sources. The Relational Reader extracts data directly from the supported RDBMSs: Informix, Oracle, MS SQL Server, and Sybase. The last component is the PowerCapture Reader, which is used to extract all of the changes captured from the following databases: DB2, Oracle, and Sybase.
Transformation Engine--Once the operational data is brought into the Informatica PowerMart Server, the Transformation Engine performs on-the-fly transformations as the data is read from the operational database and propagated to the data mart. Samples of transformations include expressions, field level conditions, table level conditions and summary creation.
Loader--The Loader efficiently writes the transformed data directly into the data mart. It utilizes the bulk load capabilities in the database if available, otherwise it uses an array load mechanism. While propagating the data, the Loader controls the commit points of the data, synchronizes the Extractor to manage the load as a unit of work, and outputs the rejected records to an exception file.
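The Loader's behavior can be sketched in a few lines: batch rows, commit at fixed intervals, and divert bad records to an exception list instead of aborting the load. This is an illustrative Python sketch of the array-load path under assumed interfaces; the function and field names are hypothetical, not PowerMart APIs.

```python
def load_rows(rows, insert, commit, batch_size=2):
    """Load rows in batches, committing at fixed intervals and
    collecting rejected records instead of aborting the load."""
    rejects, batch = [], []
    for row in rows:
        try:
            insert(row)          # may raise on a bad record
            batch.append(row)
        except ValueError:
            rejects.append(row)  # rejected record -> exception file
            continue
        if len(batch) >= batch_size:
            commit()             # commit point
            batch.clear()
    if batch:
        commit()                 # final commit closes the unit of work
    return rejects

# Exercise the sketch with an in-memory "target" and a commit log.
loaded, commits = [], []

def insert(row):
    if row.get("units") is None:
        raise ValueError("null units")  # simulate a constraint failure
    loaded.append(row)

rejects = load_rows(
    [{"units": 1}, {"units": None}, {"units": 3}, {"units": 4}],
    insert,
    lambda: commits.append(len(loaded)),
)
print(len(loaded), len(rejects), commits)  # 3 1 [2, 3]
```

Controlling commit points this way keeps the load restartable as a unit of work, while the exception list gives operators a record of everything the target database refused.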
Informatica PowerCapture allows a data mart to be incrementally refreshed with changes occurring in the operational system, either as they occur (for critical real-time warehousing), or on a scheduled basis. This is accomplished by using the log mechanisms in the operational database to extract the information. This ensures that the on-line performance of the operational system is not adversely affected. PowerMart has the ability to extract only the changes that have occurred to the operational sources. This is known as PowerCapture Configuration.
For example, take an operational sales table that contains 1,000,000 records and grows at a rate of 10,000 records a week. If a complete refresh of the warehouse table were done on a weekly basis, 1,010,000 records would need to be propagated into the warehouse after the first week. If PowerCapture were employed, only 10,000 records would be populated into the data warehouse. In essence, the amount of throughput into the data warehouse is reduced enormously by leveraging this capability.
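The mechanism behind that saving can be sketched simply: rather than re-reading the whole source table, propagate only the log entries newer than the last checkpoint. This is an illustrative Python sketch of log-based change capture; the log-entry fields (lsn, op) are assumptions for the example, not the actual format of any database log.

```python
# A miniature database change log: each entry records an operation
# and its log sequence number (LSN).
source_log = [
    {"lsn": 1, "op": "insert", "row": {"id": 1}},
    {"lsn": 2, "op": "insert", "row": {"id": 2}},
    {"lsn": 3, "op": "update", "row": {"id": 1}},
]

# Highest LSN already applied to the data mart on the last refresh.
last_checkpoint = 1

# Incremental refresh: propagate only the changes past the checkpoint,
# never re-reading the base table itself.
changes = [e for e in source_log if e["lsn"] > last_checkpoint]
print(len(changes))  # 2 entries to propagate, not the whole table
```

Because only the log is read, the operational tables are never scanned during refresh, which is why on-line performance of the source system is unaffected.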
An Open Architecture
Designed to operate in open systems environments, the PowerMart suite is built around an open architecture. Informatica provides an open API designed to enable third parties to write aftermarket products to work with the PowerMart suite. Unlike refurbished code generators that have their origins in mainframe computing, Informatica’s PowerMart suite has been built from the ground up to support easy integration with leading relational and multidimensional databases, popular end-user access and query tools, and other third-party software.
PowerMart’s published, open APIs allow third-party products to access and share Informatica’s metadata, the critical "card catalog" that provides a universal overview of how and where data is stored. With the APIs, users of leading decision support tools, like BusinessObjects, and popular desktop software, such as Microsoft Access or Excel, can easily find critical business information using PowerMart metadata as a handy roadmap.
Informatica PowerMart uses leading relational databases as the repository for storing and managing warehouse data, as well as metadata. Target databases include Oracle, Sybase, Informix, and Microsoft SQL Server.
Another example of a third-party vendor who integrates with PowerMart’s open APIs is Business Objects, a leading provider of enterprise decision support software. Business Objects pioneered the market for business-intelligent decision support tools in 1990 with the introduction of its flagship product, BusinessObjects. To date, BusinessObjects has been licensed in more than 40 countries to over 2,600 customer sites and 130,000 users worldwide. Through this integration with the PowerMart suite, a user can seamlessly view PowerMart metadata through a BusinessObjects Universe.
Enabling the Enterprise-Scalable Data Mart: The Vendor Challenge
In order to maintain and increase the present momentum for the adoption of data mart solutions by corporate users, software vendors must focus precisely on their customers’ needs. The industry must work to ensure viable standards that will enable a pool of reliable software tools and give customers good choices for developing and deploying enterprise-scalable data marts.
These tools, at a minimum, must provide integration, graphical simplicity and the level of user friendliness companies have become accustomed to with client/server products. The tools must also enable just-in-time data delivery, giving customers timely information in easily understandable formats. The new software technology must support the full data mart development process, from design to extraction and transformation, with integrated metadata as the foundation for complete management and maintenance capabilities. The proof of those qualities will be reflected in the scalability of the resulting data marts: how quickly and cost-effectively companies can replicate them across the enterprise.
Today, Informatica is leading the way in providing tools that make it fast and easy to develop enterprise-scalable data mart solutions. The PowerMart suite is designed specifically for rapid deployment, broad scalability and comprehensive manageability of data marts. It supports all of the major processes involved in building a data mart, integrating its complex functionality through a consistent GUI for all products.
The enterprise-scalable data mart is an idea whose time has come.
Informatica’s goal is to give substance to that idea.