Introduction to Data Warehousing

A Data warehouse turns raw information into a useful analytical tool for business decision making. The fundamental question that gets addressed in about 1000 firms is: “Which customers are buying or using what products, services, when and where?”. If you know the answer to that question, then you are able to guide your business strategically.

Transaction processing systems can also play a strategic role in gaining competitive advantages for a business. Many firms are using the internet, extranets and other networks that tie them electronically to their customers or suppliers for real time or online transaction processing (OLTP). It is also a real time transaction processing system.

Companies often keep raw information in online transaction processing systems, which track day to day operations such as- each sale, purchase and inventory change. But OLTP systems are not well suited for answering questions that affect the past, present and future direction for a business question like: What are the historical trends in unit costs vs. growth in sales to customers in Orissa?

To answer those kinds of questions, a company needs an analysis system with the ability to perform ad hoc queries and create specialized reports. The raw material for analysis is a combined view of all the relevant data a company has – a data warehouse. The warehouse stores information from OLTP systems and other sources of raw data are external systems.

Metadata blueprints: Because information is coming from many sources, each with its own view of the data, a company must create an enterprise level data model to have a consistent view of its information. This metadata is the blueprint for the pieces of the data warehouse architecture.

The process of transforming raw data into a data warehouse involves several steps: extraction, consolidation, filtering, cleansing, conversion and aggregation. This process is collectively known as data warehouse generation. Generation is at the heart of the warehouse infrastructure, most of the efforts in a data warehouse project is spent on this process. Doing it right means the difference between finding answers that are valuable and answers that are useless. Here is what each of the steps involves:

Extraction – This step involves taking the data out of its original database and transforming it to the data warehouse infrastructure. Companies often place restrictions on what is extracted. For example, the extraction process may occur every day, so any changes to the raw data sources older than 24 hours are ignored.

Consolidation – It is the process of combining data from several sources into one database. To get a complete view of a customer, a company may consolidate data from older entry systems, sales contact database and technical support databases.

Filtering - Not every piece of data is needed. For example, a company may want to know which products customers have ordered but do not need the confirmation number used to process a sale. Filtering picks out the relevant data and removes duplicate entries.

Cleansing - The quality of an answer is only as good as the quality of data used to derive that answer, so it’s important to cleanse the data to improve the accuracy of the data in the warehouse. The classic example of poor quality data is a single customer with multiple entries. It requires some intelligence on part of the cleansing software to identify and correct such data.

Conversion – Conversion is also called translation, means mapping raw data onto new data fields within the warehouse data model and translating the data into the format used by the warehouse. For example, the original supplier data may count widgets by the gross, while manufacturing data tracks widgets individually. In a data warehouse, the units of measure must be same in order to get useful analysis results.

Aggregation – Often the value of a data warehouse is in the summarized data and derived data it contains, as opposed to the raw data stored in OLTP sources. The aggregation step sorts and combines data into useful metrics for analysis. For example, while the raw data may track individual orders by individual customers, a more useful measure of sales might be orders of a particular product family. The aggregation process generates these new calculated sales numbers.

It is a large centralized storage area such as a database. When an organization centralizes the storage off data, it is said to be putting the data into a data warehouse. These warehouses contain millions of pieces of information about customer behaviour and demographics, and they are starting to contain information about other personal traits and behaviours. Even though the phase ‘warehouse’ implies that the data is shelved, infact the storage spaces are often nothing more than the servers that contain large arrays of disk for storing information.

The scalable data warehouse framework is a complete view of data warehousing. To grasp data warehousing fully, it is important to understand the data warehousing process. Data warehouses, in essence, store and access data supplied by OSSs (Open System Specification) and provide data for tools and applications.

The source data that will populate the data warehouse solution will come from OSSs. There are two major categories of operation systems: network systems, including planning and engineering, provisioning, network management and trouble/repair and business operations systems, including customer care, billing and directions.

A data warehouse will deliver network planning and analytical processes to support network asset optimization, provide cross-geographical views of critical network data and deliver the integration of both customer and network data to improve and enhance profitability. This solution will help communications service providers manage the evolution of their networks so that it is robust enough to support new, diverse services in a cost effective manner while addressing capacity and planning issues for basic network services.

As one might expect, the heart of a data warehouse is the data itself. The type and quantity of the data stored changes almost on a daily basis. Change is inherent to the data warehouse. This data changes as the business grows, as operational data changes, as the business question change, as the number of users changes and as the applications used to access the data change.

If a warehouse is built around a database engine that cannot handle its dynamic nature, failure is inevitable. This failure could force users to limit the questions they ask, result in a long delay for the return of queried data, and ultimately require the construction of separate, independent systems. Finally, a failure could necessitate the use of summarized data instead of detailed data and providing detailed data is quite possibly the most important role of the data warehouse. The capture of details data enables businesses to drill down and perform analysis on subject oriented data and get insight into the entire business. Most importantly businesses will learn from their mistakes and successes.

Data Marts and Operational Data Stores

A data mart is specialized set of business information focusing on a particular aspect of the enterprise, such as a department or business process. The information is a data mart often comes from several different raw data systems. Many companies choose to feed a data mart from a data warehouse because the in the warehouse has already been consolidated and processed from the same raw data.

An operational data store is a hybrid of an OLTP system and an analytical system. It contains information that is frequently updated in an ad hoc basis, often in response to changes in the OLTP system, as opposed to the scheduled updates of a data warehouse. The data within an operational store mirrors some of the data within the OLTP system. It has been extracted from the OLTP system and transformed and aggregated to a limited extent. Its purpose is to provide an operational level query system that won’t affect the performance of the raw OLTP systems. Databases used for transaction processing are designed to update thousands of records per second but are not designed for sophisticated querying. Data warehouse databases are designed to analyse terabytes of data and billions of records. They are organized to better allow analysis using special techniques.

Online Transaction processing involves real time transactions. It has been recognized that this data combined with current data, contains an enormous amount of information from which one can discover trends that would never be seen on a day-to-day or month-to-month basis. To make this data more useful, it is now being stored in a separate database called a data warehouse.

Written by Sameer.
In category New Technologies.
14 years ago.

Hot Hits: 1994

Like it on Facebook, Tweet it or share this article on other bookmarking websites.

Introduction to Data Warehousing

Data Marts and Operational Data Stores

No comments

Random Poll

Who will be our next P.M. after 2019 election?