Lecture 12 – Data Warehousing and OLAP for Data Mining

Lecture 12 explains Data Warehousing & OLAP for Data Mining, covering architecture, ETL processes, star/snowflake schema, multidimensional models, OLAP operations, cubes, case studies, and real-world applications.

Data Mining is only effective when the underlying data is integrated, clean, consistent, and well-organized. In enterprise environments, data comes from dozens of systems billing, CRM, ERP, social media, apps, IoT sensors, and external sources.
To analyze this information efficiently, organizations use data warehouses and OLAP systems.

This lecture provides a deep understanding of:

  • Data warehouse architectures
  • ETL pipelines
  • Multidimensional cubes
  • OLAP operations
  • Schema design
  • Industry use-cases

It prepares students to handle real-world analytical systems used by banks, hospitals, retail giants, governments, and tech companies.

Introduction to Data Warehousing

Why Data Warehouses Exist

Operational databases (OLTP) are designed for transactions, not analytics.
Example OLTP tasks:

  • Insert new orders
  • Update stocks
  • Process payments

But they cannot efficiently handle:

  • Trend analysis
  • Sales comparisons
  • Seasonal behavior
  • Multi-year predictions

That’s why Data Warehouses are built.

Differences Between OLTP & OLAP

FeatureOLTPOLAP
PurposeDaily operationsAnalysis & Decision Making
QueriesSimple, shortComplex, long
SpeedHighMedium
Data TypeCurrentHistorical
UsersClerks, CashiersAnalysts, Managers
ExampleATM systemBusiness analytics dashboard

Data Warehouse Architecture

Single-Tier Architecture

Goal: Minimize redundancy by storing everything in one place.
Not commonly used.

Two-Tier Architecture

Client ↔ Server
Used for small-scale warehouses.

Three-Tier Architecture (Most Common)

Tiers:
  1. Bottom Tier (Data Warehouse Server)
    • Stores integrated warehouse data
  2. Middle Tier (OLAP Server)
    • MOLAP
    • ROLAP
    • HOLAP
  3. Top Tier (Front-End Tools)
    • Dashboards
    • Analytics
    • Reports

Components of a Data Warehouse

Data Sources

Examples:

  • ERP systems
  • CRM
  • Web logs
  • IoT sensors
  • Social media

ETL (Extract, Transform, Load)

Extract

Pull data from source systems.

Transform

Clean, standardize, convert, merge.

Load

Insert into warehouse.

Text-based ETL Diagram
Source → Extract → Transform → Load → Warehouse → OLAP → Reports
Staging Area

Temporary storage for transformation before loading.

Metadata Repository

Stores:

  • Data descriptions
  • Business rules
  • Table mappings
Data Marts

Small, department-specific warehouses.
Examples:

  • Marketing mart
  • Finance mart
  • Sales mart

Lecture 11 – Ethical Legal and Privacy Issues in Data Mining

Multidimensional Data Model

A warehouse organizes data using cubes.

Cubes

A cube is a multidimensional dataset used for analysis.

Example Cube Dimensions:
  • Time
  • Product
  • Region
Dimensions

Attributes used for analysis.

Example:

Time → Year, Quarter, Month, Day
Product → Category, Brand, Item
Region → Country, State, City
Measures

Numerical data for analysis:

  • Total sales
  • Profit
  • Quantity

Fact Table & Dimension Table

Fact Table:
  • Contains measures
  • Example: sales_amount, units_sold
Dimension Tables:
  • Describe business entities
  • Example: dim_product, dim_customer

IBM Data Warehousing Guides

Star Schema vs Snowflake Schema

Star Schema
  • Central fact table
  • Surrounding dimension tables
  • Simple and fast
Diagram:
         dim_customer
               |
dim_time — fact_sales — dim_product
               |
         dim_region
Snowflake Schema

Dimensions broken into sub-dimensions.

Example:

dim_location → country → state → city

More normalized but slower.

OLAP: Online Analytical Processing

OLAP allows fast analysis using cubes.

MOLAP
  • Multi-dimensional OLAP
  • Data stored in optimized cube format
  • Fastest queries
ROLAP
  • Relational OLAP
  • Uses relational DB tables
  • Scalable but slower
HOLAP
  • Hybrid OLAP
  • Combines MOLAP + ROLAP

OLAP Operations

Roll-Up (Aggregation)

Going from detailed data → summarized data.

Example:

Day → Month → Year
City → Country
Drill-Down (Reverse of Roll-Up)

Going from summarized → detailed.

Example:

Year → Quarter → Month → Day
Slice

Selecting a single dimension.

Example:

Only 2023 data
Dice

Selecting multiple dimensions.

Example:

2022 & 2023 + Product A + Region Asia
Pivot

Rotating cube view to see data from a new angle.

Warehouse Design Process

Requirements Collection

Meet stakeholders to define analytical needs.

Dimensional Modeling

Design fact & dimension tables.

Building Cubes

Define:

  • Measures
  • Dimensions
  • Aggregations

Case Studies

Retail Industry (Walmart, Amazon)

Used for:

  • Product demand forecasts
  • Store performance
  • Customer segmentation
  • Basket analysis
Banking

Used for:

  • Fraud detection
  • Customer insights
  • Credit scoring
Healthcare

Used for:

  • Patient analytics
  • Disease surveillance
  • Hospital workflow optimization

How Warehousing Supports Data Mining

Improved Data Quality

Cleaning + integration ensures high-quality data.

Efficient Query Processing

Cubes enable fast aggregations.

Clean & Integrated Data

Data mining models perform better.

Summary

Lecture 12 explained the foundation of Data Warehousing & OLAP for Data Mining. Students learned about architectures, ETL, dimensional modeling, star/snowflake schema, cubes, OLAP operations, and industry use cases. This prepares learners to handle large-scale analytical environments and supports future advanced mining projects.

People also ask:

What is a data warehouse?

A centralized storage system for analytical data.

What is OLAP used for?

Fast multi-dimensional analysis.

What is the difference between star and snowflake schema?

Star = simple, denormalized.
Snowflake = normalized, complex.

What is the role of ETL?

To extract, clean, and load data into the warehouse.

Why do we use cubes in OLAP?

For fast multidimensional queries.

Leave a Reply

Your email address will not be published. Required fields are marked *