College of Computing and Informatics

Project

Deadline: Monday 13/2/2023 @ 23:59

[Total Mark is 14]

Student Details:

CRN:

Name:

Name:

Name:

ID:

ID:

ID:

Instructions:

• You must submit two separate copies (one Word file and one PDF file) using the Assignment Template on Blackboard via the allocated folder. These files must not be in compressed format.

• It is your responsibility to check and make sure that you have uploaded both correct files.

• A zero mark will be given if you try to bypass SafeAssign (e.g. misspelling words, removing spaces between words, hiding characters, using different character sets, converting text into images or into languages other than English, or any other kind of manipulation).

• Email submission will not be accepted.

• You are advised to make your work clear and well presented. This includes filling in your information on the cover page.

• You must use this template; failure to do so will result in a zero mark.

• You MUST show all your work, and text must not be converted into an image, unless specified otherwise by the question.

• Late submission will result in a ZERO mark.

• The work should be your own; copying from other students or other resources will result in a ZERO mark.

• Use Times New Roman font for all your answers.

Description and Instructions

The objective of this project is to use the machine learning software Weka to experiment with different data preprocessing techniques (data cleaning, data reduction, normalization…) and different data mining techniques (frequent pattern mining techniques, supervised learning techniques, unsupervised learning techniques) on a selected dataset.

– You should use the same project document to prepare your answer. A Word file and a PDF file should be provided.

– The number of students per project is 3 or 4 (maximum). You should fill in the work distribution table provided at the end of this project.

– Using/copying ideas from previous years will result in a zero mark.

– You should select one of the datasets from any machine learning repository:

• http://archive.ics.uci.edu/ml/

• https://www.kaggle.com/datasets

• …

The dataset should meet the following requirements:

• Number of instances: at least 300

• Number of attributes: at least 10

• The dataset should contain different types of attributes, some null values, some missing values, different scales of attributes…

Note: You can modify the dataset manually by adding some attributes, removing

some values, adding some data discrepancies …

Question One

2 Marks

Learning Outcome(s): 1. Define different data mining tasks, problems and the algorithms most appropriate for addressing them.

1. Prepare a CSV or ARFF format data file of the data. Add a screenshot of the prepared dataset.

2. Load the dataset in Weka.

3. Using Weka, describe in detail:

a. The selected dataset: number of attributes, number of instances,

anomalies in the dataset…

b. Each attribute of the dataset (description, type, range, number of missing

values, min and max for the numeric values,…)

Add screenshots of the exploration of the data.

Question Two

2 Marks

Learning Outcome(s): 1. Define different data mining tasks, problems and the algorithms most appropriate for addressing them.

Apply at least two data preprocessing techniques (data cleaning, data reduction, normalization…) to the selected dataset.

Explain in detail why you chose these techniques, and add screenshots of the data before and after applying each technique.

Question Three

2 Marks

Learning Outcome(s): 3. Employ data mining and data warehousing techniques to solve real-world problems. 4. Choose data mining algorithms for problems they are specifically designed for.

Apply the Apriori algorithm to your dataset with different support and confidence values (at least 3 different scenarios). Discuss the generated rules.

Note:

– You can use only a subset of the attributes of your dataset to apply the Apriori algorithm.

Add screenshots of the output of the algorithm for each scenario.

Choose the best scenario.

Question Four

4 Marks

Learning Outcome(s): 3. Employ data mining and data warehousing techniques to solve real-world problems. 4. Choose data mining algorithms for problems they are specifically designed for.

Apply two supervised machine learning techniques (decision trees, linear regression, SVM, neural networks…) to the selected dataset.

For each selected technique:

– Give a brief description of the technique.

– Provide the results and accuracies of the constructed model for different parameters (at least 2 scenarios) and discuss them with supporting screenshots.

Question Five

4 Marks

Learning Outcome(s): 3. Employ data mining and data warehousing techniques to solve real-world problems. 4. Choose data mining algorithms for problems they are specifically designed for.

Apply two unsupervised machine learning techniques (K-means, hierarchical clustering…) to the selected dataset.

For each selected technique:

– Give a brief description of the technique.

– Provide the results of each technique using different settings (at least 2 scenarios) and discuss them with supporting screenshots.

Work Distribution

Fill in the following table.

Student ID    Student Name    Questions

Saudi Electronic University

26/12/2021

College of Computing and Informatics

Bachelor of Science in Information Technology

IT446 – Data Mining and Data Warehousing

Week 1: Introduction

Contents

1. Saudi Vision 2030

2. Saudi Digital Library

3. Git and GitHub

4. Stack Overflow

Weekly Learning Outcomes

• Understand the major objectives of Saudi Vision 2030

• Demonstrate the Saudi Digital Library and searching for knowledge

Saudi Vision 2030

• Under the leadership of the Custodian of the Two Holy Mosques, Vision

2030 was launched, a roadmap drawn up by His Royal Highness the Crown

Prince, to harness the strengths God bestowed upon us – our strategic

position, investment power and place at the center of Arab and Islamic

worlds. The full attention of the Kingdom, and our Leadership, is on

harnessing our potential to achieve our ambitions.

7

Reference: https://www.vision2030.gov.sa/

Saudi Vision 2030

• Vision 2030 Draws on The Nation’s Intrinsic Strengths

1. Saudi Arabia is the land of the Two Holy Mosques which positions the Kingdom at

the heart of the Arab and Islamic worlds

2. Saudi Arabia is using its investment power to create a more diverse and

sustainable economy

3. The Kingdom is using its strategic location to build its role as an integral driver of

international trade and to connect three continents: Africa, Asia and Europe

8

Reference: https://www.vision2030.gov.sa/


Saudi Digital Library

• The Saudi Digital Library is the largest academic collection of information sources in the Arab world, with more than 310,000 scientific references covering all academic disciplines. Its content is continuously updated, achieving a huge accumulation of knowledge over the long run. The library has contracted with more than 300 global publishers. The library won the Arab Federation for Libraries and Information ("Know") award for outstanding projects in the Arab world in 2010.

10

Reference: https://sdl.edu.sa/sdlportal/en/publishers.aspx

Saudi Digital Library

• Advantages:

• One central management maintains this huge content and keeps it constantly updated.

• Shared access for the common benefit: content acquired by any university becomes available to the other universities, in any scientific field.

• Enhances the status of universities during evaluation for academic accreditation, through rich, modern sources published by the best global publishers.

• Bridges the gap between Saudi universities, as emerging universities can get the same service that the major Saudi universities get.

11

Reference: https://sdl.edu.sa/sdlportal/en/publishers.aspx

Git and GitHub

What is GIT?

• Git

• is a version control system for tracking changes in computer files and

coordinating work on those files among multiple people.

• It is primarily used for source code management in software development and

was initially created by Linus Torvalds for development of the Linux Kernel.

• Git is not Github.

• Git is the version control software

• Github is a git repository hosting service which offers all the source code

management provided in git.

• Github is where you upload your git repository.

13

Reference: https://medium.com/@lucasmaurer/git-gud-the-working-tree-staging-area-and-local-repo-a1f0f4822018

Version control

• Version control is:

• a system that records changes to a file or set of files over time so that you can

recall specific versions later.

• If you are a graphic or web designer and want to keep every version of an

image or layout (which you would most certainly want to), a Version Control

System (VCS) is a very wise thing to use.

• It allows you to revert selected files back to a previous state, revert the entire

project back to a previous state, compare changes over time, see who last

modified something that might be causing a problem, who introduced an issue

and when, and more.

• Using a VCS also generally means that if you lose files, you can easily recover.

14

Reference: https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control

What Is GitHub

• GitHub is

• a for-profit company that offers a cloud-based Git repository hosting

service.

• Essentially, it makes it a lot easier for individuals and teams to use Git for

version control and collaboration.

• GitHub’s interface is user-friendly enough so even novice coders

can take advantage of Git.

15

Reference: https://guides.github.com/activities/hello-world/

Create a Repository

16

Reference: https://guides.github.com/activities/hello-world/

Make and commit changes

17

Reference: https://guides.github.com/activities/hello-world/


Stack overflow

• Stack Overflow is

• a question-and-answer website for professional and enthusiast programmers.

• It is the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood

and Joel Spolsky.

• It features questions and answers on a wide range of topics in computer programming.

• Stack Overflow only accepts questions about programming that are tightly focused on a

specific problem.

• Questions of a broader nature–or those inviting answers that are inherently a matter of

opinion– are usually rejected by the site’s users and marked as closed.

https://stackoverflow.com/

19

Reference: https://en.wikipedia.org/wiki/Stack_Overflow

Thank You

20

Saudi Electronic University

26/12/2021

College of Computing and Informatics

Bachelor of Science in Information Technology

IT446 – Data Mining and Data Warehousing

Week 2: Chapter 1 (Introduction), Chapter 2 (Getting to Know Your Data)

Week Learning Outcomes

• Define data mining and its applications.

• Recognize kinds of data and patterns that can be mined

• Describe data objects and attribute types

4

Chapter 1: Introduction (1.1 – 1.5)

▪ Why Data Mining?

▪ What is Data Mining?

▪ What Kinds of Data Can Be Mined?

▪ What Kinds of Patterns Can Be Mined?

▪ Which Technologies Are Used?

▪ What Kinds of Applications Are Targeted?

▪ Major Issues in Data Mining

Chapter 2: Getting to Know Your Data (2.1 – 2.3)

▪ Data Objects and Attribute Types

▪ Basic Statistical Descriptions of Data

▪ Data Visualization

Why Data Mining?

• The rapid growth of Data

• Data collection and availability caused by Digital Transformation and

Automation

• Across all fields

• Business: Web, e-commerce, transactions, stocks, …

• Science: Remote sensing, bioinformatics, scientific simulation, …

• Society and everyone: news, digital cameras, YouTube

• As a result, we are drowning in data; data mining helps find the knowledge within the data.

• “Necessity is the mother of invention”—Data mining—Automated analysis of

massive data sets

6

What Is Data Mining?

• Data mining has multiple names depending on time and field

• Knowledge discovery (mining) in databases (KDD), knowledge

extraction, data/pattern analysis, data archeology, data dredging,

information harvesting, business intelligence, etc.

• Data mining is a process of extracting knowledge from data, regardless of the reasons and of what this knowledge is used for.

• In other words, the extraction of interesting (non-trivial, implicit,

previously unknown and potentially useful) patterns or knowledge

from data.

7

What Is Data Mining?

• Watch out! Avoid labeling everything as data mining.

• It is easy to overstate what data mining is and include similar tasks such as:

• Simple search and query processing

• Expert systems that do not rely on data

8

Knowledge Discovery (KDD) Process

(Figure: the KDD process flow: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation.)

9

Knowledge Discovery (KDD) Process

• Data cleaning (to remove noise and inconsistent data)

• Data integration (where multiple data sources may be combined)

• Data selection (where data relevant to the analysis task are retrieved from the

database)

• Data transformation (where data are transformed and consolidated into forms

appropriate for mining by performing summary or aggregation operations)

• Data mining (an essential process where intelligent methods are applied to extract

data patterns or knowledge)

• Pattern evaluation (to identify the truly interesting patterns representing knowledge

based on interestingness measures—see Section 1.4.6)

• Knowledge presentation (where visualization and knowledge representation

techniques are used to present mined knowledge to users)

10

What kinds of Data can be Mined?

• Database-oriented data sets and applications

• Relational database, data warehouse, transactional database

• Advanced data sets and advanced applications

• Data streams and sensor data

• Time-series data, temporal data, sequence data (incl. bio-sequences)

• Structure data, graphs, social networks and multi-linked data

• Object-relational databases

• Heterogeneous databases and legacy databases

• Spatial data and spatiotemporal data

• Multimedia database

• Text databases

• The World-Wide Web

11

What Kinds of Patterns Can Be Mined?

Data Mining Techniques Used

1) Generalization

• Generalize, summarize, and contrast data characteristics, e.g., dry vs.

wet region

2) Association and Correlation Analysis

• Frequent patterns (or frequent itemsets) e.g. What items are

frequently purchased together in your supermarket?

• Association, correlation vs. causality

• A typical association rule: Bread -> Milk [0.5%, 75%] (support,

confidence)

• Are strongly associated items also strongly correlated?

• Are all patterns interesting?

12
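To make the support and confidence numbers in a rule like Bread -> Milk concrete, here is a minimal Python sketch; the five transactions are a made-up illustration, not course data:

```python
# Minimal illustration of support and confidence for the rule Bread -> Milk.
# The transaction list below is a made-up example.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Butter"},
    {"Milk", "Eggs"},
    {"Bread", "Milk", "Butter"},
]

n = len(transactions)
bread = sum(1 for t in transactions if "Bread" in t)
bread_and_milk = sum(1 for t in transactions if {"Bread", "Milk"} <= t)

support = bread_and_milk / n          # fraction of all transactions containing both items
confidence = bread_and_milk / bread   # fraction of Bread transactions that also contain Milk

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# For these 5 transactions: support = 60%, confidence = 75%
```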

What Kinds of Patterns Can Be Mined?

Data Mining Techniques Used

3) Classification (supervised learning)

• Build a model that can classify objects based on their characteristics. The model learns to do so by looking at previous examples (referred to as training data).

• Applications include credit card fraud detection and patient

diagnoses.

4) Cluster Analysis (unsupervised learning)

• Group data objects together based on their characteristics, where objects in a group are similar and objects belonging to different groups are dissimilar.

• Applications include grouping online shoppers or news readers to better target them with advertisements.

13

What Kinds of Patterns Can Be Mined?

Data Mining Techniques Used

5) Outlier Analysis

• Outlier: A data object that does not comply with the general behavior

of the data

• Finding characteristics that a group of data objects share; objects that do not closely share those characteristics are considered outliers.

• Noise vs. outliers.

• Outlier analysis can use classification or clustering techniques.

• Applications include transaction fraud detection and rare event analysis.

14

Which Technologies Are Used?

(Figure: data mining at the confluence of multiple disciplines: Machine Learning, Pattern Recognition, Statistics, Visualization, Applications, Algorithms, Database Technology, and High-Performance Computing.)

Why Confluence of Multiple Disciplines?

• Tremendous amount of data

• Algorithms must be highly scalable to handle, for example, terabytes of data

• High-dimensionality of data

• Micro-array may have tens of thousands of dimensions

• High complexity of data

• Data streams and sensor data

• Time-series data, temporal data, sequence data

• Structure data, graphs, social networks and multi-linked data

• Heterogeneous databases and legacy databases

• Spatial, spatiotemporal, multimedia, text and Web data

• Software programs, scientific simulations

16

Applications of Data Mining

• Web page analysis: from web page classification, clustering to PageRank &

HITS algorithms

• Collaborative analysis & recommender systems

• Basket data analysis to targeted marketing

• Biological and medical data analysis: classification, cluster analysis

(microarray data analysis), biological sequence analysis, biological network

analysis

• Data mining and software engineering (e.g., IEEE Computer, Aug. 2009

issue)

• From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server

Analysis Manager, Oracle Data Mining Tools) to invisible data mining

17

Major Issues in Data Mining (1)

• Mining Methodology

• Mining various and new kinds of knowledge

• Mining knowledge in multi-dimensional space

• Data mining: An interdisciplinary effort

• Boosting the power of discovery in a networked environment

• Handling noise, uncertainty, and incompleteness of data

• Pattern evaluation and pattern- or constraint-guided mining

• User Interaction

• Interactive mining

• Incorporation of background knowledge

• Presentation and visualization of data mining results

18

Major Issues in Data Mining (2)

• Efficiency and Scalability

• Efficiency and scalability of data mining algorithms

• Parallel, distributed, stream, and incremental mining methods

• Diversity of data types

• Handling complex types of data

• Mining dynamic, networked, and global data repositories

• Data mining and society

• Social impacts of data mining

• Privacy-preserving data mining

• Invisible data mining

19

Data Objects and Attributes Types

• Data sets are made up of data objects.

• A data object represents an entity—

• in a sales database, the objects may be customers, store items, and

sales;

• in a medical database, the objects may be patients;

• in a university database, the objects may be students, professors, and

courses.

• Data objects are typically described or represented by attributes.

• Data objects can also be referred to as samples, examples, instances, data

points, or objects.

20

Attributes

• An attribute is a data field, representing a characteristic or feature of a data

object.

• The nouns attribute, dimension, feature, and variable are often used

interchangeably in the literature.

• The term dimension is commonly used in data warehousing.

• Machine learning literature tends to use the term feature.

• Statisticians prefer the term variable.

• Data mining and database professionals commonly use the term

attribute, and we do in this course.

21

Attributes Types

• The type of an attribute is determined by the set of possible values the

attribute can have.

• These are the four types:

1) Nominal Attributes:

2) Binary Attributes

3) Ordinal Attributes

4) Numeric Attributes

• Interval-Scaled Attributes

• Ratio-Scaled Attributes

22

Attributes Types

• The type of an attribute is determined by the set of possible values the

attribute can have. These are the four types:

1) Nominal Attributes: Each value represents some kind of category,

code, or state, and so nominal attributes are also referred to as

categorical. The values do not have any meaningful order. E.g. Hair

color, marital status, occupation, ID numbers, zip codes

2) Binary Attributes: A binary attribute is a nominal attribute with only

two categories or states: 0 or 1, where 0 typically means that the

attribute is absent, and 1 means that it is present. Binary attributes

are referred to as Boolean if the two states correspond to true and

false. E.g. Medical test result or gender

23

Attributes Types

3) Ordinal Attributes: an attribute with possible values that have a

meaningful order or ranking among them, but the magnitude

between successive values is not known. E.g.

• Size = {small, medium, large}

• Grades = {A, B, C, D, F}

• Army rankings … Etc

24

Attributes Types

4) Numeric Attributes: is quantitative; that is, it is a measurable

quantity, represented in integer or real values. Numeric attributes can

be interval-scaled or ratio-scaled.

• Interval-Scaled Attributes are measured on a scale of equal-size units. There is no true zero point. E.g., temperature in °C or °F, calendar dates.

• Ratio-Scaled Attributes are numeric attributes with an inherent zero point. E.g., area, weight, height, length, counts, monetary quantities. The ratio between two data objects' attribute values can be calculated.

25

Discrete vs. Continuous Attributes

• Discrete Attribute (Nominal, Binary and Ordinal)

• Has only a finite or countably infinite set of values E.g., zip codes,

profession, or the set of words in a collection of documents

• Sometimes, represented as integer variables

• Continuous or Numeric Attribute (Ratio and Interval)

• Has real numbers as attribute values

• E.g., temperature, height, or weight

• Practically, real values can only be measured and represented using a

finite number of digits

• Continuous attributes are typically represented as floating-point

variables

26

Basic Statistical Descriptions of Data

• Motivation

• To better understand the data

• How?

• by measuring data’s central tendency and distribution (variation and

spread).

• Measuring the Central Tendency characteristics

• Mean, Median, Mode and Midrange.

• Measuring the Data dispersion (or distribution) characteristics

• Range, max, min, quantiles, outliers, variance and standard deviation.

27

Basic Statistical Descriptions of Data

Measuring the Central Tendency characteristics

• Mean

• Mean is average value for all the data also is the center of data. It calculated by

dividing the sum of all values over the sample size.

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

• Trimmed mean

• The mean can also be calculated on a trimmed data by removing the extreme

values.

• Weighted average or Weighted arithmetic mean

• Differ from regular mean by giving each value a weight that reflect its significance

or importance.

$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

28

Basic Statistical Descriptions of Data

Measuring the Central Tendency characteristics

• Median

• After sorting the data, the median is the middle value if the size of data

is an odd number otherwise the sum of the two middle numbers

divided by 2.

• Sorting can be computational expensive. However, without sorting we

can approximate the value.

$\text{median} \approx L_1 + \left(\frac{n/2 - (\sum \mathit{freq})_l}{\mathit{freq}_{\mathrm{median}}}\right) \times \mathit{width}$

Basic Statistical Descriptions of Data

Measuring the Central Tendency characteristics

• Mode is a value that occurs most frequently in the data

• Sometimes we have multiple values with the same highest frequency (unimodal or multimodal, e.g., bimodal, trimodal).

• Only one value with the highest frequency = unimodal

• Two values that are equally the most frequent = bimodal

• Three values that are equally the most frequent = trimodal

• For unimodal data, the mode can also be approximated using: mean − mode ≈ 3 × (mean − median)

30

Basic Statistical Descriptions of Data

Measuring the Central Tendency characteristics

• Midrange is another measure of central tendency. It is simply the average

of the min and max values of the data.

• This is easy to compute using the SQL aggregate functions, max() and

min().

• When data have a symmetric distribution, all central tendency measures return the same center value.

• But data usually do not!

31

Measuring the Central Tendency characteristics – Example

• Suppose we have the following values for salary (in thousands of dollars), shown in

increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

• Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{696}{12} = 58$ (thousand)

• Trimmed mean

• In this example, remove 30, 36 and 110. Then, recalculate.

• Median: with n = 12 (even), the median is the average of the two middle values, (52 + 56)/2 = 54 (thousand)

• Mode

• 52 and 70 are the modes (bimodal)

• Midrange

• (30 (min) + 110 (max) ) / 2 = 70
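As a small illustrative sketch only, the measures in this example can be checked in Python using the salary values above (in thousands):

```python
from statistics import mean, median, multimode

# Salary example from the slide, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean     :", mean(salaries))                       # 58
print("trimmed  :", mean(sorted(salaries)[2:-1]))         # drop 30, 36 and 110 as on the slide
print("median   :", median(salaries))                     # 54 (average of 52 and 56)
print("modes    :", multimode(salaries))                  # [52, 70] -> bimodal
print("midrange :", (min(salaries) + max(salaries)) / 2)  # 70
```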

32

Basic Statistical Descriptions of Data

Measuring the Central Tendency characteristics

• Data in real-world applications tend to have asymmetric distributions, i.e., positively or negatively skewed data rather than symmetric data.

(Figure: symmetric, positively skewed, and negatively skewed distributions.)

Basic Statistical Descriptions of Data

Measuring the Dispersion (distribution) of Data

• Quartiles, outliers and boxplots

• Quartiles: Q1 (25th percentile), Q3 (75th percentile)

• Inter-quartile range: IQR = Q3 – Q1

• Five number summary: min, Q1, median, Q3, max

• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and

plot outliers individually

• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1

• Variance and standard deviation (sample: s, population: σ)

• Variance: (algebraic, scalable computation)

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

• Standard deviation s (or σ) is the square root of variance s2 (or σ2)
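As an illustrative sketch only, reusing the salary data from the central-tendency example, the five-number summary, IQR, outlier bounds, variance and standard deviation can be computed as:

```python
import statistics

# Salary data (in thousands) from the central-tendency example.
data = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, med, q3 = statistics.quantiles(data, n=4)   # quartiles Q1, median, Q3
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # common outlier thresholds

print("five-number summary:", data[0], q1, med, q3, data[-1])
print("IQR:", iqr, "outlier bounds:", (low, high))
print("outliers:", [x for x in data if x < low or x > high])

# Population variance and standard deviation (divide by N), as in the slide's formula.
print("variance:", statistics.pvariance(data))
print("std dev :", statistics.pstdev(data))
```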

34

Dispersion (distribution) of Data

• Popular visualization plots visualize data distribution

• Boxplot: graphic display of five-number summary

• Histogram: x-axis are values, y-axis represent frequencies

• Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi% of the data are ≤ xi

• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution

against the corresponding quantiles of another

• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the

plane

35

Boxplot

• Five-number summary of a distribution

• Minimum, Q1, Median, Q3, Maximum

• Boxplot

• Data is represented with a box

• The ends of the box are at the first and

third quartiles, i.e., the height of the box

is IQR

• The median is marked by a line within the

box

• Whiskers: two lines outside the box

extended to Minimum and Maximum

• Outliers: points beyond a specified outlier

threshold, plotted individually

36

Histogram

• Histograms (or frequency histograms) are at least a century old and are widely used.

• The height of each bar indicates the frequency (i.e., count) of the values that fall within the range of that bar. When the attribute is nominal, the resulting graph is more commonly known as a bar chart.

37

Quantile plot

• Displays all of the data (allowing the user to assess both the overall behavior and

unusual occurrences)

• Plots quantile information: for data xi sorted in increasing order, fi indicates that approximately 100·fi% of the data are below or equal to the value xi

38

Quantile-Quantile (Q-Q) Plot

• Graphs the quantiles of one univariate distribution against the corresponding

quantiles of another

• View: Is there a shift in going from one distribution to another?

• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit

prices of items sold at Branch 1 tend to be lower than those at Branch 2.

39

Scatter Plot

• Provides a first look at bivariate data to see clusters of points, outliers, etc

• Each pair of values is treated as a pair of coordinates and plotted as points in the plane

40

Scatter Plot – Correlation

(Figure: scatter plots of positively correlated, negatively correlated, and uncorrelated data.)

Data Visualization

• Data visualization

• aims to communicate data clearly and effectively through graphical

representation. Data visualization has been

• used extensively in many applications—for example, at work for

reporting, managing business operations, and tracking progress of tasks.

• used to discover data relationships that are otherwise not easily

observable by looking at the raw data.

• Provide a visual proof of computer representations derived

42

Data Visualization

• Categorization of visualization methods:

• Pixel-oriented visualization techniques

• Geometric projection visualization techniques

• Icon-based visualization techniques

• Hierarchical visualization techniques

• Visualizing complex data and relations

43

Pixel-oriented visualization techniques

• Each window represents an attribute (a feature)

• Each data object is represented by a pixel in each window

• The density of the pixel color is proportional to the attribute value

(Figure: pixel-oriented visualization with four windows: (a) income, (b) credit limit, (c) transaction volume, (d) age.)

Laying Out Pixels in Circle Segments

• To save space and show the connections among multiple dimensions, space

filling is often done in a circle segment

(Figure: (a) representing a data record in a circle segment; (b) laying out pixels in circle segments.)

45

Geometric projection visualization techniques

• Visualization of geometric transformations and projections of the data

• Methods

• Direct visualization

• Scatterplot and scatterplot matrices

• Landscapes

• Projection pursuit technique: Help users find meaningful projections of

multidimensional data

• Prosection views

• Hyperslice

• Parallel coordinates

46

Scatterplot matrices and Direct visualization

47

3D Scatterplot and 2D Scatterplot with Cartesian Coordinates

48

Icon-based visualization techniques

• Visualization of the data values as features of icons

• Typical visualization methods

• Chernoff Faces

• Stick Figures

• General techniques

• Shape coding: Use shape to represent certain information encoding

• Color icons: Use color icons to encode more information

• Tile bars: Use small icons to represent the relevant feature vectors in

document retrieval

49

Icon-based visualization techniques

50

Hierarchical visualization techniques

• Visualization of the data using a hierarchical partitioning into

subspaces

• Methods

• Dimensional Stacking

• Worlds-within-Worlds

• Tree-Map

• Cone Trees

• InfoCube

51

Hierarchical visualization techniques – Dimensional Stacking

52

Hierarchical visualization techniques – InfoCube

53

Visualizing complex data and relations

• Visualizing non-numerical data: text and social networks

• Tag cloud: visualizing user-generated tags; the importance of a tag is represented by its font size/color

• Besides text data, there are also methods to visualize relationships, such as visualizing social networks

(Figure: Newsmap visualization of Google News stories in 2005.)

54

Required Reading

1. Data Mining: Concepts and Techniques, Chapter 1 (introduction)

2. Data Mining: Concepts and Techniques, Chapter 2 (Getting to

Know Your Data)

Recommended Reading

1. Data Mining: Practical Machine Learning Tools and Techniques Chapter 1 (What is

it all about?)

2. Data Mining: Practical Machine Learning Tools and Techniques Chapter 2 (Input:

concept, instances, attributes)

This Presentation is mainly dependent on the textbook: Data Mining: Concepts and Techniques (3rd ed.)

Thank You

56

Saudi Electronic University

26/12/2021

College of Computing and Informatics

Bachelor of Science in Information Technology

IT446 – Data Mining and Data Warehousing

Week 3: Chapter 3 (Data Preprocessing)

Week Learning Outcomes

▪ Recognize major tasks in data preprocessing.

▪ Perform data cleaning.

▪ Perform data integration.

4

Chapter 3: Data Preprocessing

▪ Data Preprocessing Overview

▪ Data Cleaning

▪ Data Integration

▪ Preprocessing Practices on Weka

5

Data Quality: Why Preprocess the Data?

• Preprocessing improves data quality including

• Accuracy: correctness of data, i.e., incorrect attribute values could be due to faulty instruments or to human error (or intentional errors).

• Completeness: full information is available, i.e., incomplete data can occur for values that are not always available, such as customer information for sales transaction data.

• Consistency: data from all sources agree, i.e., two different users may have very different assessments or live in different time zones.

• Timeliness: time differences and delays, i.e., some store branches may have delays in syncing sales.

• Believability: reflects how much the data are trusted by users

• Interpretability: reflects how easy the data are understood.

6

Major Tasks in Data Preprocessing

• Data cleaning routines work to “clean” the data by filling in missing values,

smoothing noisy data, identifying or removing outliers, and resolving

inconsistencies.

• Data integration is the process of integrating data from multiple sources (databases, data cubes, or files).

• Data reduction obtains a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) mining results.

• Data transformation converts the data into forms appropriate for better mining results.

7

Data Preprocessing Overview

8

Data Cleaning

• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument

faulty, human or computer error, transmission error

• incomplete: lacking attribute values, lacking certain attributes of interest, or

containing only aggregate data

• e.g., Occupation=“ ” (missing data)

• noisy: containing noise, errors, or outliers

• e.g., Salary=“−10” (an error)

• inconsistent: containing discrepancies in codes or names, e.g.,

• Age=“42”, Birthday=“03/07/2010”

• Was rating “1, 2, 3”, now rating “A, B, C”

• discrepancy between duplicate records

• Intentional (e.g., disguised missing data)

• Jan. 1 as everyone’s birthday?

9

Incomplete Data (Missing Values)

• Data is not always available

• E.g., many tuples have no recorded value for several attributes, such as

customer income in sales data

• Missing data may be due to

• equipment malfunction

• inconsistent with other recorded data and thus deleted

• data not entered due to misunderstanding

• certain data may not be considered important at the time of entry

• not register history or changes of the data

• Missing data may need to be inferred

10

How to Handle Missing Values?

• Ignore the sample: not effective when the % of ignored sample

is too high.

• Fill in the missing value manually: tedious + infeasible?

• Fill it in automatically with:

• a global constant, e.g., "unknown" (which may effectively form a new class!)

• the attribute mean for all data

• the attribute mean for all samples belonging to the same class

• the most probable value, based on some statistical model
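A minimal sketch of automatic filling, assuming a small made-up table of (class label, income) records where None marks a missing value; it fills the gap with the overall attribute mean and with the per-class mean:

```python
# Made-up records: (class label, income); None marks a missing value.
records = [("A", 30.0), ("A", None), ("A", 50.0), ("B", 80.0), ("B", 100.0)]

known = [v for _, v in records if v is not None]
global_mean = sum(known) / len(known)

def class_mean(label):
    vals = [v for c, v in records if c == label and v is not None]
    return sum(vals) / len(vals)

filled_global = [(c, v if v is not None else global_mean) for c, v in records]
filled_by_class = [(c, v if v is not None else class_mean(c)) for c, v in records]

print(filled_global)    # missing value replaced by the overall mean (65.0)
print(filled_by_class)  # missing value replaced by the mean of class "A" (40.0)
```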

11

Noisy Data

• Noise: random error in data

• Incorrect attribute values may be due to

• faulty data collection instruments

• data entry problems

• data transmission problems

• technology limitation

• inconsistency in naming convention

• Other data problems which require data cleaning

• duplicate records

• incomplete data

• inconsistent data

12

How to Handle Noisy Data?

• Binning

• first sort data and partition into (equal-frequency) bins

• then one can smooth by bin means, smooth by bin median,

smooth by bin boundaries, etc.

• Regression

• smooth by fitting the data into regression functions

• Clustering

• detect and remove outliers

• Combined computer and human inspection

• detect suspicious values and check by human (e.g., deal with

possible outliers)

13

Data Cleaning as a Process

• Data discrepancy detection

• Use metadata (e.g., domain, range, dependency, distribution)

• Check field overloading

• Check uniqueness rule, consecutive rule and null rule

• Use commercial tools

• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to

detect errors and make corrections

• Data auditing: by analyzing data to discover rules and relationship to detect violators

(e.g., correlation and clustering to find outliers)

• Data migration and integration

• Data migration tools: allow transformations to be specified

• ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations

through a graphical user interface

• Integration of the two processes

• Iterative and interactive (e.g., Potter’s Wheels)

14

Data Integration

• Data integration:

• Combines data from multiple sources into a coherent store

• Schema integration: e.g., A.cust-id = B.cust-#

• Integrate metadata from different sources

• Entity identification problem:

• Identify real world entities from multiple data sources, e.g., Bill Clinton =

William Clinton

• Detecting and resolving data value conflicts

• For the same real world entity, attribute values from different sources

are different

• Possible reasons: different representations, different scales, e.g., metric

vs. British units

15

Handling Redundancy in Data Integration

• Redundant data occur often when integration of multiple databases

• Object identification: The same attribute or object may have different

names in different databases

• Derivable data: One attribute may be a “derived” attribute in another

table, e.g., quarters revenue and annual revenue

• Redundant attributes may be able to be detected by correlation analysis

and covariance analysis

• Careful integration of the data from multiple sources may help reduce/avoid

redundancies and inconsistencies and improve mining speed and quality

16

Correlation Analysis (Nominal Data)

• Correlation analysis is used on attributes (features) to determine their dependencies on each other.

• The Χ² (chi-square) test is used to find correlation (dependency) between two nominal features:

$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

• The larger the Χ2 value, the more likely the variables are related

• The cells that contribute the most to the Χ2 value are those whose actual

count is very different from the expected count

• Correlation does not imply causality

• # of hospitals and # of car-theft in a city are correlated

• Both are causally linked to the third variable: population

17

Chi-Square Calculation: An Example

                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500

• Χ2 (chi-square) calculation (numbers in parenthesis are expected counts

calculated based on the data distribution in the two categories)

$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$

• It shows that like_science_fiction and play_chess are correlated in the group
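The same calculation can be reproduced with a short Python sketch, offered only as a plain re-computation of the slide's numbers (not Weka output):

```python
# Observed counts from the contingency table above:
# rows: like science fiction / not; columns: play chess / not.
observed = [[250, 200], [50, 1000]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total  # e.g. 450*300/1500 = 90
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))  # ~507.9, matching the value on the slide
```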

18

Visually Evaluating Correlation

(Figure: scatter plots showing correlation values ranging from –1 to 1.)

19

Covariance (Numeric Data)

• Covariance is similar to correlation:

$Cov(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})$

Correlation coefficient: $r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$

• where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B.

• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their

expected values.

• Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is
likely to be smaller than its expected value.
• Independence: CovA,B = 0 but the converse is not true:
• Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply
independence
20
Co-Variance: An Example
• It can be simplified in computation as: $Cov(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$
• Suppose two stocks A and B have the following values in one week: (2, 5),
(3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their
prices rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
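A short Python check of this example, offered only as a plain re-computation:

```python
# Stock prices for A and B over five days, from the example above.
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]

n = len(A)
mean_a = sum(A) / n                       # 4.0
mean_b = sum(B) / n                       # 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b

print(cov)  # 4.0 > 0, so A and B tend to rise together
```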

21

Weka Tool

• WEKA is an open source tool that provides

• many machine learning algorithms implementations

• variety of datasets

• a variety of tools for transforming datasets

• Ability to preprocess a dataset and train a model

• Ability to analyze the resulting model and its performance

• all of the above without writing any program code.

• WEKA is available from http://www.cs.waikato.ac.nz/ml/weka

• For more information

https://www.cs.waikato.ac.nz/ml/weka/Witten_et_al_2016_appendix.pdf

22


Preprocessing Practices on Weka

• Practical in class exercise:

• Pick a dataset and load it on Weka

• Look at the data graphs

• Apply some preprocessing techniques

• Look at how the data has changed

• Choose another preprocessing technique

• Look at the data graphs again.

25

Required Reading

1. Data Mining: Concepts and Techniques, Chapter 3: Data

Preprocessing 3.1, 3.2 and 3.3

Recommended Reading

1. Data Mining, Fourth Edition: Practical Machine Learning Tools and Techniques,

Chapter 2- Input: concepts, instances, attributes

26

This Presentation is mainly dependent on the textbook: Data Mining: Concepts and Techniques (3rd ed.)

Thank You

27

Saudi Electronic University

26/12/2021

College of Computing and Informatics

Bachelor of Science in Information Technology

IT446 – Data Mining and Data Warehousing

Week 4: Chapter 3 (Data Preprocessing II)

Week Learning Outcomes

▪ Describe various data reduction and transformations

techniques.

▪ Explain the rationale for data reduction and transformation

▪ Employ data reduction and transformation techniques

▪ Employ several types of similarity measures

4

Chapter 3: Data Preprocessing II

▪ Data Reduction

▪ Data Transformation

▪ Data Discretization

▪ Measuring Data Similarity and Dissimilarity (Chapter 2.4)

▪ Preprocessing Practices II on Weka

5

Data Reduction Strategies

• Data reduction: Obtain a reduced representation of the data set that is much

smaller in volume but yet produces the same (or almost the same) analytical results

• Why data reduction? — A database/data warehouse may store terabytes of data.

Complex data analysis may take a very long time to run on the complete data set.

• Data reduction strategies

• Dimensionality reduction, e.g., remove unimportant attributes

• Wavelet transforms

• Principal Components Analysis (PCA)

• Feature subset selection, feature creation

• Numerosity reduction (some simply call it: Data Reduction)

• Regression and Log-Linear Models

• Histograms, clustering, sampling

• Data cube aggregation

• Data compression (lossy/lossless)

6

Data Reduction 1: Dimensionality Reduction

• Curse of dimensionality

• When dimensionality increases, data becomes increasingly sparse

• Density and distance between points, which is critical to clustering, outlier analysis,

becomes less meaningful

• The possible combinations of subspaces will grow exponentially

• Dimensionality reduction

• Avoid the curse of dimensionality

• Help eliminate irrelevant features and reduce noise

• Reduce time and space required in data mining

• Allow easier visualization

• Dimensionality reduction techniques

• Wavelet transforms

• Principal Component Analysis

• Supervised and nonlinear techniques (e.g., feature selection)

7

What Is Wavelet Transform?

• Decomposes a signal into different

frequency subbands

• Applicable to n-dimensional signals

• Data are transformed to preserve relative

distance between objects at different

levels of resolution

• Allow natural clusters to become more

distinguishable

• Used for image compression

8

Wavelet Transformation

• Discrete wavelet transform (DWT) for linear signal processing, multi-resolution

analysis

• Compressed approximation: store only a small fraction of the strongest of the

wavelet coefficients

• Similar to discrete Fourier transform (DFT), but better lossy compression, localized in

space

• Method:

• Length, L, must be an integer power of 2 (padding with 0’s, when necessary)

• Each transform has 2 functions: smoothing, difference

• Applies to pairs of data, resulting in two set of data of length L/2

• Applies two functions recursively, until reaches the desired length

(Figure: Haar-2 and Daubechies-4 wavelet functions.)

9

Wavelet Decomposition

• Wavelets: A math tool for space-efficient hierarchical decomposition of

functions

• S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S' = [2¾, −1¼, ½, 0, 0, −1, −1, 0]

• Compression: many small detail coefficients can be replaced by 0’s, and only

the significant coefficients are retained
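A minimal sketch of the pairwise averaging/differencing idea behind the Haar transform, reproducing the example above (illustrative only; normalization conventions differ between wavelet libraries):

```python
def haar_transform(values):
    """One-dimensional Haar decomposition by repeated pairwise averaging/differencing."""
    data = list(values)            # length is assumed to be a power of 2
    output = []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details  = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        output = details + output  # prepend so coarser detail coefficients come first
        data = averages
    return data + output           # [overall average, detail coefficients]

print(haar_transform([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```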

10

Haar Wavelet Coefficients

(Figure: hierarchical decomposition structure, a.k.a. "error tree", of the Haar wavelet coefficients [2.75, −1.25, 0.5, 0, 0, −1, −1, 0], showing each coefficient's "supports" over the original frequency distribution.)

Why Wavelet Transform?

• Use hat-shape filters

• Emphasize region where points cluster

• Suppress weaker information in their boundaries

• Effective removal of outliers

• Insensitive to noise, insensitive to input order

• Multi-resolution

• Detect arbitrary shaped clusters at different scales

• Efficient

• Complexity O(N)

• Only applicable to low dimensional data

12

Principal Component Analysis (PCA)

• Find a projection that captures the largest amount of variation in the data

• The original data are projected onto a much smaller space, resulting in dimensionality reduction

• We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space

(Figure: data points in the x1–x2 plane with the principal direction e capturing the largest variation.)
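A compact sketch of the eigenvector idea, assuming a small made-up 2-D dataset (illustrative only, not how Weka implements PCA internally):

```python
import numpy as np

# Made-up 2-D data: 6 points with most variation along one direction.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors define the new space

order = np.argsort(eigvals)[::-1]       # sort components by captured variance
components = eigvecs[:, order]

# Project onto the first principal component: 2-D data reduced to 1-D.
reduced = Xc @ components[:, :1]
print("captured variance ratio:", eigvals[order][0] / eigvals.sum())
print(reduced.ravel())
```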

13

Attribute Subset Selection

• Another way to reduce dimensionality of data

• Redundant attributes

• Duplicate much or all of the information contained in one or more

other attributes

• E.g., purchase price of a product and the amount of sales tax paid

• Irrelevant attributes

• Contain no information that is useful for the data mining task at hand

• E.g., students’ ID is often irrelevant to the task of predicting students’

GPA

14

Heuristic Search in Attribute Selection

• There are 2^d possible attribute combinations of d attributes

• Typical heuristic attribute selection methods:

• Best single attribute under the attribute independence assumption:

choose by significance tests

• Best step-wise feature selection:

• The best single-attribute is picked first

• Then next best attribute condition to the first, …

• Step-wise attribute elimination:

• Repeatedly eliminate the worst attribute

• Best combined attribute selection and elimination

• Optimal branch and bound:

• Use attribute elimination and backtracking

15

Attribute Creation (Feature Generation)

• Create new attributes (features) that can capture the important information in a

data set more effectively than the original ones

• Three general methodologies

• Attribute extraction

• Domain-specific

• Mapping data to new space (see: data reduction)

• E.g., Fourier transformation, wavelet transformation, manifold

approaches (not covered)

• Attribute construction

• Combining features (see: discriminative frequent patterns in Chapter 7)

• Data discretization

16

Data Reduction 2: Numerosity Reduction

• Reduce data volume by choosing alternative, smaller forms of data

representation

• Parametric methods (e.g., regression)

• Assume the data fits some model, estimate model parameters, store

only the parameters, and discard the data (except possible outliers)

• Ex.: Log-linear models—obtain value at a point in m-D space as the

product on appropriate marginal subspaces

• Non-parametric methods

• Do not assume models

• Major families: histograms, clustering, sampling, …

17

Parametric Data Reduction: Regression and Log-Linear Models

• Linear regression

• Data modeled to fit a straight line

• Often uses the least-square method to fit the line

• Multiple regression

• Allows a response variable Y to be modeled as a linear function of

multidimensional feature vector

• Log-linear model

• Approximates discrete multidimensional probability distributions

18

Regression Analysis

• Regression analysis: A collective name for

techniques for the modeling and analysis

of numerical data consisting of values of a

dependent variable (also called response

variable or measurement) and of one or

more independent variables (aka.

explanatory variables or predictors)

• The parameters are estimated so as to give

a “best fit” of the data

• Most commonly the best fit is evaluated by

using the least squares method, but other

criteria have also been used

19

Y1

Y1’

y=x+1

X1

x

• Used for prediction (including

forecasting of time-series data),

inference, hypothesis testing,

and modeling of causal

relationships

Regression Analysis and Log-Linear Models

• Linear regression: Y = w X + b

• Two regression coefficients, w and b, specify the line and are to be estimated

by using the data at hand

• Using the least squares criterion to the known values of Y1, Y2, …, X1, X2, ….

• Multiple regression: Y = b0 + b1 X1 + b2 X2

• Many nonlinear functions can be transformed into the above

• Log-linear models:

• Approximate discrete multidimensional probability distributions

• Estimate the probability of each point (tuple) in a multi-dimensional space for a

set of discretized attributes, based on a smaller subset of dimensional

combinations

• Useful for dimensionality reduction and data smoothing

20

Histogram Analysis

• Divide data into buckets and store the average (sum) for each bucket

• Partitioning rules:

• Equal-width: equal bucket range

• Equal-frequency (or equal-depth): each bucket contains approximately the same number of values

(Figure: example equal-width histogram of the data.)

Clustering for Data Reduction

• Partition data set into clusters based on similarity, and store

cluster representation (e.g., centroid and diameter) only

• Can be very effective if data is clustered but not if data is

“smeared”

• Can have hierarchical clustering and be stored in multidimensional index tree structures

• There are many choices of clustering definitions and clustering

algorithms

• Cluster analysis will be studied in depth in Chapter 10

22

Sampling

• Sampling: obtaining a small sample s to represent the whole

data set N

• Allow a mining algorithm to run in complexity that is potentially

sub-linear to the size of the data

• Key principle: Choose a representative subset of the data

• Simple random sampling may have very poor performance in

the presence of skew

• Develop adaptive sampling methods, e.g., stratified

sampling:

• Note: Sampling may not reduce database I/Os (page at a time)

23

Types of Sampling

• Simple random sampling

• There is an equal probability of selecting any particular item

• Sampling without replacement

• Once an object is selected, it is removed from the population

• Sampling with replacement

• A selected object is not removed from the population

• Stratified sampling:

• Partition the data set, and draw samples from each partition

(proportionally, i.e., approximately the same percentage of the data)

• Used in conjunction with skewed data

24
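A brief sketch of these sampling variants, assuming a made-up list of records with a class label used as the stratum (illustrative only):

```python
import random
from collections import defaultdict

random.seed(0)
# Made-up population: (record id, stratum label); skewed: 80 A's, 20 B's.
population = [(i, "A" if i < 80 else "B") for i in range(100)]

# Simple random sampling without replacement (SRSWOR) and with replacement (SRSWR).
srswor = random.sample(population, 10)
srswr = [random.choice(population) for _ in range(10)]

# Stratified sampling: draw ~10% from each stratum so B is not under-represented.
strata = defaultdict(list)
for rec in population:
    strata[rec[1]].append(rec)
stratified = [rec for group in strata.values()
              for rec in random.sample(group, max(1, len(group) // 10))]

print(len(srswor), len(srswr), len(stratified))
```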

Sampling: With or without Replacement

(Figure: raw data sampled with and without replacement.)

Sampling: Cluster or Stratified Sampling

(Figure: raw data vs. a cluster/stratified sample.)

26

Data Transformation

• A function that maps the entire set of values of a given attribute to a new set

of replacement values s.t. each old value can be identified with one of the new

values

• Methods

• Smoothing: Remove noise from data

• Attribute/feature construction

• New attributes constructed from the given ones

• Aggregation: Summarization, data cube construction

• Normalization: Scaled to fall within a smaller, specified range

• min-max normalization

• z-score normalization

• normalization by decimal scaling

• Discretization: Concept hierarchy climbing

27

Normalization

• Min-max normalization: to [new_minA, new_maxA]

$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$

• Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$

• Z-score normalization (μ: mean, σ: standard deviation):

$v' = \frac{v - \mu_A}{\sigma_A}$

• Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$

• Normalization by decimal scaling:

$v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$
28
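A small Python sketch of the three normalization methods applied to the income example above (illustrative only; in Weka the corresponding unsupervised attribute filters are Normalize and Standardize):

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(v, max_abs):
    # smallest j such that max(|v'|) < 1, assuming max_abs >= 1
    j = len(str(int(abs(max_abs))))
    return v / (10 ** j)

print(min_max(73_600, 12_000, 98_000))       # 0.716
print(z_score(73_600, 54_000, 16_000))       # 1.225
print(decimal_scaling(73_600, 98_000))       # 0.736
```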
Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification
29
Data Discretization Methods
• Typical methods: All the methods can be applied recursively
• Binning
• Top-down split, unsupervised
• Histogram analysis
• Top-down split, unsupervised
• Clustering analysis (unsupervised, top-down split or bottom-up merge)
• Decision-tree analysis (supervised, top-down split)
• Correlation (e.g., X^2) analysis (unsupervised, bottom-up merge)
30
Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately same number
of samples
• Good data scaling
• Managing categorical attributes can be tricky
31
Simple Discretization: Binning
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
32
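A short sketch reproducing the equal-frequency binning and smoothing steps above (illustrative only):

```python
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(prices) // n_bins

# Partition into equal-frequency (equi-depth) bins.
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer boundary.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```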
Discretization Without Using Class Labels
(Binning vs. Clustering)
(Figure: equal-frequency binning vs. K-means clustering of the same data; K-means clustering leads to better results.)
34
Discretization by Classification & Correlation Analysis
• Classification (e.g., decision tree analysis)
• Supervised: Given class labels, e.g., cancerous vs. benign
• Using entropy to determine split point (discretization point)
• Top-down, recursive split
• Details to be covered in Chapter 7
• Correlation analysis (e.g., Chi-merge: χ2-based discretization)
• Supervised: use class information
• Bottom-up merge: find the best neighboring intervals (those having similar
distributions of classes, i.e., low χ2 values) to merge
• Merge performed recursively, until a predefined stopping condition
35
Concept Hierarchy Generation for Nominal Data
• Specification of a partial/total ordering of attributes explicitly at the schema
level by users or experts
• street < city < state < country
• Specification of a hierarchy for a set of values by explicit data grouping
• {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
• E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by the analysis of the
number of distinct values
• E.g., for a set of attributes: {street, city, state, country}
36
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the analysis of the
number of distinct values per attribute in the data set
• The attribute with the most distinct values is placed at the lowest level of
the hierarchy
• Exceptions, e.g., weekday, month, quarter, year
• Example: country (15 distinct values) < province_or_state (365 distinct values) < city (3,567 distinct values) < street (674,339 distinct values)
37
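A small Python sketch (not from the slides) of the idea above: order attributes into a hierarchy by their number of distinct values, lowest level first. The counts are the ones shown on the slide.

# Order attributes by number of distinct values (most distinct = lowest hierarchy level).
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674_339,
}

hierarchy = sorted(distinct_counts, key=distinct_counts.get, reverse=True)
print(" < ".join(hierarchy))   # street < city < province_or_state < country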
Measuring Data Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
38
Data Matrix and Dissimilarity Matrix
• Data matrix
• n data points with p dimensions
• Two modes
• Dissimilarity matrix
• n data points, but registers only the
distance
• A triangular matrix
• Single mode
39
Data matrix (n objects × p attributes, two modes):
[ x_11  …  x_1f  …  x_1p ]
[  …    …   …    …   …   ]
[ x_i1  …  x_if  …  x_ip ]
[  …    …   …    …   …   ]
[ x_n1  …  x_nf  …  x_np ]

Dissimilarity matrix (n × n, lower triangular, single mode):
[ 0                                ]
[ d(2,1)  0                        ]
[ d(3,1)  d(3,2)  0                ]
[   :       :                      ]
[ d(n,1)  d(n,2)  …           0    ]
Proximity Measure for Nominal Attributes
• Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a
binary attribute)
• Method 1: Simple matching
• m: # of matches, p: total # of variables
• d(i, j) = (p − m) / p
• Method 2: Use a large number of binary attributes
• creating a new binary attribute for each of the M nominal states
40
Proximity Measure for Binary Attributes
• A contingency table for binary data (Object i vs. Object j):
  q = # of attributes where i = 1 and j = 1, r = # where i = 1 and j = 0,
  s = # where i = 0 and j = 1, t = # where i = 0 and j = 0 (p = q + r + s + t)
• Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t)
• Distance measure for asymmetric binary variables: d(i, j) = (r + s) / (q + r + s)
• Jaccard coefficient (similarity measure for asymmetric binary variables): sim_Jaccard(i, j) = q / (q + r + s)
• Note: Jaccard coefficient is the same as “coherence”
41
Dissimilarity between Binary Variables
• Example
Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Jack | M | Y | N | P | N | N | N
Mary | F | Y | N | P | N | P | N
Jim  | M | Y | P | N | N | N | N
• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N 0
  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
42
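A minimal Python sketch (not from the slides) computing the asymmetric binary dissimilarity d = (r + s)/(q + r + s) for the Jack/Mary/Jim example; Y and P are coded as 1, N as 0, and the symmetric attribute Gender is excluded.

# Asymmetric binary dissimilarity for the example above.
patients = {
    "jack": [1, 0, 1, 0, 0, 0],   # fever, cough, test-1 .. test-4
    "mary": [1, 0, 1, 0, 1, 0],
    "jim":  [1, 1, 0, 0, 0, 0],
}

def asym_binary_dissim(x, y):
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    # Negative matches (t) are ignored for asymmetric binary attributes.
    return (r + s) / (q + r + s)

print(round(asym_binary_dissim(patients["jack"], patients["mary"]), 2))  # 0.33
print(round(asym_binary_dissim(patients["jack"], patients["jim"]), 2))   # 0.67
print(round(asym_binary_dissim(patients["jim"], patients["mary"]), 2))   # 0.75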
Standardizing Numeric Data
• Z-score: z = (x − μ) / σ
  • x: raw score to be standardized, μ: mean of the population, σ: standard deviation
  • the distance between the raw score and the population mean in units of the standard deviation
  • negative when the raw score is below the mean, “+” when above
• An alternative way: calculate the mean absolute deviation
  s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|), where m_f = (1/n)(x_1f + x_2f + … + x_nf)
  • standardized measure (z-score): z_if = (x_if − m_f) / s_f
• Using mean absolute deviation is more robust than using standard deviation
43
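A short Python sketch (not from the slides) contrasting the classical z-score with the variant based on the mean absolute deviation s_f; the data list is a hypothetical attribute with one outlier.

# Standardizing a numeric attribute: z-score vs. mean-absolute-deviation variant.
def z_scores(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def z_scores_mad(values):
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(v - m_f) for v in values) / n   # mean absolute deviation
    return [(v - m_f) / s_f for v in values]

data = [20, 22, 25, 30, 200]                       # hypothetical values with an outlier
print([round(z, 2) for z in z_scores(data)])
print([round(z, 2) for z in z_scores_mad(data)])   # outlier keeps a larger score, so it stays detectable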
Example: Data Matrix and Dissimilarity Matrix
Data Matrix
point | attribute1 | attribute2
x1 | 1 | 2
x2 | 3 | 5
x3 | 2 | 0
x4 | 4 | 5

Dissimilarity Matrix (with Euclidean Distance)
   | x1   | x2  | x3   | x4
x1 | 0    |     |      |
x2 | 3.61 | 0   |      |
x3 | 2.24 | 5.1 | 0    |
x4 | 4.24 | 1   | 5.39 | 0
44
Distance on Numeric Data: Minkowski Distance
• Minkowski distance: A popular distance measure
  d(i, j) = (|x_i1 − x_j1|^h + |x_i2 − x_j2|^h + … + |x_ip − x_jp|^h)^(1/h)
  where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called L-h norm)
• Properties
  • d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
  • d(i, j) = d(j, i) (Symmetry)
  • d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
45

Special Cases of Minkowski Distance

• h = 1: Manhattan (city block, L1 norm) distance
  • E.g., the Hamming distance: the number of bits that are different between two binary vectors
  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
• h = 2: (L2 norm) Euclidean distance
  d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)
• h → ∞: “supremum” (Lmax norm, L∞ norm) distance
  • This is the maximum difference between any component (attribute) of the vectors
46

Example: Data Matrix and Dissimilarity Matrix

point | attribute 1 | attribute 2
x1 | 1 | 2
x2 | 3 | 5
x3 | 2 | 0
x4 | 4 | 5

Manhattan (L1)
L1 | x1 | x2 | x3 | x4
x1 | 0  |    |    |
x2 | 5  | 0  |    |
x3 | 3  | 6  | 0  |
x4 | 6  | 1  | 7  | 0

Euclidean (L2)
L2 | x1   | x2  | x3   | x4
x1 | 0    |     |      |
x2 | 3.61 | 0   |      |
x3 | 2.24 | 5.1 | 0    |
x4 | 4.24 | 1   | 5.39 | 0

Supremum (L∞)
L∞ | x1 | x2 | x3 | x4
x1 | 0  |    |    |
x2 | 3  | 0  |    |
x3 | 2  | 5  | 0  |
x4 | 3  | 1  | 5  | 0
47
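A minimal Python sketch (not from the slides) that recomputes the three matrices above with a single Minkowski function: h = 1 gives the Manhattan matrix, h = 2 the Euclidean matrix, and h → ∞ the supremum matrix.

# Minkowski (L-h) distances for the example points above.
from itertools import combinations

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def minkowski(a, b, h):
    if h == float("inf"):                      # supremum (L_max) distance
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

for h in (1, 2, float("inf")):
    print(f"h = {h}")
    for (pa, a), (pb, b) in combinations(points.items(), 2):
        print(f"  d({pa},{pb}) = {round(minkowski(a, b, h), 2)}")
# e.g. d(x1,x2) = 5 (L1), 3.61 (L2), 3 (supremum) -- matching the matrices above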

Ordinal Variables

• An ordinal variable can be discrete or continuous

• Order is important, e.g., rank

• Can be treated like interval-scaled

• replace x_if by its rank r_if ∈ {1, …, M_f}
• map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
  z_if = (r_if − 1) / (M_f − 1)

• compute the dissimilarity using methods for interval-scaled variables

48

Attributes of Mixed Type

• A database may contain all attribute types

• Nominal, symmetric binary, asymmetric binary, numeric, ordinal

• One may use a weighted formula to combine their effects

• Combined dissimilarity over the p attributes:
  d(i, j) = ( Σ_{f=1..p} δ_ij^(f) · d_ij^(f) ) / ( Σ_{f=1..p} δ_ij^(f) ), where δ_ij^(f) is an indicator weight (0 if the measurement for attribute f is missing, 1 otherwise)
• f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal
  • Compute ranks r_if and z_if = (r_if − 1) / (M_f − 1)
  • Treat z_if as interval-scaled

49

Cosine Similarity

• A document can be represented by thousands of attributes, each recording the frequency of a

particular word (such as keywords) or phrase in the document.

• Other vector objects: gene features in micro-arrays, …

• Applications: information retrieval, biologic taxonomy, gene feature mapping, …

• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,

where • indicates vector dot product, ||d||: the length of vector d

50

Example: Cosine Similarity

• cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • indicates vector dot product, ||d||: the length of vector d
• Ex: Find the similarity between documents 1 and 2.
  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
  d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
  ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123
  cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
51
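A short Python sketch (not from the slides) that verifies the cosine similarity of the two term-frequency vectors above.

# Cosine similarity of the two example term-frequency vectors.
import math

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine(d1, d2), 2))   # 0.94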

Required Reading

1. Data Mining: Concepts and Techniques, Chapter 3: Data

Preprocessing 3.4-end

2. Data Mining: Concepts and Techniques, Chapter 2 (Getting to

Know Your Data) 2.4

Recommended Reading

1. Data Mining, Fourth Edition: Practical Machine Learning Tools and Techniques,

Chapter 8 – Data Transformation

52

This Presentation is mainly dependent on the textbook: Data Mining: Concepts and Techniques (3rd ed.)

Thank You

53

Saudi Electronic University

1

26/12/2021

College of Computing and Informatics

Bachelor of Science in Information Technology

IT446

Data Mining and Data Warehousing

2

IT446

Data Mining and Data Warehousing

Chapter 4 (Data Warehousing and On-line Analytical

Processing)

Week 5

3

Week Learning Outcomes

▪ Recognize data warehouse definition and how it differs from other

database systems

▪ Demonstrate data warehouse modeling: data cube and OLAP.

▪ Describe data warehouse design and usage.

▪ Perform data warehouse implementation.

▪ Apply data generalization by attribute-oriented induction.

4

Chapter 4: Data Warehousing and On-line Analytical Processing

▪ Basic Concepts of Data Warehouse

▪ Data Warehouse Modeling: Data Cube and OLAP

▪ Data Warehouse Design and Usage

▪ Data Warehouse Implementation

▪ Data Generalization by Attribute-Oriented Induction

5

What is a Data Warehouse?

• Defined in many different ways, but not rigorously.

• A decision support database that is maintained separately from the organization’s

operational database

• Support information processing by providing a solid platform of consolidated,

historical data for analysis.

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile

collection of data in support of management’s decision-making process.”—W. H. Inmon

• Data warehousing:

• The process of constructing and using data warehouses

6

Data Warehouse—Subject-Oriented

• Organized around major subjects, such as customer, product,

sales

• Focusing on the modeling and analysis of data for decision

makers, not on daily operations or transaction processing

• Provide a simple and concise view around particular subject

issues by excluding data that are not useful in the decision

support process

7

Data Warehouse—Integrated

• Constructed by integrating multiple, heterogeneous data sources

• relational databases, flat files, on-line transaction records

• Data cleaning and data integration techniques are applied.

• Ensure consistency in naming conventions, encoding

structures, attribute measures, etc. among different data

sources

• E.g., Hotel price: currency, tax, breakfast covered, etc.

• When data is moved to the warehouse, it is converted.

8

Data Warehouse—Time Variant

• The time horizon for the data warehouse is significantly longer than that of

operational systems

• Operational database: current value data

• Data warehouse data: provide information from a historical perspective

(e.g., past 5-10 years)

• Every key structure in the data warehouse

• Contains an element of time, explicitly or implicitly

• But the key of operational data may or may not contain “time element”

9

Data Warehouse—Nonvolatile

• A physically separate store of data transformed from the operational

environment

• Operational update of data does not occur in the data warehouse

environment

• Does not require transaction processing, recovery, and concurrency

control mechanisms

• Requires only two operations in data accessing:

• initial loading of data and access of data

10

OLTP and OLAP

• The two terms look similar but refer to different kinds of

systems.

• Online transaction processing (OLTP) captures, stores, and

processes data from transactions in real time.

• Online analytical processing (OLAP) uses complex queries to

analyze aggregated historical data from OLTP systems.

11

OLTP vs. OLAP

12

Why a Separate Data Warehouse?

• High performance for both systems

• DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery

• Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

• Different functions and different data:

• missing data: Decision support requires historical data which operational DBs do not typically

maintain

• data consolidation: DS requires consolidation (aggregation, summarization) of data from

heterogeneous sources

• data quality: different sources typically use inconsistent data representations, codes and formats

which have to be reconciled

• Note: There are more and more systems which perform OLAP analysis directly on relational databases

13

Data Warehouse: A Multi-Tiered Architecture

Figure: a multi-tiered architecture with four layers:
• Data Sources: operational DBs and other sources, feeding Extract / Transform / Load / Refresh
• Data Storage: the Data Warehouse and Data Marts, with a Metadata repository and a Monitor & Integrator
• OLAP Engine: OLAP Server (serve)
• Front-End Tools: Analysis, Query, Reports, Data mining
14

Three Data Warehouse Models

• Enterprise warehouse

• collects all of the information about subjects spanning the entire

organization

• Data Mart

• a subset of corporate-wide data that is of value to a specific groups of

users. Its scope is confined to specific, selected groups, such as

marketing data mart

• Independent vs. dependent (directly from warehouse) data mart

• Virtual warehouse

• A set of views over operational databases

• Only some of the possible summary views may be materialized

15

Extraction, Transformation, and Loading (ETL)

• Data extraction

• get data from multiple, heterogeneous, and external sources

• Data cleaning

• detect errors in the data and rectify them when possible

• Data transformation

• convert data from legacy or host format to warehouse format

• Load

• sort, summarize, consolidate, compute views, check integrity, and build indicies

and partitions

• Refresh

• propagate the updates from the data sources to the warehouse

16

Metadata Repository

• Meta data is the data defining warehouse objects. It stores:

• Description of the structure of the data warehouse

• schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents

• Operational meta-data

• data lineage (history of migrated data and transformation path), currency of data (active, archived, or

purged), monitoring information (warehouse usage statistics, error reports, audit trails)

• The algorithms used for summarization

• The mapping from operational environment to the data warehouse

• Data related to system performance

• warehouse schema, view and derived data definitions

• Business data

• business terms and definitions, ownership of data, charging policies

17

From Tables and Spreadsheets to Data Cubes

• A data warehouse is based on a multidimensional data model which views data in the form of a data

cube

• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

• Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter,

year)

• Fact table contains measures (such as dollars_sold) and keys to each of the related dimension

tables

• In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid,

which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms

a data cube.

18

Cube: A Lattice of Cuboids

Figure: lattice of cuboids for dimensions (time, item, location, supplier):
• 0-D (apex) cuboid: all
• 1-D cuboids: time, item, location, supplier
• 2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
• 3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)
19

Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures

• Star schema: A fact table in the middle connected to a set of

dimension tables

• Snowflake schema: A refinement of star schema where some

dimensional hierarchy is normalized into a set of smaller dimension

tables, forming a shape similar to snowflake

• Fact constellations: Multiple fact tables share dimension tables,

viewed as a collection of stars, therefore called galaxy schema or fact

constellation

20

Example of Star Schema

Figure (star schema for sales):
• time dimension table: time_key, day, day_of_the_week, month, quarter, year
• item dimension table: item_key, item_name, brand, type, supplier_type
• branch dimension table: branch_key, branch_name, branch_type
• location dimension table: location_key, street, city, state_or_province, country
• Sales Fact Table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
21

Example of Snowflake Schema

Figure (snowflake schema for sales):
• time dimension table: time_key, day, day_of_the_week, month, quarter, year
• item dimension table: item_key, item_name, brand, type, supplier_key → supplier table (supplier_key, supplier_type)
• branch dimension table: branch_key, branch_name, branch_type
• location dimension table: location_key, street, city_key → city table (city_key, city, state_or_province, country)
• Sales Fact Table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
22

Example of Fact Constellation

Figure (fact constellation / galaxy schema):
• Shared dimension tables: time, item, branch, location (as in the star schema), plus a shipper table (shipper_key, shipper_name, location_key, shipper_type)
• Sales Fact Table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
• Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; Measures: dollars_cost, units_shipped
23

A Concept Hierarchy: Dimension (location)

Figure: schema hierarchy office < city < country < region < all, e.g.
all → Europe → Germany → Frankfurt → …; all → Europe → Spain → …;
all → North_America → Canada → Vancouver → L. Chan; Canada → Toronto → M. Wind; North_America → Mexico → …
24

Data Cube Measures: Three Categories

• Distributive: if the result derived by applying the function to n aggregate values is the

same as that derived by applying the function on all the data without partitioning

• E.g., count(), sum(), min(), max()

• Algebraic: if it can be computed by an algebraic function with M arguments (where M

is a bounded integer), each of which is obtained by applying a distributive aggregate

function

• E.g., avg(), min_N(), standard_deviation()

• Holistic: if there is no constant bound on the storage size needed to describe a

subaggregate.

• E.g., median(), mode(), rank()

25
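A small Python sketch (not from the slides) illustrating why avg() is algebraic: it can be derived from the distributive measures sum() and count() kept per partition, without re-scanning the raw data. The partition values are hypothetical.

# avg() as an algebraic function of the distributive measures (sum, count).
partitions = [[3, 5, 8], [2, 4], [10, 1, 7, 6]]       # hypothetical data partitions

# Each partition only contributes its (sum, count) pair ...
partials = [(sum(p), len(p)) for p in partitions]

# ... and the global average is computed from those pairs alone.
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
print(total_sum / total_count)                         # 5.111...
print(sum(v for p in partitions for v in p) / 9)       # same result from the raw data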

Multidimensional Data

• Sales volume as a function of product, month, and region

Dimensions: Product, Location, Time
Hierarchical summarization paths:
• Product: Industry → Category → Product
• Location: Region → Country → City → Office
• Time: Year → Quarter → Month → Day; Week → Day
26

A Sample Data Cube

Figure: a 3-D sales data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum); e.g., the aggregate cell (sum over quarters, TV, U.S.A) holds the total annual sales of TVs in the U.S.A.
27

Cuboids Corresponding to the Cube

• 0-D (apex) cuboid: all
• 1-D cuboids: product, date, country
• 2-D cuboids: (product, date), (product, country), (date, country)
• 3-D (base) cuboid: (product, date, country)

28

Typical OLAP Operations

• Roll up (drill-up): summarize data

• by climbing up hierarchy or by dimension reduction

• Drill down (roll down): reverse of roll-up

• from higher level summary to lower level summary or detailed data, or

introducing new dimensions

• Slice and dice: project and select

• Pivot (rotate):

• reorient the cube, visualization, 3D to series of 2D planes

• Other operations

• drill across: involving (across) more than one fact table

• drill through: through the bottom level of the cube to its back-end relational

tables (using SQL)

29

OLAP Operations Examples

30

OLAP Operations Examples

31

OLAP Operations Examples

32

OLAP Operations Examples

33

A Star-Net Query Model

Figure: a star-net with one radial line per dimension; each circle on a line is called a footprint (an abstraction level):
• Customer Orders: CUSTOMER, ORDER, CONTRACTS
• Shipping Method: AIR-EXPRESS, TRUCK
• Time: DAILY, QTRLY, ANNUALLY
• Product: PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE
• Location: CITY, DISTRICT, REGION, COUNTRY
• Organization: SALES PERSON, DIVISION
• Promotion
34

Browsing a Data Cube

• Visualization

• OLAP capabilities

• Interactive manipulation

35

Design of Data Warehouse: A Business Analysis Framework

• Four views regarding the design of a data warehouse

• Top-down view

• allows selection of the relevant information necessary for the data warehouse

• Data source view

• exposes the information being captured, stored, and managed by operational

systems

• Data warehouse view

• consists of fact tables and dimension tables

• Business query view

• sees the perspectives of data in the warehouse from the view of end-user

36

Data Warehouse Design Process

• Top-down, bottom-up approaches or a combination of both

• Top-down: Starts with overall design and planning (mature)

• Bottom-up: Starts with experiments and prototypes (rapid)

• From software engineering point of view

• Waterfall: structured and systematic analysis at each step before

proceeding to the next

• Spiral: rapid generation of increasingly functional systems, short turn

around time, quick turn around

• Typical data warehouse design process

• Choose a business process to model, e.g., orders, invoices, etc.

• Choose the grain (atomic level of data) of the business process

• Choose the dimensions that will apply to each fact table record

• Choose the measure that will populate each fact table record

37

Data Warehouse Development: A Recommended Approach

Figure: Define a high-level corporate data model → build Data Marts (with model refinement) → Enterprise Data Warehouse → Multi-Tier Data Warehouse with Distributed Data Marts (model refinement at each step)
38

Data Warehouse Usage

• Three kinds of data warehouse applications

• Information processing

• supports querying, basic statistical analysis, and reporting using crosstabs,

tables, charts and graphs

• Analytical processing

• multidimensional analysis of data warehouse data

• supports basic OLAP operations, slice-dice, drilling, pivoting

• Data mining

• knowledge discovery from hidden patterns

• supports associations, constructing analytical models, performing classification

and prediction, and presenting the mining results using visualization tools

39

From On-Line Analytical Processing (OLAP) to On Line Analytical

Mining (OLAM)

• Why online analytical mining?

• High quality of data in data warehouses

• DW contains integrated, consistent, cleaned data

• Available information processing structure surrounding data warehouses

• ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP

tools

• OLAP-based exploratory data analysis

• Mining with drilling, dicing, pivoting, etc.

• On-line selection of data mining functions

• Integration and swapping of multiple mining functions, algorithms,

and tasks

40

Efficient Data Cube Computation

• Data cube can be viewed as a lattice of cuboids

• The bottom-most cuboid is the base cuboid

• The top-most cuboid (apex) contains only one cell

• How many cuboids in an n-dimensional cube with L levels?
  T = (L1 + 1) × (L2 + 1) × … × (Ln + 1), where Li is the number of hierarchy levels of dimension i
• Materialization of data cube

• Materialize every (cuboid) (full materialization), none (no materialization), or some

(partial materialization)

• Selection of which cuboids to materialize

• Based on size, sharing, access frequency, etc.

41
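A minimal Python sketch (not from the slides) of the cuboid-count formula above, evaluated for a hypothetical cube whose dimensions and level counts are assumptions chosen for illustration.

# Total number of cuboids T = product over dimensions of (L_i + 1); the "+1"
# accounts for the virtual top level "all" of each concept hierarchy.
from math import prod

levels = {"time": 4, "item": 3, "location": 4, "supplier": 2}   # hypothetical L_i values

T = prod(L + 1 for L in levels.values())
print(T)   # (4+1)*(3+1)*(4+1)*(2+1) = 300 cuboids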

The “Compute Cube” Operator

• Cube definition and computation in DMQL
  define cube sales [item, city, year]: sum(sales_in_dollars)
  compute cube sales
• Transform it into an SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
  SELECT item, city, year, SUM (amount)
  FROM SALES
  CUBE BY item, city, year
• Need compute the following Group-Bys:
  (date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), ()
• Figure: lattice of group-bys for (city, item, year): (), (city), (item), (year), (city, item), (city, year), (item, year), (city, item, year)
42
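A minimal "compute cube" sketch in Python (not from the slides): it aggregates SUM(amount) for every group-by of (item, city, year), i.e. all 2^3 cuboids produced by CUBE BY item, city, year. The sales tuples are hypothetical.

# Compute every cuboid of a toy sales relation.
from itertools import combinations
from collections import defaultdict

DIMS = ("item", "city", "year")
sales = [                                   # hypothetical tuples (item, city, year, amount)
    ("TV", "Riyadh", 2022, 400),
    ("TV", "Jeddah", 2022, 250),
    ("PC", "Riyadh", 2023, 900),
    ("PC", "Riyadh", 2022, 700),
]

cube = {}
for k in range(len(DIMS) + 1):
    for group in combinations(DIMS, k):      # one iteration per cuboid
        agg = defaultdict(int)
        for item, city, year, amount in sales:
            row = dict(zip(DIMS, (item, city, year)))
            key = tuple(row[d] for d in group)   # () is the apex cuboid
            agg[key] += amount
        cube[group] = dict(agg)

print(cube[()])                # {(): 2250}  -- apex cuboid
print(cube[("item",)])         # {('TV',): 650, ('PC',): 1600}
print(cube[("item", "year")])  # {('TV', 2022): 650, ('PC', 2023): 900, ('PC', 2022): 700}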

Indexing OLAP Data: Bitmap Index

• Index on a particular column

• Each value in the column has a bit vector: bit-op is fast

• The length of the bit vector: # of records in the base table

• The i-th bit is set if the i-th row of the base table has the value for the indexed column

• not suitable for high cardinality domains

• A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high cardinality domains as well [Wu, et al. TODS’06]

Base table:
Cust | Region  | Type
C1   | Asia    | Retail
C2   | Europe  | Dealer
C3   | Asia    | Dealer
C4   | America | Retail
C5   | Europe  | Dealer

Index on Region:
RecID | Asia | Europe | America
1     | 1    | 0      | 0
2     | 0    | 1      | 0
3     | 1    | 0      | 0
4     | 0    | 0      | 1
5     | 0    | 1      | 0

Index on Type:
RecID | Retail | Dealer
1     | 1      | 0
2     | 0      | 1
3     | 0      | 1
4     | 1      | 0
5     | 0      | 1
43
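A small Python sketch (not from the slides) that builds bitmap indices for the base table above and answers a query with fast bit operations.

# Bitmap indices for the example base table; a query becomes a bitwise AND.
rows = [("C1", "Asia", "Retail"), ("C2", "Europe", "Dealer"), ("C3", "Asia", "Dealer"),
        ("C4", "America", "Retail"), ("C5", "Europe", "Dealer")]

def bitmap_index(rows, col):
    """One bit vector (stored as a Python int) per distinct value of the indexed column."""
    index = {}
    for i, row in enumerate(rows):
        value = row[col]
        index[value] = index.get(value, 0) | (1 << i)   # set the i-th bit
    return index

region_idx = bitmap_index(rows, 1)
type_idx = bitmap_index(rows, 2)

# "European dealers" = bitwise AND of the two bit vectors.
hits = region_idx["Europe"] & type_idx["Dealer"]
print([rows[i][0] for i in range(len(rows)) if hits >> i & 1])   # ['C2', 'C5']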

Indexing OLAP Data: Join Indices

• Join index: JI(R-id, S-id) where R (R-id, …) S (S-id, …)

• Traditional indices map the values to a list of record ids

• It materializes relational join in JI file and speeds up

relational join

• In data warehouses, join index relates the values of the

dimensions of a start schema to rows in the fact table.

• E.g. fact table: Sales and two dimensions city and product

• A join index on city maintains for each distinct city a

list of R-IDs of the tuples recording the Sales in the city

• Join indices can span multiple dimensions

44

Efficient Processing OLAP Queries

• Determine which operations should be performed on the available cuboids

• Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection

• Determine which materialized cuboid(s) should be selected for OLAP op.

• Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are

4 materialized cuboids available:

1) {year, item_name, city}

2) {year, brand, country}

3) {year, brand, province_or_state}

4) {item_name, province_or_state} where year = 2004

Which should be selected to process the query?

• Explore indexing structures and compressed vs. dense array structs in MOLAP

45

OLAP Server Architectures

• Relational OLAP (ROLAP)

• Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware

• Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional

tools and services

• Greater scalability

• Multidimensional OLAP (MOLAP)

• Sparse array-based multidimensional storage engine

• Fast indexing to pre-computed summarized data

• Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)

• Flexibility, e.g., low level: relational, high-level: array

• Specialized SQL servers (e.g., Redbricks)

• Specialized support for SQL queries over star/snowflake schemas

46

Attribute-Oriented Induction

• Proposed in 1989 (KDD ‘89 workshop)

• Not confined to categorical data nor particular measures

• How it is done?

• Collect the task-relevant data (initial relation) using a relational database

query

• Perform generalization by attribute removal or attribute generalization

• Apply aggregation by merging identical, generalized tuples and

accumulating their respective counts

• Interaction with users for knowledge presentation

47

Attribute-Oriented Induction: An Example

Example: Describe general characteristics of graduate students in the University

database

• Step 1. Fetch relevant set of data using an SQL statement, e.g.,

Select * (i.e., name, gender, major, birth_place, birth_date,

residence, phone#, gpa)

from student

where student_status in {“Msc”, “MBA”, “PhD” }

• Step 2. Perform attribute-oriented induction

• Step 3. Present results in generalized relation, cross-tab, or rule forms

48

Class Characterization: An Example

Initial Relation:
Name | Gender | Major | Birth-Place | Birth_date | Residence | Phone # | GPA
Jim Woodman | M | CS | Vancouver, BC, Canada | 8-12-76 | 3511 Main St., Richmond | 687-4598 | 3.67
Scott Lachance | M | CS | Montreal, Que, Canada | 28-7-75 | 345 1st Ave., Richmond | 253-9106 | 3.70
Laura Lee | F | Physics | Seattle, WA, USA | 25-8-70 | 125 Austin Ave., Burnaby | 420-5232 | 3.83
… | … | … | … | … | … | … | …

Generalization plan: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus}; Birth-Place generalized to country (Birth_region); Birth_date generalized to Age_range; Residence generalized to city; Phone # removed; GPA generalized to {Excl, VG, …}

Prime Generalized Relation:
Gender | Major | Birth_region | Age_range | Residence | GPA | Count
M | Science | Canada | 20-25 | Richmond | Very-good | 16
F | Science | Foreign | 25-30 | Burnaby | Excellent | 22
… | … | … | … | … | … | …

Cross-tabulation (count by Gender and Birth_Region):
Gender \ Birth_Region | Canada | Foreign | Total
M | 16 | 14 | 30
F | 10 | 22 | 32
Total | 26 | 36 | 62
49

Basic Principles of Attribute-Oriented Induction

• Data focusing: task-relevant data, including dimensions, and the result is the initial

relation

• Attribute-removal: remove attribute A if there is a large set of distinct values for A

but (1) there is no generalization operator on A, or (2) A’s higher level concepts are

expressed in terms of other attributes

• Attribute-generalization: If there is a large set of distinct values for A, and there

exists a set of generalization operators on A, then select an operator and generalize

A

• Attribute-threshold control: typical 2-8, specified/default

• Generalized relation threshold control: control the final relation/rule size

50

Attribute-Oriented Induction: Basic Algorithm

• InitialRel: Query processing of task-relevant data, deriving the initial relation.

• PreGen: Based on the analysis of the number of distinct values in each attribute,

determine generalization plan for each attribute: removal? or how high to

generalize?

• PrimeGen: Based on the PreGen plan, perform generalization to the right level to

derive a “prime generalized relation”, accumulating the counts.

• Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping

into rules, cross tabs, visualization presentations.

51

Presentation of Generalized Results

• Generalized relation:

• Relations where some or all attributes are generalized, with counts or other aggregation values

accumulated.

• Cross tabulation:

• Mapping results into cross tabulation form (similar to contingency tables).

• Visualization techniques:

• Pie charts, bar charts, curves, cubes, and other visual forms.

• Quantitative characteristic rules:

• Mapping generalized result into characteristic rules with quantitative information associated

with it, e.g.,

grad(x) ∧ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%]

52

Mining Class Comparisons

• Comparison: Comparing two or more classes
• Method:
  • Partition the set of relevant data into the target class and the contrasting class(es)
  • Generalize both classes to the same high-level concepts
  • Compare tuples with the same high-level descriptions
  • Present for every tuple its description and two measures
    • support – distribution within single class
    • comparison – distribution between classes
  • Highlight the tuples with strong discriminant features
• Relevance Analysis:
  • Find attributes (features) which best distinguish different classes
53

Concept Description vs. Cube-Based OLAP

• Similarity:

Data generalization

• Presentation of data summarization at multiple levels of abstraction

• Interactive drilling, pivoting, slicing and dicing

• Differences:

• OLAP has systematic preprocessing, query independent, and can drill down to

rather low level

• AOI has automated desired level allocation, and may perform dimension

relevance analysis/ranking when there are many relevant dimensions

• AOI works on the data which are not in relational forms

•

54

Required Reading

1. Data Mining: Concepts and Techniques, chapter 4: Data

Warehousing and Online Analytical Processing

Recommended Reading

1. W. H. Inmon. 2002. Building the Data Warehouse, 3rd Edition (3rd. ed.). John Wiley

& Sons, Inc., USA.

2. Data Warehouse Tutorial for Beginners: Learn in 7 Days

https://www.guru99.com/data-warehousing-tutorial.html

55

This Presentation is mainly dependent on the textbook: Data Mining: Concepts and Techniques (3rd ed.)

Thank You

56

Saudi Electronic University

26/12/2021

1

College of Computing and Informatics

Bachelor of Science in Information Technology

IT446

Data Mining and Data Warehousing

2

IT446

Data Mining and Data Warehousing

Chapter 6 (Mining Frequent Patterns, Associations, and

Correlations)

Week 6

3

Week Learning Outcomes

▪ Describe types of frequent itemsets.

▪ Explain the process of generating association rules.

▪ Employ Frequent itemsets algorithms.

4

Chapter 6: Mining Frequent Patterns, Associations, and Correlations

▪ Basic Concepts of Mining Frequent Patterns

▪ Frequent Itemset Mining Methods

▪ Which Patterns Are Interesting?—Pattern Evaluation

Methods

5

What Is Frequent Pattern Analysis?

• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data

set

•

First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association

rule mining

• Motivation: Finding inherent regularities in data

• What products were often purchased together?— Beer and diapers?!

• What are the subsequent purchases after buying a PC?

• What kinds of DNA are sensitive to this new drug?

• Can we automatically classify web documents?

•

Applications

•

Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream)

analysis, and DNA sequence analysis.

6

Why Is Freq. Pattern Mining Important?

• Freq. pattern: An intrinsic and important property of datasets

• Foundation for many essential data mining tasks

• Association, correlation, and causality analysis

• Sequential, structural (e.g., sub-graph) patterns

• Pattern analysis in spatiotemporal, multimedia, time-series, and

stream data

• Classification: discriminative, frequent pattern analysis

• Cluster analysis: frequent pattern-based clustering

• Data warehousing: iceberg cube and cube-gradient

• Semantic data compression: fascicles

• Broad applications

7

Basic Concepts: Association Rules

Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
  • support, s, probability that a transaction contains X ∪ Y
  • confidence, c, conditional probability that a transaction having X also contains Y
• Let minsup = 50%, minconf = 50%
  Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
• Association rules: (many more!)
  • Beer → Diaper (60%, 100%)
  • Diaper → Beer (60%, 75%)
(Figure: customer buys beer / customer buys diaper / customer buys both)
8
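A short Python sketch (not from the slides) that computes support and confidence for the rules above from the transaction table.

# Support and confidence for the example transactions (minsup = minconf = 50%).
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))        # 0.6  -> 60% support
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> Beer -> Diaper, 100% confidence
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> Diaper -> Beer, 75% confidence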

Closed Patterns and Max-Patterns

• A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
• Solution: Mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
• Closed pattern is a lossless compression of freq. patterns
  • Reducing the # of patterns and rules
9

Closed Patterns and Max-Patterns

• Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
  • Min_sup = 1
• What is the set of closed itemsets?
  • <a1, …, a100>: 1
  • <a1, …, a50>: 2
• What is the set of max-patterns?
  • <a1, …, a100>: 1
• What is the set of all patterns?
  • !!
10

Computational Complexity of Frequent Itemset Mining

• How many itemsets are potentially to be generated in the worst case?
  • The number of frequent itemsets to be generated is sensitive to the minsup threshold
  • When minsup is low, there exist potentially an exponential number of frequent itemsets
  • The worst case: M^N where M: # distinct items, and N: max length of transactions
• The worst case complexity vs. the expected probability
  • Ex. Suppose Walmart has 10^4 kinds of products
    • The chance to pick up one product: 10^-4
    • The chance to pick up a particular set of 10 products: ~10^-40
    • What is the chance this particular set of 10 products to be frequent 10^3 times in 10^9 transactions?
11

Scalable Frequent Itemset Mining Methods

• Apriori: A Candidate Generation-and-Test Approach

• Improving the Efficiency of Apriori

• FPGrowth: A Frequent Pattern-Growth Approach

• ECLAT: Frequent Pattern Mining with Vertical Data Format

12

The Downward Closure Property and Scalable Mining

Methods

• The downward closure property of frequent patterns

• Any subset of a frequent itemset must be frequent

• If {beer, diaper, nuts} is frequent, so is {beer, diaper}

• i.e., every transaction having {beer, diaper, nuts} also contains {beer,

diaper}

• Scalable mining methods: Three major approaches

• Apriori (Agrawal & Srikant@VLDB’94)

• Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)

• Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)

13

Apriori: A Candidate Generation & Test Approach

• Apriori pruning principle: If there is any itemset which is infrequent, its

superset should not be generated/tested! (Agrawal & Srikant @VLDB’94,

Mannila, et al. @ KDD’ 94)

• Method:

• Initially, scan DB once to get frequent 1-itemset

• Generate length (k+1) candidate itemsets from length k frequent itemsets

• Test the candidates against DB

• Terminate when no frequent or candidate set can be generated

14

The Apriori Algorithm—An Example

Database TDB (Supmin = 2):
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

C1 (after 1st scan):               L1:
Itemset | sup                      Itemset | sup
{A} | 2                            {A} | 2
{B} | 3                            {B} | 3
{C} | 3                            {C} | 3
{D} | 1                            {E} | 3
{E} | 3

C2 (candidates): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

C2 (after 2nd scan):               L2:
Itemset | sup                      Itemset | sup
{A, B} | 1                         {A, C} | 2
{A, C} | 2                         {B, C} | 2
{A, E} | 1                         {B, E} | 3
{B, C} | 2                         {C, E} | 2
{B, E} | 3
{C, E} | 2

C3 (candidates): {B, C, E}

L3 (after 3rd scan):
Itemset | sup
{B, C, E} | 2
15

The Apriori Algorithm (Pseudo-Code)

Ck: Candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;

16
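A compact Python implementation sketch (not from the slides) following the pseudo-code above, run on the TDB example with absolute min_support = 2.

# Apriori: candidate generation (self-join + downward-closure pruning) and support counting.
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    L = {c: s for c, s in count({frozenset([i]) for i in items}).items() if s >= min_support}
    result = dict(L)
    k = 1
    while L:
        # Self-join Lk to build (k+1)-candidates, then prune: every k-subset must be frequent.
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        L = {c: s for c, s in count(candidates).items() if s >= min_support}
        result.update(L)
        k += 1
    return result

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in sorted(apriori(tdb, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)
# {'A'} 2, {'B'} 3, {'C'} 3, {'E'} 3, {'A','C'} 2, {'B','C'} 2, {'B','E'} 3, {'C','E'} 2, {'B','C','E'} 2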

Implementation of Apriori

• How to generate candidates?

• Step 1: self-joining Lk

• Step 2: pruning

• Example of Candidate-generation

• L3={abc, abd, acd, ace, bcd}

• Self-joining: L3*L3

• abcd from abc and abd

• acde from acd and ace

• Pruning:

• acde is removed because ade is not in L3

• C4 = {abcd}

17

How to Count Supports of Candidates?

• Why counting supports of candidates a problem?

• The total number of candidates can be very huge

• One transaction may contain many candidates

• Method:

• Candidate itemsets are stored in a hash-tree

• Leaf node of hash-tree contains a list of itemsets and counts

• Interior node contains a hash table

• Subset function: finds all the candidates contained in a transaction

18

Counting Supports of Candidates Using Hash Tree

Figure: a hash tree over candidate 3-itemsets (leaves such as 234, 567, 145, 136, 124, 457, 125, 458, 159, 345, 356, 357, 689, 367, 368), with branching by hash buckets 1,4,7 / 2,5,8 / 3,6,9; the subset function walks transaction 1 2 3 5 6 down the tree (1+2356, 12+356, 13+56, …) to find all contained candidates
19
Candidate Generation: An SQL Implementation

• SQL Implementation of candidate generation

• Suppose the items in Lk-1 are listed in an order

• Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
• Use object-relational extensions like UDFs, BLOBs, and Table functions for efficient implementation
[See: S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational
database systems: Alternatives and implications. SIGMOD’98]
20
Further Improvement of the Apriori Method
• Major computational challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
• Improving Apriori: general ideas
• Reduce passes of transaction database scans
• Shrink number of candidates
• Facilitate support counting of candidates
21
Partition: Scan Database Only Twice
• Any itemset that is potentially frequent in DB must be frequent in at least one of the
partitions of DB
• Scan 1: partition database and find local frequent patterns
• Scan 2: consolidate global frequent patterns
• A. Savasere, E. Omiecinski and S. Navathe, VLDB’95
Figure: DB = DB1 + DB2 + … + DBk; if sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, …, supk(i) < σ|DBk| in every partition, then sup(i) < σ|DB| in the whole database
22
DHP: Reduce the Number of Candidates
• A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Candidates: a, b, c, d, e
• Frequent 1-itemset: a, b, d, e
• Hash table entries (count | 2-itemsets in the bucket): 35 | {ab, ad, ae}; 88 | {bd, be, de}; …; 102 | {yz, qs, wt}
• ab is not a candidate 2-itemset if the sum of count of {ab, ad, ae} is below support threshold
• J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD’95
23
23
Sampling for Frequent Patterns
• Select a sample of original database, mine frequent patterns within sample using
Apriori
• Scan database once to verify frequent itemsets found in sample, only borders of
closure of frequent patterns are checked
• Example: check abcd instead of ab, ac, …, etc.
• Scan database again to find missed frequent patterns
• H. Toivonen. Sampling large databases for association rules. In VLDB’96
24
DIC: Reduce Number of Scans
• Once both A and D are determined frequent, the counting of AD begins
• Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
• Figure: itemset lattice {} → A, B, C, D → AB, AC, AD, BC, BD, CD → ABC, ABD, ACD, BCD → ABCD; unlike Apriori, DIC starts counting 2-itemsets, 3-itemsets, … dynamically while the transactions are still being scanned
• S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD’97
25
Construct FP-tree from a Transaction Database
min_support = 3
TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Sort frequent items in frequency descending order, f-list
3. Scan DB again, construct FP-tree

Header Table (Item | frequency): f | 4, c | 4, a | 3, b | 3, m | 3, p | 3
F-list = f-c-a-b-m-p

FP-tree (figure): {} → f:4 → c:3 → a:3 → m:2 → p:2, with branches a:3 → b:1 → m:1 and f:4 → b:1, plus {} → c:1 → b:1 → p:1
26
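A minimal FP-tree construction sketch in Python (not from the slides) for the transaction table above: scan once for frequent items, re-order each transaction by the f-list, then insert shared prefixes into the tree. Ties in frequency may be ordered differently from the slide's f-list; any consistent order is valid.

# Build an FP-tree (min_support = 3) for the example transactions.
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(transactions, min_support):
    freq = Counter(i for t in transactions for i in t)
    f_list = [i for i, c in freq.most_common() if c >= min_support]
    root, header = Node(None, None), {i: [] for i in f_list}
    for t in transactions:
        node = root
        for item in [i for i in f_list if i in t]:        # ordered frequent items only
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)                # node-link in the header table
            child.count += 1
            node = child
    return root, header, f_list

transactions = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
root, header, f_list = build_fp_tree(transactions, 3)
print(f_list)   # ['f', 'c', 'a', 'm', 'p', 'b'] -- same items as the slide's f-list, tie order differs
print({item: [n.count for n in nodes] for item, nodes in header.items()})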
Partition Patterns and Databases
• Frequent patterns can be partitioned into subsets according to
f-list
• F-list = f-c-a-b-m-p
• Patterns containing p
• Patterns having m but no p
•…
• Patterns having c but no a nor b, m, p
• Pattern f
• Completeness and non-redundancy