《数据仓库与数据挖掘》教学大纲

课程代码

045100931

课程名称

数据仓库与数据挖掘

英文名称

Data Warehouse and Data Mining

课程类别

选修课

课程性质

选修

学时

总学时: 48  实验学时: 16

学分

2.5

开课学期

第七学期

开课单位

计算机科学与工程学院

适用专业

计算机科学与技术、网络工程、信息安全

授课语言

中英双语授课

先修课程

高级语言程序设计、算法设计与分析

课程对毕业要求的支撑

本课程对学生达到如下毕业要求有如下贡献:

1)工程知识:能够将数学、自然科学、工程基础和专业知识用于解决复杂工程问题

2)问题分析:能够应用数学、自然科学和工程科学的基本原理,识别、表达、并通过文献研究分析复杂工程问题,以获得有效结论

3)研究:能够基于科学原理并采用科学方法对复杂工程问题进行研究,包括设计实验、分析与解释数据、并通过信息综合得到合理有效的结论

课程目标

完成课程后,学生将具备以下能力:

1)熟练掌握数据仓库与数据挖掘的基本原理、常用算法、领域专业英语词汇,熟练应用主流数据挖掘与机器学习开发平台,为解决以后学习或工作中遇到的实际数据分析问题奠定坚实的知识基础。[1, 2, 3]

2)针对实际问题,基于本课程所学知识体系,能够分析该问题所需挖掘的知识类型,通过查阅、研读国内外最新相关文献,提供切实可行的解决方案。[1, 2]

课程简介

本课程是一门培养学生具有一定数据分析能力的选修课。课程的主要目的是让学生掌握数据仓库与数据挖掘基本概念与算法,针对实际工作与应用中产生的大数据,用数据挖掘技术来发现数据中隐藏的知识或规律,从而为生产、生活、商务活动、社会活动等提供决策支持。要求学生通过本课程的学习,认识数据仓库和数据挖掘在当今大数据时代中的重要作用,了解数据仓库的基本原理和实现方法,掌握数据预处理技术和数据挖掘常用算法(包括关联分析、分类与预测、聚类分析、链接分析、数据摘要等),为解决实际问题打下坚实的知识基础。

教学内容与学时分配

(一)课程目的、意义与内容组织、学时安排介绍(1学时)

思政要素:以《国务院关于印发新一代人工智能发展规划的通知》、教育部关于印发《高等学校人工智能创新行动计划》的通知、《广东省新一代人工智能创新发展行动计划(2018-2020年)》为纲,阐述本课程在新一代人工智能发展中的作用与意义,激发学生“实干兴邦”的爱国奋斗精神。

教学要求:要求掌握课程的主要目的与任务,了解数据挖掘在新人工智能时代中的重要作用。


(二)数据仓库与数据挖掘的基本概念(2学时)

1)关系数据库及高级数据库

2)数据挖掘的知识功能类型

3)模式评价

4)数据挖掘系统的基本架构

5)数据挖掘面临的一些主要问题

6)领域主要文献源

教学要求:要求掌握数据挖掘的知识类型、如何评价挖掘出的知识、数据挖掘系统设计的层次架构,及在大数据时代数据挖掘面临的主要问题。

重点:不同知识类型的异同点分析。

难点:知识评价的主观与客观标准。


(三)数据预处理(3学时)

1)数据预处理的必要性与意义

2)数据清洗、数据集成和数据变换

3)数据约简与离散化

教学要求:要求掌握预处理的必要性,及数据预处理的常用方法。

重点:数据归一化变换、基于熵的有监督离散化。

难点:数据约简的降维与采样技术。


(四)数据仓库和联机分析技术(3学时)

1)数据仓库的定义

2)数据仓库设计模式

3)数据仓库的实施

教学要求:要求掌握数据仓库与常用的关系数据库的不同点,数据仓库的三种设计模式,汇聚计算函数的分类,物化计算的空间与时间复杂性分析。

重点:数据立方体、OLTP与OLAP的不同点分析。

难点:物化计算的空间与时间复杂性分析。


(五)频繁模式与关联规则的挖掘(3学时)

1)基本概念

2)Apriori布尔单维关联规则挖掘算法

3)频繁模式树

4)基于Apriori的多维、量化关联规则挖掘算法

教学要求:要求熟练掌握Apriori算法

重点:Apriori算法的核心思想所在

难点:Apriori算法的时间与空间复杂性分析


(六)分类和预测(9学时)

1)有监督学习、半监督与无监督学习基本概念、泛化误差的方差与偏差分解

2)决策树学习

3)k-最近邻学习

4)贝叶斯学习

5)神经网络与深度学习

6)集成学习基本概念

7)样本复杂性基本概念

教学要求:要求熟练掌握有监督学习与无监督学习的区别,熟练掌握决策树、贝叶斯、最近邻学习算法,能够运用开源工具进行深度学习

重点:决策树、贝叶斯学习、深度学习及开源工具的利用

难点:泛化误差的偏差与方差分解、反向传播算法及样本复杂性


(七)聚类分析(6学时)

1)聚类分析简介

2)算法复杂性、NP-难与NP-完全简介

3)基于划分的聚类算法:

      k-均值(k-means)、k-中心点(k-center)、k-中位点(k-median)

      谱聚类算法

4)基于层次的聚类算法:

     全链接聚类算法

     单链接聚类算法

教学要求:要求熟练掌握常见的聚类准则,以及k-均值(k-means)、k-中心点(k-center)、k-中位点、谱聚类、层次化聚类算法

重点:聚类准则的合理性与有效性评价

难点:算法复杂性分析


(八)链接分析(3学时)

1)链接分析简介

2)PageRank算法

3)HITS算法

教学要求:要求深刻理解PageRank与HITS算法的核心思想,明确二者的优缺点,能够独立实现PageRank与HITS算法。

重点:PageRank的数学模型与线性代数中特征值问题的关联性。

难点:马尔科夫建模及其收敛性


(九)数据摘要(2学时)

1)数据摘要的基本概念

2)文本摘要、视频摘要、音频摘要

3)基于最大覆盖模型的摘要算法

教学要求:明确数据摘要在大数据时代的必要性,能够透彻理解最大覆盖模型及其算法

重点:文本摘要的数学模型及近似算法。

难点:针对不同数据类型摘要的数学建模

实验教学(包括实验学时、实习学时、其他)

教学方法

课程教学以课堂讲授、课外作业、实验以及参与授课教师的科研项目等共同实施。

考核方式

本课程将平时表现与期末考试相结合进行考核,成绩比例为:

平时成绩:20%

实验成绩:10%

期末考试(闭卷):70%

教材及参考书

现用教材:

数据挖掘概念与技术,Jiawei Han(韩家炜),M. Kamber, Jian Pei(裴健),北京:机械工业出版社(第三版),2012


主要参考资料:

[1] 周志华,机器学习,清华大学出版社,2016

[2] Steven Bird, Ewan Klein, Edward Loper. Natural Language Processing with Python, O'REILLY, 2012.

[3] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[4] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press, 2015.

[5] 荫蒙(Inmon W.H)著,王志海 等 译,数据仓库(原书第四版),机械工业出版社,2011

[6] Charu C. Aggarwal. Neural Networks and Deep Learning. Springer, 2018.

制定人及制定时间

王家兵,2019年4月5日


“Data Warehouse and Data Mining” Syllabus

Course Code

045100931

Course Title

Data Warehouse and Data Mining

Course Category

Elective Course

Course Nature

Elective Course

Class Hours

Total Hours: 48; Lab Hours: 16

Credits

2.5

Semester

7th

Institute

The School of Computer Science and Engineering

Program Oriented

Computer Science and Technology, Network Engineering, Information Security

Teaching Language

Bilingual teaching in Chinese and English

Prerequisites

High-Level Language Programming, The Design and Analysis of Computer Algorithms

Student Outcomes (Special Training Ability)

(1). Engineering Knowledge: An ability to apply knowledge of mathematics, science, engineering fundamentals and engineering specialization to the solution of complex engineering problems.

(2). Problem Analysis: An ability to identify, formulate, and analyze complex engineering problems through literature research, reaching substantiated conclusions using basic principles of mathematics, science, and engineering.

(3). Research: An ability to conduct investigations of complex engineering problems based on scientific theories and adopting scientific methods including design of experiments, analysis and interpretation of data and synthesis of information to provide valid conclusions.

Course Objectives

Upon completion of the course, students will have the following abilities:

(1) Master the basic principles of data warehousing and data mining, the commonly used algorithms, and the professional English vocabulary of the field; be proficient with mainstream data mining and machine learning platforms; and thereby lay a solid foundation for solving practical data analysis problems met in later study or work. [1, 2, 3]

(2) Given a practical problem, analyze which type of knowledge needs to be mined and, by consulting and studying recent literature, propose a feasible solution. [1, 2]

Course Description

This course introduces the concepts and techniques of data warehousing and data mining. Data mining, also popularly referred to as knowledge discovery in databases (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the web, and other big data sources. Topics include data preprocessing (data cleaning, data integration, data transformation, data reduction, and data discretization), data warehouse and OLAP technology, data warehouse implementation, mining frequent patterns and association rules, classification and prediction, cluster analysis, linkage analysis, and data summarization.

Teaching Content and Class Hours Distribution

1. Course motivation, content organization, and class-hour allocation (1 hour)


2. The basic concepts about Data Warehouse and Data Mining (2 hours)

(1) Data Mining: On what kind of data? DBMS and advanced Databases

(2) Data mining functionality

(3) Patterns evaluation

(4) The architecture of data mining systems

(5) Major issues in data mining

(6) Top journals and conferences of the data mining and machine learning community

Requirements: Master the mining functionalities, pattern evaluation, the architecture of data mining systems, and the major issues in the era of big data.

Key points of teaching: Analysis of (dis)similarities of different mining functionalities.

Difficult points of teaching: Subjective and objective criteria of pattern evaluation.


3. Data preprocessing (3 hours)

(1) Why preprocess the data?

(2) Data cleaning

(3) Data integration and transformation

(4) Data reduction

(5) Data discretization

Requirements: Master the commonly used methods of data preprocessing.

Key points of teaching: Data normalization and supervised discretization.

Difficult points of teaching: Data reduction by dimensionality reduction and sampling.
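
To illustrate the key points above, the following is a minimal Python sketch of min-max normalization and of choosing a single entropy-based (supervised) cut point. It is an illustration only, not part of the official course materials; the function names and the sample data are hypothetical.

```python
from collections import Counter
from math import log2


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map every value to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_entropy_split(values, labels):
    """Return (cut_point, weighted_entropy) for the best binary split.

    Candidate cuts are midpoints between consecutive distinct values;
    the chosen cut minimizes the size-weighted entropy of the two bins.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut, best_score


if __name__ == "__main__":
    ages = [1, 2, 3, 10, 11, 12]          # hypothetical attribute values
    classes = ["a", "a", "a", "b", "b", "b"]
    print(min_max_normalize(ages))
    print(best_entropy_split(ages, classes))
```

Applying `best_entropy_split` recursively to each resulting bin, with a stopping rule, gives a full entropy-based discretization; the sketch shows only the first split.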


4. Data warehouse and OLAP (3 hours)

(1) The definition of data warehouse

(2) A multi-dimensional data model and design schemas

(3) Data warehouse architecture

(4) Data warehouse implementation

Requirements: Master the differences between a data warehouse and a conventional relational database (DBMS), the three design schemas of a data warehouse, the classification of aggregation functions, and the space and time complexity analysis of materialization.

Key points of teaching: The data cube; the differences between OLTP and OLAP.

Difficult points of teaching: The analysis of space and time complexity of materialization.
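
As a small illustration of why full materialization is costly, the sketch below counts the cuboids of a data cube. It assumes the standard counting rule that a dimension with L hierarchy levels (not counting the virtual top level "all") contributes a factor of L + 1; the function names and dimension names are hypothetical.

```python
from itertools import combinations
from math import prod


def cuboid_count(levels_per_dimension):
    """Total number of cuboids when dimension i has levels_per_dimension[i]
    hierarchy levels, excluding the virtual top level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)


def enumerate_cuboids(dimensions):
    """List every group-by (cuboid) for dimensions without hierarchies."""
    return [
        combo
        for k in range(len(dimensions) + 1)
        for combo in combinations(dimensions, k)
    ]


if __name__ == "__main__":
    dims = ["time", "item", "location"]   # hypothetical dimensions
    for cuboid in enumerate_cuboids(dims):
        print(cuboid)
    # Ten dimensions with four levels each already give millions of cuboids.
    print(cuboid_count([4] * 10))
```

Without hierarchies the count reduces to 2^n for n dimensions, which is why partial materialization is usually preferred in practice.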


5. Mining frequent patterns and association rules (3 hours)

(1) Basic concepts

(2) Single-dimensional Boolean association rules: Apriori algorithm

(3) Frequent-pattern tree

(4) Mining multi-dimensional and quantitative association rules

Requirements: Master the Apriori algorithm.

Key points of teaching: The core idea of Apriori algorithm.

Difficult points of teaching: The analysis of time and space complexity of Apriori algorithm.
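
The core idea of Apriori, that every subset of a frequent itemset must itself be frequent, can be sketched in a few lines of Python. This is an unoptimized illustration under that assumption, not the course's reference implementation; the function name and the toy transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets.

    transactions: iterable of item collections
    min_support:  minimum absolute support count
    """
    transactions = [frozenset(t) for t in transactions]

    # Frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: one pass over the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


if __name__ == "__main__":
    baskets = [
        {"bread", "milk"},
        {"bread", "diapers", "beer", "eggs"},
        {"milk", "diapers", "beer", "cola"},
        {"bread", "milk", "diapers", "beer"},
        {"bread", "milk", "diapers", "cola"},
    ]
    for itemset, count in sorted(apriori(baskets, 3).items(), key=lambda x: len(x[0])):
        print(sorted(itemset), count)
```

The prune step is what keeps the candidate set small; the number of database passes equals the length of the longest frequent itemset plus one.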


6. Classification and prediction (9 hours)

(1) Supervised, semi-supervised, and unsupervised learning; the bias-variance decomposition of generalization error

(2) Classification by decision tree induction

(3) k-nearest neighbor learning

(4) Bayesian Classification

(5) Neural Networks and deep learning

(6) Ensemble learning

(7) Sample complexity

Requirements: Master the differences between supervised and unsupervised learning, decision tree learning, naive Bayes learning, and neural networks and deep learning. Be familiar with open-source tools for deep learning.

Key points of teaching: Decision trees, Bayesian learning, deep learning, and open-source tools for deep learning.

Difficult points of teaching: The bias-variance decomposition of generalization error, backpropagation, sample complexity.
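
Of the classifiers listed in this unit, k-nearest neighbor is the simplest to write down. The following is a minimal sketch assuming Euclidean distance and majority voting; the function name and the toy data are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points under Euclidean distance."""
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    labels = ["a", "a", "a", "b", "b", "b"]
    print(knn_predict(points, labels, (0.5, 0.5)))
    print(knn_predict(points, labels, (5.5, 5.5)))
```

There is no training phase: all work happens at query time, which makes prediction cost grow linearly with the size of the training set in this naive form.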


7. Cluster Analysis (6 hours)

(1) What is Cluster Analysis?

(2) Complexity analysis: NP-complete and NP-hard

(3) A Categorization of Major Clustering Methods

(4) Partitioning Methods

    k-means, k-center, k-median

    Spectral clustering

(5) Hierarchical Methods

    Single-linkage clustering

    Complete-linkage clustering

Requirements: Master the commonly used clustering criteria and the k-means, k-center, k-median, spectral, and hierarchical clustering algorithms.

Key points of teaching: Rationality and validity evaluation of clustering criteria.

Difficult points of teaching: Complexity analysis.
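
As an illustration of the partitioning methods in this unit, here is a minimal sketch of Lloyd's heuristic for k-means. It assumes Euclidean distance and caller-supplied initial centers, so the run is deterministic; the function names and sample points are hypothetical.

```python
from math import dist


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [
        min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        for p in points
    ]


def kmeans(points, initial_centers, max_iter=100):
    """Lloyd's heuristic: alternate assignment and mean-update steps until
    the centers stop changing or max_iter is reached."""
    centers = [tuple(float(x) for x in c) for c in initial_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center is the coordinate-wise mean of the members.
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


if __name__ == "__main__":
    data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    centers, labels = kmeans(data, [(0, 0), (10, 10)])
    print(centers)
    print(labels)
```

Each iteration can only lower the within-cluster sum of squares, so the procedure terminates, but only at a local optimum; finding the global optimum of the k-means objective is NP-hard in general.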


8. Linkage Analysis (3 hours)

(1) A brief introduction to linkage analysis

(2) The PageRank algorithm

(3) The HITS algorithm

Requirements: Understand the core ideas of the PageRank and HITS algorithms, identify the strengths and weaknesses of each, and be able to implement both independently.

Key points of teaching: The relation between the mathematical model of PageRank and the eigenvalue problem.

Difficult points of teaching: Markov chain modeling of web surfing and its convergence.
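
The link between PageRank and the eigenvalue problem can be made concrete with power iteration: the rank vector is the stationary distribution of a random-surfer Markov chain. The sketch below assumes a damping factor of 0.85 and spreads the rank of dangling pages uniformly; both are common conventions rather than requirements of this course, and the function name and toy graph are hypothetical.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict page -> rank, with ranks summing to 1.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by dangling pages (no out-links) is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in pages
        }
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


if __name__ == "__main__":
    web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical graph
    for page, score in pagerank(web).items():
        print(page, round(score, 4))
```

Damping makes the chain irreducible and aperiodic, which is what guarantees that the iteration converges to a unique stationary vector regardless of the starting distribution.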


9. Data Summarization (2 hours)

(1) A brief introduction to data summarization

(2) Text, video, and speech summarization

(3) The maximum coverage model for summarization

Requirements: Understand the necessity of data summarization in the big data era, and master the maximum coverage model and its algorithms.

Key points of teaching: The mathematical model and approximate algorithm for text summarization.

Difficult points of teaching: Mathematical modeling for summarization of different data types.
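
A common approximation algorithm for the maximum coverage model is the greedy rule: repeatedly pick the candidate that covers the most not-yet-covered elements, which achieves a (1 - 1/e) approximation ratio. The sketch below assumes each candidate sentence is represented by the set of concepts it covers and that the budget is a number of sentences; this representation, the function name, and the sample data are hypothetical and may differ from the model taught in class.

```python
def greedy_max_coverage(candidates, budget):
    """Greedy approximation for maximum coverage.

    candidates: dict mapping a candidate name to the set of elements it covers
    budget:     maximum number of candidates to select
    Returns (chosen_names, covered_elements).
    """
    covered, chosen = set(), []
    remaining = dict(candidates)
    for _ in range(budget):
        if not remaining:
            break
        # Pick the candidate with the largest marginal gain.
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # nothing new can be covered
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


if __name__ == "__main__":
    sentences = {
        "s1": {"data", "mining", "patterns"},
        "s2": {"data", "warehouse"},
        "s3": {"mining", "patterns"},
        "s4": {"olap", "cube", "warehouse"},
    }
    print(greedy_max_coverage(sentences, budget=2))
```

For text, video, or audio, only the definition of "element" changes (words or concepts, shots, audio segments); the selection loop stays the same.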

Experimental Teaching

Yes

Teaching Method

The course teaching is carried out through classroom teaching, computer experiment, and participation in research projects.

Examination Method

The final score consists of the following three parts:

(1) Performance in class: 20%

(2) Computer experiment: 10%

(3) Final examination (closed-book): 70%

Teaching Materials and Reference Books

Textbook:

Jiawei Han, M. Kamber, Jian Pei. Data Mining: Concepts and Techniques (3rd Edition), 2012.


Reference books:

[1] Zhi-Hua Zhou. Machine Learning (in Chinese). Tsinghua University Press, 2016.

[2] Steven Bird, Ewan Klein & Edward Loper. Natural Language Processing with Python. O'REILLY, 2012.

[3] Christopher M. Bishop. Pattern Recognition and Machine Learning, Springer, 2006.

[4] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press, 2015.

[5] Inmon W.H. Building the Data Warehouse, 4th Edition, ISBN: 978-0-7645-9944-6. Wiley, 2005.

[6] Charu C. Aggarwal. Neural Networks and Deep Learning. Springer, 2018.

Prepared by Whom and When

Jiabing Wang, 04-05-2019


《数据仓库与数据挖掘》实验教学大纲

课程代码

045100931

课程名称

数据仓库与数据挖掘

英文名称

Data Warehouse and Data Mining

课程类别

选修课

课程性质

选修

学时

总学时:48  实验:16

学分

2.5

开课学期

第七学期

开课单位

计算机科学与工程学院

适用专业

计算机科学与技术、网络工程、信息安全

授课语言

中英双语授课

先修课程

高级语言程序设计、算法设计与分析

毕业要求(专业培养能力)

本课程对学生达到如下毕业要求有如下贡献:

1)工程知识:能够将数学、自然科学、工程基础和专业知识用于解决复杂工程问题

2)问题分析:能够应用数学、自然科学和工程科学的基本原理,识别、表达、并通过文献研究分析复杂工程问题,以获得有效结论

3)研究:能够基于科学原理并采用科学方法对复杂工程问题进行研究,包括设计实验、分析与解释数据、并通过信息综合得到合理有效的结论

课程培养学生的能力(教学目标)

本实验课程培养学生在掌握课堂所学理论知识(算法)基础上的实际动手能力,能够使用高级程序设计语言独立编程实现数据预处理、关联规则挖掘、分类、聚类分析、链接分析、数据摘要等算法,并用实际数据验证挖掘效果。[1, 2, 3]

课程简介

本课程是一门培养学生具有一定数据分析能力的选修课。课程的主要目的是让学生掌握数据仓库与数据挖掘基本概念与算法,针对实际工作与应用中产生的大数据,用数据挖掘技术来发现数据中隐藏的知识或规律,从而为生产、生活、商务活动、社会活动等提供决策支持。要求学生通过本课程的学习,认识数据仓库和数据挖掘在当今大数据与新人工智能时代中的重要作用,了解数据仓库的基本原理和实现方法,熟练掌握数据预处理技术和数据挖掘常用算法(包括关联分析、分类与预测、聚类分析、链接分析、数据摘要等)及其程序设计,能够用所学知识解决实际数据分析问题。

主要仪器设备与软件

配置有Python等高级程序设计语言的个人计算机。

实验报告

实验报告应包括以下主要内容,并以电子版提交:

1)实验任务描述

2)实验目的

3)实验数据的描述(包括数据的来源、数据的特征,如应用领域、数据集的大小、特征的数据类型、特征数目等)

4)实验过程

5)实验结果及分析(包括得出的结论、存在的问题及可能的改进方向等)

6)程序源代码

考核方式

实验成绩由以下三部分综合评定:

1)程序设计的正确性(40%)

2)实验结果的合理性(30%)

3)实验报告的规范性(30%)

教材、实验指导书及教学参考书目

教材:

数据挖掘概念与技术,Jiawei Han(韩家炜),M. Kamber, Jian Pei(裴健),北京:机械工业出版社(第三版),2012


实验指导书:自编

说明:由于数据挖掘领域发展日新月异,新技术、新算法、新工具每年都大量涌现(仅每年的领域顶级学术会议ICML、NIPS、KDD等就有上千篇论文发表),为了尽力保证相关实验的新颖性与先进性,可能每年实现的具体算法都有所不同,因此实验指导书每年再具体下达。


教学参考书目及网络资源:

[1] 周志华,机器学习,清华大学出版社,2016

[2] Steven Bird, Ewan Klein, Edward Loper. Natural Language Processing with Python, O'REILLY, 2012.

[3] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[4] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press, 2015.

[5] 荫蒙(Inmon W.H) 著,王志海 等 译,数据仓库(原书第四版),机械工业出版社,2011

[6] Charu C. Aggarwal. Neural Networks and Deep Learning. Springer, 2018.

[7] 数据挖掘开源工具包:https://scikit-learn.org/stable/.

[8] 深度学习开源平台:https://keras.io/.

制定人及发布时间

王家兵,2019年4月5日


《数据仓库与数据挖掘》实验教学内容与学时分配

| 实验项目编号 | 实验项目名称 | 实验学时 | 实验内容提要 | 实验类型 | 实验要求 | 每组人数 | 主要仪器设备与软件 |
|---|---|---|---|---|---|---|---|
| 1 | 分类算法实现 | 6 | 实现某一分类算法(具体算法在实验前下达),并用实际数据加以验证。 | 综合性 | 必做 | 1 | 配置有Python等高级程序设计语言的个人计算机 |
| 2 | 聚类算法实现 | 6 | 实现某一聚类算法(具体算法在实验前下达),并用实际数据加以验证。 | 综合性 | 必做 | 1 | 配置有Python等高级程序设计语言的个人计算机 |
| 3 | 链接分析或数据摘要算法实现 | 4 | 实现链接分析或数据摘要算法(具体算法在实验前下达),并用实际数据加以验证。 | 综合性 | 必做 | 1 | 配置有Python等高级程序设计语言的个人计算机 |


“Data Warehouse and Data Mining” Experimental Teaching Syllabus

Course Code

045100931

Course Title

Data Warehouse and Data Mining

Course Category

Elective Course

Course Nature

Elective Course

Class Hours

Total Hours: 48; Lab Hours: 16

Credits

2.5

Semester

7th

Institute

The School of Computer Science and Engineering

Program Oriented

Computer Science and Technology, Network Engineering, Information Security

Teaching Language

Bilingual teaching in Chinese and English

Prerequisites

High-Level Language Programming, The Design and Analysis of Computer Algorithms

Student Outcomes (Special Training Ability)

(1). Engineering Knowledge: An ability to apply knowledge of mathematics, science, engineering fundamentals and engineering specialization to the solution of complex engineering problems.

(2). Problem Analysis: An ability to identify, formulate, and analyze complex engineering problems through literature research, reaching substantiated conclusions using basic principles of mathematics, science, and engineering.

(3). Research: An ability to conduct investigations of complex engineering problems based on scientific theories and adopting scientific methods including design of experiments, analysis and interpretation of data and synthesis of information to provide valid conclusions.

Course Objectives

This experimental course develops students' hands-on ability, building on the theory and algorithms taught in lectures, to independently implement data preprocessing, association rule mining, classification, cluster analysis, linkage analysis, and data summarization algorithms in a high-level programming language, and to validate the results on real data. [1, 2, 3]

Course Description

This course introduces the concepts and techniques of data warehousing and data mining. Data mining, also popularly referred to as knowledge discovery in databases (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the web, and other big data sources. Topics include data preprocessing (data cleaning, data integration, data transformation, data reduction, and data discretization), data warehouse and OLAP technology, data warehouse implementation, and the implementation of algorithms for mining frequent patterns and association rules, classification and prediction, cluster analysis, linkage analysis, and data summarization.

Instruments and Equipment

A personal computer with Python or another high-level programming language installed.

Experiment Report

The experiment report should include the following contents and be submitted in electronic form:

(1) Experimental task description

(2) The purpose of experiment

(3) The description of experimental data (including data sources, data characteristics, such as application areas, data set size, the data type, the number of features, etc.)

(4) The experimental setup

(5) The experimental results and analysis (including the conclusions drawn, the existing problems and possible directions for improvement)

(6) The source code

Assessment

The score consists of the following three parts:

(1) The correctness of programming (40%)

(2) The rationality of the experimental results (30%)

(3) The standardization of the experiment report (30%)

Teaching Materials and Reference Books

Textbook:

Jiawei Han, M. Kamber, Jian Pei. Data Mining: Concepts and Techniques (3rd Edition). China Machine Press. 2012.


Reference books and internet resources:

[1] Zhi-Hua Zhou. Machine Learning (in Chinese). Tsinghua University Press, 2016.

[2] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'REILLY, 2012.

[3] Christopher M. Bishop. Pattern Recognition and Machine Learning, Springer, 2006.

[4] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press, 2015.

[5] Inmon W.H. Building the Data Warehouse, 4th Edition, ISBN: 978-0-7645-9944-6, Wiley, 2005.

[6] Charu C. Aggarwal. Neural Networks and Deep Learning. Springer, 2018.

[7] Open Source Package for Data Mining: https://scikit-learn.org/stable/.

[8] Open Source Package for Deep Learning: https://keras.io/.

Prepared by Whom and When

Jiabing Wang, 04-05-2019

“Data Warehouse and Data Mining” Experimental Teaching Arrangements

| No. | Experiment Item | Class Hours | Content Summary | Category | Requirements | Students per Group | Instruments, Equipment, and Software |
|---|---|---|---|---|---|---|---|
| 1 | Implementation of Classification Algorithms | 6 | Implement a specified classification algorithm (the specific algorithm is announced before the experiment), and verify the implementation using real data. | Comprehensive | Compulsory | 1 | Personal computer with the Python programming language |
| 2 | Implementation of Clustering Algorithms | 6 | Implement a specified clustering algorithm (the specific algorithm is announced before the experiment), and verify the implementation using real data. | Comprehensive | Compulsory | 1 | Personal computer with the Python programming language |
| 3 | Implementation of Linkage Analysis or Data Summarization Algorithms | 4 | Implement a linkage analysis or data summarization algorithm (the specific algorithm is announced before the experiment), and verify the implementation using real data. | Comprehensive | Compulsory | 1 | Personal computer with the Python programming language |