[Chinese-taught Class] Big Data Technology

Published: 2019-09-04

《大数据技术》教学大纲

课程代码	045102751
课程名称	大数据技术
英文名称	Big Data Technology
课程类别	专业领域课
课程性质	选修
学时	总学时：40 上机学时：12 实验学时：0实践学时：0
学分	2.5
开课学期	第六学期
开课单位	计算机科学与工程学院
适用专业	计算机科学技术、网络工程、信息安全
授课语言	中文授课
先修课程	计算机网络，操作系统，程序设计，数据库
毕业要求（专业培养能力）	本课程对学生达到如下毕业要求有如下贡献： 1.思政建设：实现计算机专业知识教学与立德树人教育的有机融合；激发学生“实干兴邦”的爱国奋斗精神。 2.工程知识：掌握扎实的基础知识、专业基本原理、方法和手段，能够将应用数学、自然科学、本专业基础知识和专业知识用于解决大数据的管理和分析计算问题，为大数据技术应用和相关工程实践打下基础。 3.问题分析：能够应用数学、自然科学和工程科学的基本原理，识别、表达、并通过文献研究分析大数据应用工程中的复杂问题，以获得有效结论。 4.设计/开发解决方案：能够设计针对大数据应用工程复杂问题的解决方案，包括满足特定需求的大数据系统设计、关键技术选择、应用工程实施流程或方案设计，并能够在设计环节中体现创新意识，考虑社会、健康、安全、法律、文化以及环境等因素。 5.研究：能够基于科学原理并采用科学方法对大数据应用工程复杂问题进行研究，包括设计实验、分析与解释数据、并通过信息综合得到合理有效的结论。 6.使用现代工具：能够针对大数据应用工程复杂问题，开发、选择与使用恰当的技术、资源、现代工程工具和信息技术工具，包括对复杂问题的预测与模拟，并能够理解其局限性。
课程培养学生的能力（教学目标）	完成课程后，学生将具备以下能力：（1）掌握分布式计算技术、大数据的分析计算模型、存储平台、分析处理技术、编程开发技术的基本知识，培养学生发现问题、解决问题的基本能力。［1、2］（2）掌握大数据存储管理、加工处理和分析计算的基本原理和基本技术，学生具有大数据的分析管理基本能力。［1、3、4］（3）掌握常用的大数据编程和应用开发技术，并具有初步大数据应用系统设计能力，培养学生的大数据技术应用实践能力。［3、5］
课程简介	本课程主要面向有一定的计算机网络、操作系统、程序设计和数据库基础知识，并且具有一定软件开发能力的高年级学生。课程主要介绍传统分布式计算的基本原理和基本开发技术，大数据存储管理和平台架构技术，大数据计算模型和分析处理算法原理，以及大数据系统构建和应用开发技术。课程需要学生阅读大量的相关文献来获得对技术的理解，还要求学生通过完成一系列实验来掌握大数据编程实践和分析处理技术方法及工具。通过本课程的学习，希望学生能够在了解和掌握大数据管理平台和分析处理技术的基础上，学会应用大数据处理技术解决现实数据处理、分析和应用问题。课程的知识模块包括分布式计算基础知识、分布式计算编程技术、大数据存储平台技术、大数据的计算模型、大数据分析处理技术、大数据编程开发技术、大数据应用开发技术七个方面。
教学内容与学时分配	（一）绪论课程目的、意义与内容组织、学时安排介绍2学时教学要求：要求掌握课程的主要目的与任务，了解大数据技术在计算机应用中的作用。（二）分布式计算基础知识2学时（1）分布式计算概念1学时（2）分布式计算模式1学时教学要求：要求掌握分布式计算的定义和优缺点，了解经典的分布式计算项目，掌握分布式计算模式的原理和概念，包括并行计算、网络计算、对等计算、网格计算、云计算、雾计算和大数据计算等。重点：各种分布式计算模式的原理难点：理解各种分布式计算模式的区别与联系（三）分布式计算编程技术6学时（1）进程间通信0.5学时（2）Socket编程 2学时（3）RMI编程 2学时（4）P2P编程 1.5学时教学要求：本章要求学生掌握进程间通信基本概念和原理，熟悉Socket API的基本概念和Socket编程方法，理解RMI和P2P范型，熟悉RMI和P2P编程的基本方法。重点：分布式计算编程的基本方法。难点：进程间通信的原理，Socket编程、RMI和P2P编程的应用方法。（四）大数据概念与存储技术4学时（1）大数据的背景与相关概念1学时（2）大数据存储技术3学时 a)分布式存储基础知识 b)大数据存储概念与技术原理 c)大数据存储平台与系统重点：分布式存储的基本知识和原理、大数据存储的技术原理和平台。难点：各种分布式存储技术原理、大数据存储技术原理。（五）大数据的计算模型6学时（1）传统并行计算模型介绍，包括PRAM模型、BSP模型和LogP模型等1学时（2）MapReduce计算模型 2学时（3）分布式内存计算模型 2学时（4）大数据流式计算 1学时重点：MapReduce计算模型及其大数据分析应用方法，分布式内存计算模型Spark及其大数据分析应用方法。难点：基于MapReduce计算模型的大数据分析计算，基于分布式内存计算模型Spark的大数据分析计算。（六）大数据分析处理技术7学时（1）大数据分析处理平台介绍1学时（2）Hadoop平台原理与生态系统 2学时（3）Impala原理与平台 2学时（4）阿里大数据平台2学时重点：Hadoop、Impala和阿里大数据平台的系统架构和技术原理、及基本使用方法。难点：理解Hadoop平台技术原理和Impala技术原理。（七）大数据编程开发技术7学时（1）HDFS基本使用方法和编程1学时（2）MapReduce大数据并行计算方法2学时（3）HBase数据库的开发方法 1学时（4）Hive数据仓库开发方法 1 学时（4）Spark编程方法2学时重点：HDFS的数据存储方法、MapReduce和Spark数据处理方法难点：设计基于MapReduce的大数据并行计算程序和基于Spark的大数据并行计算程序。（八）大数据应用开发技术6学时（1）大数据应用研究与发展方向介绍 1学时（2）大数据应用开发技术与方法1学时（3）实时医疗大数据分析案例1学时（4）保险大数据分析案例分析 1学时（5）生物信息大数据计算案例2学时重点：大数据技术在大数据应用开发的使用方法，大数据分析计算方法。难点：大数据应用的系统设计与开发方法。
实验教学（包括上机学时、实验学时、实践学时）	有
教学方法	课程教学以课堂教学、课外作业、实验教学、网络以及授课教师的科研项目于积累等共同实施。
考核方式	本课程注重过程考核，成绩比例为：平时作业和课堂表现：20% 课程实验（实验报告）：20% 期末考试（闭卷）：60%
教材及参考书	建议教材：林伟伟，刘波编著《分布式计算、云计算与大数据》，机械工业出版社，2017年，第二版次。主要参考资料： [1] 杨正洪著，《大数据技术入门》，清华大学出版社，2016 [2] 林子雨编著，《大数据技术原理与应用（第2版）》，人民邮电出版社出版，2017. [3] 张良均等著，《Hadoop大数据分析与挖掘实战》，机械工业出版社，2015 [4] M.L. Liu著，《分布式计算原理和应用》，清华大学出版社，2004 [5]孙宇熙著，《云计算与大数据》，人民邮电出版社，2017 [6]刘鹏著，《大数据》，电子工业出版社，2017
制定人及制定时间	林伟伟，2017年7月6日

“Big Data Technology” Syllabus

Course Code	045102751
Course Title	Big Data Technology
Course Category	Specialty-related Course
Course Nature	Elective Course
Class Hours	Total: 40; Computer lab: 12; Experiment: 0; Practice: 0
Credits	2.5
Semester	Sixth semester
Institute	School of Computer Science and Engineering
Applicable Majors	Computer Science and Technology, Network Engineering, Information Security
Teaching Language	Chinese
Prerequisites	Computer Networks, Operating Systems, Programming, Databases
Student Outcomes (Special Training Ability)	This course contributes to the following graduation requirements: 1. Ideological and political education: integrate the teaching of professional computing knowledge with moral education, and inspire in students a patriotic spirit of hard work in the service of the country. 2. Engineering knowledge: students acquire solid fundamentals, basic professional principles, methods and techniques, and can apply mathematics, natural science and their professional knowledge to big data management and analytic computing problems, laying a foundation for big data applications and related engineering practice. 3. Problem analysis: students can apply the basic principles of mathematics, natural science and engineering science to identify, formulate and, through literature research, analyze complex problems in big data application engineering and reach valid conclusions. 4. Design/development of solutions: students can design solutions to complex problems in big data application engineering, including big data system design for specific requirements, selection of key technologies, and design of implementation workflows or plans, showing innovation in the design process while considering social, health, safety, legal, cultural and environmental factors. 5. Research: students can investigate complex problems in big data application engineering using scientific principles and methods, including designing experiments, analyzing and interpreting data, and synthesizing information to reach sound conclusions. 6. Use of modern tools: students can develop, select and use appropriate techniques, resources, modern engineering tools and information technology tools for complex problems in big data application engineering, including prediction and simulation of those problems, and understand the limitations of the tools.
Teaching Objectives	After completing the course, students will be able to: (1) Master the basic knowledge of distributed computing, big data analytic computing models, storage platforms, analysis and processing techniques, and programming techniques, and develop a basic ability to identify and solve problems. [I, II] (2) Master the basic principles and techniques of big data storage management, processing and analytic computing, and acquire a basic ability to analyze and manage big data. [II, III, IV] (3) Master commonly used big data programming and application development techniques, acquire a preliminary ability to design big data application systems, and develop practical skills in applying big data technology. [III, V]
Course Description	This course is intended for upper-level undergraduates who have basic knowledge of computer networks, operating systems, programming and databases, and some software development ability. It introduces the basic principles and development techniques of traditional distributed computing, big data storage management and platform architecture, big data computing models and the principles of analysis and processing algorithms, and the construction of big data systems and application development. Students are expected to read a substantial amount of related literature to understand the technology, and to complete a series of experiments to master big data programming practice and the methods and tools of analysis and processing. By the end of the course, students should understand big data management platforms and analysis and processing techniques, and be able to apply them to real-world data processing, analysis and application problems. The course comprises seven knowledge modules: fundamentals of distributed computing, distributed computing programming, big data storage platforms, big data computing models, big data analysis and processing, big data programming, and big data application development.
Teaching Content and Class Hours Distribution	I. Introduction (2 hours). Purpose, significance, content organization and schedule of the course. Teaching requirements: understand the main purpose and tasks of the course and the role of big data technology in computer applications.
II. Foundations of Distributed Computing (2 hours). (1) Concepts of distributed computing, 1 hour; (2) Distributed computing paradigms, 1 hour. Teaching requirements: master the definition, advantages and disadvantages of distributed computing; know the classic distributed computing projects; master the principles and concepts of the distributed computing paradigms, including parallel computing, network computing, peer-to-peer computing, grid computing, cloud computing, fog computing and big data computing. Focus: principles of the different distributed computing paradigms. Difficult points: the differences and connections between the paradigms.
III. Distributed Computing Programming (6 hours). (1) Inter-process communication (IPC), 0.5 hour; (2) Socket programming, 2 hours; (3) RMI programming, 2 hours; (4) P2P programming, 1.5 hours. Teaching requirements: master the basic concepts and principles of IPC; be familiar with the basic concepts of the Socket API and with Socket programming; understand the RMI and P2P paradigms; be familiar with the basic methods of RMI and P2P programming. Focus: basic methods of distributed computing programming. Difficult points: principles of IPC; applying Socket, RMI and P2P programming.
IV. Big Data Concepts and Storage Techniques (4 hours). (1) Background and related concepts of big data, 1 hour; (2) Big data storage techniques, 3 hours: a) fundamentals of distributed storage; b) concepts and technical principles of big data storage; c) big data storage platforms and systems. Focus: basic knowledge and principles of distributed storage; technical principles and platforms of big data storage. Difficult points: principles of the various distributed storage techniques and of big data storage.
V. Big Data Computing Models (6 hours). (1) Traditional parallel computing models (PRAM, BSP, LogP, etc.), 1 hour; (2) MapReduce computing model, 2 hours; (3) Distributed in-memory computing model, 2 hours; (4) Big data stream computing, 1 hour. Focus: the MapReduce computing model and its use in big data analysis; the distributed in-memory computing model (Spark) and its use in big data analysis. Difficult points: big data analytic computing based on MapReduce and on the distributed in-memory computing model (Spark).
VI. Big Data Analysis and Processing Techniques (7 hours). (1) Overview of big data analysis and processing platforms, 1 hour; (2) Hadoop platform principles and ecosystem, 2 hours; (3) Impala principles and platform, 2 hours; (4) Alibaba big data platform, 2 hours. Focus: system architecture, technical principles and basic use of Hadoop, Impala and the Alibaba big data platform. Difficult points: the technical principles of the Hadoop and Impala platforms.
VII. Big Data Programming (7 hours). (1) Basic use and programming of HDFS, 1 hour; (2) MapReduce parallel computing methods, 2 hours; (3) HBase database development, 1 hour; (4) Hive data warehouse development, 1 hour; (5) Spark programming, 2 hours. Focus: data storage in HDFS; data processing with MapReduce and Spark. Difficult points: designing parallel big data programs based on MapReduce and on Spark.
VIII. Big Data Application Development (6 hours). (1) Research and development trends in big data applications, 1 hour; (2) Techniques and methods of big data application development, 1 hour; (3) Case study: real-time medical big data analysis, 1 hour; (4) Case study: insurance big data analysis, 1 hour; (5) Case study: bioinformatics big data computing, 2 hours. Focus: using big data techniques in application development; methods of big data analytic computing. Difficult points: system design and development methods for big data applications.
Experimental Teaching	Yes
Teaching Method	The course is delivered through lectures, homework assignments, lab work, online resources, and the instructor's research projects and accumulated experience.
Examination Method	The course emphasizes continuous assessment. The final grade is weighted as follows: assignments and class performance, 20%; lab work (lab reports), 20%; final exam (closed-book), 60%.
Teaching Materials and Reference Books	Suggested textbook: 林伟伟, 刘波. 《分布式计算、云计算与大数据》, 2nd ed. 机械工业出版社, 2017. Main references: [1] 杨正洪. 《大数据技术入门》. 清华大学出版社, 2016. [2] 林子雨. 《大数据技术原理与应用（第2版）》. 人民邮电出版社, 2017. [3] 张良均 et al. 《Hadoop大数据分析与挖掘实战》. 机械工业出版社, 2015. [4] M.L. Liu. 《分布式计算原理和应用》. 清华大学出版社, 2004. [5] 孙宇熙. 《云计算与大数据》. 人民邮电出版社, 2017. [6] 刘鹏. 《大数据》. 电子工业出版社, 2017.
Prepared by / Date	Lin Weiwei, 6 July 2017
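
As an illustration of the socket programming covered in Chapter III, the following is a minimal sketch of a client/server file-information query over TCP. It is written in Python for brevity, although the course lab uses a Java environment. The one-line text protocol (`OK <size>` / `ERR not found`), the function names and the `notes.txt` sample file are all assumptions made for this example, not part of the course material.

```python
import os
import socket
import tempfile
import threading


def handle_client(conn, base_dir):
    """Answer one query: the client sends a file name, the server replies with its size."""
    with conn, conn.makefile("r", encoding="utf-8") as reader:
        name = reader.readline().strip()
        # basename() keeps the lookup inside base_dir (no "../" traversal).
        path = os.path.join(base_dir, os.path.basename(name))
        if os.path.isfile(path):
            reply = f"OK {os.path.getsize(path)}\n"
        else:
            reply = "ERR not found\n"
        conn.sendall(reply.encode("utf-8"))


def serve(server_sock, base_dir, max_requests):
    """Accept and answer a fixed number of connections, then return."""
    for _ in range(max_requests):
        conn, _addr = server_sock.accept()
        handle_client(conn, base_dir)


def query_file_size(host, port, name):
    """Client side: send one file name and return the server's one-line reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall((name + "\n").encode("utf-8"))
        with sock.makefile("r", encoding="utf-8") as reader:
            return reader.readline().strip()


def demo():
    """Run server and client in one process on an ephemeral localhost port."""
    with tempfile.TemporaryDirectory() as base_dir:
        with open(os.path.join(base_dir, "notes.txt"), "wb") as f:
            f.write(b"hello")  # a 5-byte sample file (hypothetical)
        with socket.create_server(("127.0.0.1", 0)) as server_sock:
            port = server_sock.getsockname()[1]
            worker = threading.Thread(target=serve, args=(server_sock, base_dir, 2))
            worker.start()
            found = query_file_size("127.0.0.1", port, "notes.txt")
            missing = query_file_size("127.0.0.1", port, "absent.txt")
            worker.join()
    return found, missing


if __name__ == "__main__":
    print(demo())
```

The server handles requests one at a time to keep the sketch short; a Java version would follow the same accept/read/reply structure using `java.net.ServerSocket` and `java.net.Socket`.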

《大数据技术》实验教学大纲

课程代码	045102751
课程名称	大数据技术
英文名称	Big Data Technology
课程类别	专业领域课
课程性质	选修
学时	总学时：40 上机学时：12 实验学时：0实践学时：0
学分	2.5
开课学期	第六学期
开课单位	计算机科学与工程学院
适用专业	计算机科学技术、网络工程、信息安全
授课语言	中文授课
先修课程	计算机网络，操作系统，程序设计，数据库
毕业要求（专业培养能力）	本课程对学生达到如下毕业要求有如下贡献： 1.工程知识：掌握扎实的基础知识、专业基本原理、方法和手段，能够将应用数学、自然科学、本专业基础知识和专业知识用于解决大数据的管理和分析计算问题，为大数据技术应用和相关工程实践打下基础。 2.问题分析：能够应用数学、自然科学和工程科学的基本原理，识别、表达、并通过文献研究分析大数据应用工程中的复杂问题，以获得有效结论。 3.设计/开发解决方案：能够设计针对大数据应用工程复杂问题的解决方案，包括满足特定需求的大数据系统设计、关键技术选择、应用工程实施流程或方案设计，并能够在设计环节中体现创新意识，考虑社会、健康、安全、法律、文化以及环境等因素。 4.研究：能够基于科学原理并采用科学方法对大数据应用工程复杂问题进行研究，包括设计实验、分析与解释数据、并通过信息综合得到合理有效的结论。 5.使用现代工具：能够针对大数据应用工程复杂问题，开发、选择与使用恰当的技术、资源、现代工程工具和信息技术工具，包括对复杂问题的预测与模拟，并能够理解其局限性。
课程培养学生的能力（教学目标）	完成课程后，学生将具备以下能力：（1）掌握分布式计算技术、大数据的分析计算模型、存储平台、分析处理技术、编程开发技术的基本知识，培养学生发现问题、解决问题的基本能力。［1、2］（2）掌握大数据存储管理、加工处理和分析计算的基本原理和基本技术，学生具有大数据的分析管理基本能力。［1、3、4］（3）掌握常用的大数据编程和应用开发技术，并具有初步大数据应用系统设计能力，培养学生的大数据技术应用实践能力。［3、5］
课程简介	本课程主要面向有一定的计算机网络，操作系统，程序设计和数据库基础知识，并且具有一定软件开发能力的高年级学生。课程主要介绍传统分布式计算的基本原理和基本开发技术，大数据存储管理和平台架构技术，大数据计算模型和分析处理算法原理，以及大数据系统构建和应用开发技术。课程需要学生阅读大量的相关文献来获得对技术的理解，还要求学生通过完成一系列实验来掌握大数据编程实践和分析处理技术方法及工具。通过本课程的学习，希望学生能够在了解和掌握大数据管理平台和分析处理技术的基础上，学会应用大数据处理技术解决现实数据处理、分析和应用问题。课程的知识模块包括分布式计算基础知识、分布式计算编程技术、大数据存储平台技术、大数据的计算模型、大数据分析处理技术、大数据编程开发技术、大数据应用开发技术七个方面。
主要仪器设备与软件	设备：PC服务器软件：Java开发环境软件、Hadoop生态软件等
实验报告	要求给出实验的方法、步骤、过程和结论。
考核方式	实验报告：50％实验操作：50％
教材、实验指导书及教学参考书目	建议教材：林伟伟，刘波编著《分布式计算、云计算与大数据》，机械工业出版社，2017年，第二版次。主要参考资料： [1] 杨正洪著，《大数据技术入门》，清华大学出版社，2016 [2] 林子雨编著，《大数据技术原理与应用（第2版）》，人民邮电出版社出版，2017. [3] 张良均等著，《Hadoop大数据分析与挖掘实战》，机械工业出版社，2015 [4] M.L. Liu著，《分布式计算原理和应用》，清华大学出版社，2004 [5]孙宇熙著，《云计算与大数据》，人民邮电出版社，2017 [6]刘鹏著，《大数据》，电子工业出版社，2017
制定人及发布时间	林伟伟，2017年7月6日

《大数据技术》实验教学内容与学时分配

实验项目编号	实验项目名称	实验学时	实验内容提要	实验类型	实验要求	每组人数	主要仪器设备与软件
1	分布式计算程序设计	4	基于Socket API或Java RMI客户服务器通信程序，通过客户端程序对服务器程序的调用，实现简单信息查询功能（如对服务器的文件信息查询）。	设计性	必做	1	PC机、JAVA开发环境
2	大数据基本操作	4	掌握分布式文件系统HDFS的文件基本操作，熟悉MapReduce程序运行方法，掌握HBase数据库基本操作和Hive数据仓库基础使用，并能设计简单的大数据存储程序（如HDFS或HBase数据存储与读取程序）。	演示性	必做	1-2	PC服务器、Hadoop生态软件
3	日志大数据分析计算	4	使用MapReduce或Hive工具分析日志大数据（如手机用户上网日志数据），实现日志的基本查询和统计功能（如通过统计用户上网日志数据TOP URL功能，实现用户上网偏好分析）。	综合性	必做	1-2	PC服务器、Hadoop生态软件

“Big Data Technology” Experiment Syllabus

Course Code	045102751
Course Title	Big Data Technology
Course Category	Specialty-related Course
Course Nature	Elective Course
Class Hours	Total: 40; Computer lab: 12; Experiment: 0; Practice: 0
Credits	2.5
Semester	Sixth semester
Institute	School of Computer Science and Engineering
Applicable Majors	Computer Science and Technology, Network Engineering, Information Security
Teaching Language	Chinese
Prerequisites	Computer Networks, Operating Systems, Programming, Databases
Student Outcomes (Special Training Ability)	This course contributes to the following graduation requirements: 1. Engineering knowledge: students acquire solid fundamentals, basic professional principles, methods and techniques, and can apply mathematics, natural science and their professional knowledge to big data management and analytic computing problems, laying a foundation for big data applications and related engineering practice. 2. Problem analysis: students can apply the basic principles of mathematics, natural science and engineering science to identify, formulate and, through literature research, analyze complex problems in big data application engineering and reach valid conclusions. 3. Design/development of solutions: students can design solutions to complex problems in big data application engineering, including big data system design for specific requirements, selection of key technologies, and design of implementation workflows or plans, showing innovation in the design process while considering social, health, safety, legal, cultural and environmental factors. 4. Research: students can investigate complex problems in big data application engineering using scientific principles and methods, including designing experiments, analyzing and interpreting data, and synthesizing information to reach sound conclusions. 5. Use of modern tools: students can develop, select and use appropriate techniques, resources, modern engineering tools and information technology tools for complex problems in big data application engineering, including prediction and simulation of those problems, and understand the limitations of the tools.
Teaching Objectives	After completing the course, students will be able to: (1) Master the basic knowledge of distributed computing, big data analytic computing models, storage platforms, analysis and processing techniques, and programming techniques, and develop a basic ability to identify and solve problems. [I, II] (2) Master the basic principles and techniques of big data storage management, processing and analytic computing, and acquire a basic ability to analyze and manage big data. [II, III, IV] (3) Master commonly used big data programming and application development techniques, acquire a preliminary ability to design big data application systems, and develop practical skills in applying big data technology. [III, V]
Course Description	This course is intended for upper-level undergraduates who have basic knowledge of computer networks, operating systems, programming and databases, and some software development ability. It introduces the basic principles and development techniques of traditional distributed computing, big data storage management and platform architecture, big data computing models and the principles of analysis and processing algorithms, and the construction of big data systems and application development. Students are expected to read a substantial amount of related literature to understand the technology, and to complete a series of experiments to master big data programming practice and the methods and tools of analysis and processing. By the end of the course, students should understand big data management platforms and analysis and processing techniques, and be able to apply them to real-world data processing, analysis and application problems. The course comprises seven knowledge modules: fundamentals of distributed computing, distributed computing programming, big data storage platforms, big data computing models, big data analysis and processing, big data programming, and big data application development.
Main Equipment and Software	Equipment: PC server. Software: Java development environment, Hadoop ecosystem software, etc.
Experiment Report	The report must describe the method, steps, process and conclusions of each experiment.
Assessment	Experiment Report: 50% Experimental Operation: 50%
Teaching Materials and Reference Books	Suggested textbook: 林伟伟, 刘波. 《分布式计算、云计算与大数据》, 2nd ed. 机械工业出版社, 2017. Main references: [1] 杨正洪. 《大数据技术入门》. 清华大学出版社, 2016. [2] 林子雨. 《大数据技术原理与应用（第2版）》. 人民邮电出版社, 2017. [3] 张良均 et al. 《Hadoop大数据分析与挖掘实战》. 机械工业出版社, 2015. [4] 林伟伟, 彭绍亮. 《云计算与大数据技术理论及应用》. 清华大学出版社, July 2019. [5] 孙宇熙. 《云计算与大数据》. 人民邮电出版社, 2017. [6] 刘鹏. 《大数据》. 电子工业出版社, 2017.
Prepared by / Date	Lin Weiwei, 6 July 2017

“Big Data Technology” Experimental Teaching Arrangements

No.	Experiment Item	Class Hours	Content Summary	Category	Requirement	Students per Group	Main Equipment and Software
1	Distributed Computing Program Design	4	Write a client/server communication program based on the Socket API or Java RMI in which the client calls the server to perform a simple information query (e.g., querying information about files on the server).	Design	Compulsory	1	PC, Java development environment
2	Basic Big Data Operations	4	Master basic file operations on the HDFS distributed file system, become familiar with running MapReduce programs, master basic HBase database operations and basic use of the Hive data warehouse, and design a simple big data storage program (e.g., a program that stores and reads data in HDFS or HBase).	Demonstration	Compulsory	1-2	PC server, Hadoop ecosystem software
3	Log Big Data Analysis and Computing	4	Use MapReduce or Hive to analyze large-scale log data (e.g., mobile users' web access logs), implementing basic log queries and statistics (e.g., computing the top URLs in the access logs to analyze users' browsing preferences).	Comprehensive	Compulsory	1-2	PC server, Hadoop ecosystem software
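
The top-URL statistic in Experiment 3 can be illustrated with a small single-process sketch of the map, shuffle and reduce stages. This is plain Python, not the Hadoop API: on a cluster the map and reduce functions would run as distributed tasks. The log layout (`<user_id> <url>` per line), the sample records and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(log_line):
    """Map: emit (url, 1) for one log record; malformed lines emit nothing."""
    fields = log_line.split()
    if len(fields) >= 2:
        yield fields[1], 1  # assumed layout: "<user_id> <url>"


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(url, counts):
    """Reduce: sum the per-record counts for one URL."""
    return url, sum(counts)


def top_urls(log_lines, n):
    """Chain the three stages and return the n most visited URLs."""
    pairs = (pair for line in log_lines for pair in map_phase(line))
    totals = [reduce_phase(url, counts) for url, counts in shuffle(pairs).items()]
    # Sort by descending count, then by URL so that ties are deterministic.
    return sorted(totals, key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical sample records; real logs would be read from HDFS.
SAMPLE_LOG = [
    "u1 example.com/news",
    "u2 example.com/video",
    "u1 example.com/news",
    "u3 example.com/news",
    "u2 example.com/video",
    "u3 example.com/mail",
    "malformed",
]

if __name__ == "__main__":
    print(top_urls(SAMPLE_LOG, 2))
```

Keeping the three stages as separate functions mirrors how a MapReduce job is split, so the same map and reduce logic could later be ported to a Hadoop job or expressed as a Hive `GROUP BY` query with an `ORDER BY` on the count.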