
Cleveland State University Department of Electrical and Computer Engineering

CIS 612/712 Big Data & Parallel Database Processing Systems

Catalog Data: Big Data & Parallel Database Processing Systems (3-0-3). Prerequisites: CIS 505 and CIS 530. Detailed study of modern database processing and parallel database processing systems for big data. Topics include the transaction concept, concurrency control strategies, and processing strategies for semi-structured and unstructured data. The course then advances to big data processing on the Hadoop distributed file system with the MapReduce paradigm, and focuses on massively parallel database processing systems for big data, including selected NoSQL systems, NewSQL systems, and platforms and infrastructures. The course covers data models, indexes, querying techniques, data processing methods, and ACID (Atomicity, Consistency, Isolation, and Durability) issues in parallel database processing systems. Students will get hands-on experience with big data processing systems by processing real-time big data streams obtained from well-known social network sites. Finally, the course will explore the latest industry advances in big data processing and data analytics.

Textbooks: Fundamentals of Database Systems, by Elmasri / Navathe. 7th Edition. Addison Wesley Pub Co.

Lecture Notes taken from Database Research Papers and Industry Parallel Database Systems Documentation on Big Data Processing Systems and Data Analytics

References:
• Tutorials for Hadoop/MapReduce and VM (Virtual Machine)
• Tutorials for NoSQL Systems - Hive, HBase, PigLatin, MongoDB, Cassandra
• Tutorials for NewSQL System VoltDB

Coordinator: Dr. Sunnie S. Chung

Outcomes: Upon successful completion of the course, the student will be able to:
• Understand a well-defined transaction concept and concurrency control strategies in database processing systems;
• Create modern database applications that process non-traditional data: semi-structured data such as JSON or XML data, or unstructured data such as web logging data;
• Understand big data processing techniques and gain comprehensive knowledge of Massively Parallel Processing (MPP) systems (NoSQL/NewSQL systems) and Cloud Computing;
• Obtain hands-on experience with parallel data processing systems and tools, and with cloud computing platforms and infrastructures for big data processing;
• Build an infrastructure for big data processing systems;
• Be exposed to the latest advances in database industry research in big data processing.

Topics (Lecture Hours)

1. Introduction to Big Data: Transaction, ACID (Atomicity, Consistency, Isolation, Durability), Concurrency Control (3)
2. Database Programming Constructs: Database Triggers, Stored Procedure, Embedded SQL, Dynamic SQL, JDBC/ODBC, PHP (3)
3. User Defined Function (UDF), User Defined Type (UDT), User Defined Aggregate (UDA), Table Function, Common Language Runtime (CLR) Functions and Types (3)
4. Enhanced Data Models for Advanced Applications: Semi-Structured and Unstructured Data; XML Data Processing, XPath, XQuery; JavaScript Object Notation (JSON) Data Processing (3)
5. Introduction to Information Retrieval and Web Data Processing: Data Models for Unstructured Big Data Processing; Google's Big Table (3)
6. Introduction to Big Data: Google's Map Reduce Paradigm; File System for Parallel Processing (3)
7. Big Data Processing and Massively Parallel Processing Systems: NoSQL Systems: Pig Latin on Apache Hadoop by Yahoo and Apache Hive with Hadoop; HBase (3)
8. MongoDB; Cassandra (3)
9. Key Value Stores; Map Reduce Join Algorithms (3)
10. NewSQL System: VoltDB; Extended PDW with Map Reduce and Hadoop: Oracle (3)
11. NoSQL vs NewSQL (3)
12. ACID of Massively Parallel Processing Systems (3)
13. Cloud Computing: Platforms and Infrastructures (3)
14. Advanced Research Literature Review and Presentations (3)
15. Exams and Reviews (3)

Total: 45

Grading: The course grade is based on a student's overall performance throughout the semester. The final grade is distributed among the following components:
• Exams (Midterm & Final): 35% (15% Midterm, 20% Final)
• Computer Labs: 30% (about 4-5 lab assignments)
• 1 Project on big data processing (2-person group project): 25%
• Research Paper Presentation: 10%

Additional Requirements for CIS 712 Students:
• Doctoral students who take CIS 712 must select a project to work on.
• Doctoral students who take CIS 712 must work on the project individually (instead of in a 2-person group).
• The list of projects and research papers for doctoral students will be given separately in class. A tentative example of the research projects and the paper list is given at the end of this course schedule.
• In each exam, one additional problem is designed to be completed by doctoral students only.

Computer Software Required: Installation and setup instructions for each system will be given in class.

1. Visual Studio 2012/2013 or higher
2. SQL Server 2014 or higher
3. SQL Server Data Tools for Analysis Service 2014 or higher
4. Hadoop/MapReduce and VM
5. Hive
6. PigLatin
7. HBase
8. MongoDB
9. Cassandra
10. VoltDB

Tentative List of Research Papers and Projects for CIS 712 Doctoral Students: CIS 712 doctoral students should choose one of the following research topics, give a 30-minute presentation on the assigned papers, and complete a project related to the topic. The paper list and project specification for each research topic below will be given in class.

Examples of Selected Current Database Research Topics in Big Data and Parallel Database Systems (The topics and the paper list may vary every year.)

1. Semistructured/Unstructured Data Processing
2. Hadoop-Based Data Warehousing and Analytics Infrastructure at Facebook
3. Parallel Computing for Big Data Processing:
   • Google Cloud, Cloud
   • Hadoop-Based NoSQL Systems
   • NewSQL Systems
4. MapReduce: Simplified Data Processing on Large Clusters, by Google
5. Lämmel, Ralf. Google's MapReduce Programming Model Revisited.
6. Stream Processing: Spark
7. NoSQL Systems: Pig Latin, HBase, Hive, MongoDB, Cassandra
8. Map Reduce Join Algorithms
9. Data Partition Techniques
10. Performance Survey: SQL vs NoSQL
11. Processing MR/Hadoop with PDW: Oracle, Teradata
12. Information Retrieval: Google
13. Big Systems
14. Cloud Computing: Amazon Cloud, Google Cloud

• Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, et al. (Yahoo! Research), in the proceedings of SIGMOD 2008
• Data Warehousing and Analytics Infrastructure at Facebook, Ashish Thusoo, et al. (Facebook), in the proceedings of SIGMOD 2010
• Petabyte Scale Databases and Storage Systems Deployed at Facebook, Dhruba Borthakur, et al., in the proceedings of SIGMOD 2014
• Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture, Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin (Twitter, Inc), SIGMOD 2014
• The “Big Data” Ecosystem at LinkedIn, Roshan Sumbaly, Jay Kreps, and Sam Shah (LinkedIn), SIGMOD 2015
• Avatara: OLAP for Webscale Analytics Products, Lili Wu, Roshan Sumbaly, Chris Riccomini, Gordon Koo, Hyung Jin Kim, Jay Kreps, Sam Shah (LinkedIn), SIGMOD 2014
• Microsoft Azure as a Self-Managing Database Service: Lessons Learned and Challenges Ahead, Kunal Mukerjee, et al. (Microsoft), in the proceedings of IEEE Computer Society Technical Committee on Data Engineering 2014