Enterprise Data Analysis and Design

Lecture 1: Introduction to Database Technology
Johannes Gehrke
[email protected]
http://www.cs.cornell.edu/johannes

NBA 518: Enterprise Data Design and Analysis
NBA 518 Spring 2004: Lecture 1

Course Goals
• Architectures of modern enterprise information systems
• Understand the functionality of modern database and data mining systems
• Understand where database systems and data mining fit into an enterprise information system
• Learn to ask the right questions
• Learn how to use several important tools
  • Data modeling (DeZign for Databases)
  • Data mining (SAS Enterprise Miner)

DeZign for Databases

SAS Enterprise Miner

Course Outline
• 1/26 Database Management Systems
• 1/28 Enterprise Information Architectures
• 2/2, 2/4, and 2/9: Data Modeling
• 2/11, 2/16, 2/18, 2/23, and 2/25: Data Mining
• 3/2 OLAP
• 3/4 Web Services
• 3/9 Future Trends

Course Mechanics
• Temporary course homepage: http://www.cs.cornell.edu/johannes/teaching/NBA518
• Slides will be online the morning before each lecture
• Readings for each class will be available online
• Office hours:
  • Tuesdays 1:30-2:30, Upson Hall 4105B
  • Mondays and Wednesdays from 2:00 until the start of class in Sage Hall Atrium
• Always welcome to ask questions via email ([email protected])
• Ask questions after the lecture

Grading
• Five homework assignments:
  • Enterprise architectures (15%)
  • Data modeling (15%)
  • Data mining I: Classification (15%)
  • Data mining II: Clustering and Associations (15%)
  • A complete case study (in groups of 2-3 students, 20%)
• Class participation (20%): Quality and not quantity counts

Introduction: About the Instructor
Johannes Gehrke is an Assistant Professor in the Department of Computer Science at Cornell University. He obtained his Ph.D. in computer science from the University of Wisconsin-Madison in 1999; his graduate studies were supported by a Fulbright fellowship and an IBM fellowship. Johannes' research interests are in the areas of data mining, data stream processing, and distributed data management for sensor networks and peer-to-peer networks. Johannes has received a National Science Foundation Career Award, an Alfred P. Sloan Fellowship, an IBM Faculty Award, and the Cornell College of Engineering James and Mary Tien Excellence in Teaching Award. He is the author of numerous publications on data mining and database systems, and he co-authored the undergraduate textbook Database Management Systems (McGraw-Hill, 2002, currently in its third edition), used at universities all over the world. Johannes has served as Program Co-Chair of the 2001 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Tutorial Chair for the 2001 IEEE International Conference on Data Mining, Area Chair for the Twentieth International Conference on Machine Learning, and Co-Chair of the 2003 ACM SIGKDD Cup, and he is serving as Program Co-Chair of the 2004 ACM SIGKDD Conference. Johannes has given courses and tutorials on data mining and data stream processing at international conferences and on Wall Street, and he has extensive industry experience as a technical advisor.

Introduction: Students

Goal of This Lecture
• Understand the basic functionality of a database system

The Big Picture
[Figure: architecture diagram. Labels: WWW Site Visitor; Internal User; Intranet, VPN; The Web; Internal Web Server; Main Public Web Server; Memory Cache; Data Warehouse; Business Transaction Server; Application Server; DBMS]

Why Database Systems?
Discuss with your neighbor: What functionality is required from database systems in the following application scenarios?
• eBay (www.ebay.com)
• Barnes and Noble (www.bn.com)
• General Motors (www.gm.com)
• The Protein Data Bank (http://www.rcsb.org/pdb)
• Sprint (www.sprint.com)
• Your cell phone

Why Store Data in a DBMS?
• Benefits
  • Transactions (concurrent data access, recovery from system crashes)
  • High-level abstractions for data access, manipulation, and administration
  • Data integrity and security
  • Performance and scalability

A Digression: What Is a Transaction?
The execution of a program that performs a function by accessing a database.
Examples:
• Reserve an airline seat. Buy an airline ticket.
• Withdraw money from an ATM.
• Verify a credit card sale.
• Order an item from an Internet retailer.
• Download a video clip and pay for it.
• Place a bid at an on-line auction.

Transactions
• A transaction is an atomic sequence of actions
• Each transaction must leave the system in a consistent state (provided the system is consistent when the transaction starts)
• The ACID Properties:
  • Atomicity
  • Consistency
  • Isolation
  • Durability

Example Transaction: Online Store
Your purchase transaction:
• Atomicity: Either the complete purchase happens, or nothing happens
• Consistency: The inventory and internal accounts are updated correctly
• Isolation: It does not matter whether other customers are also currently making a purchase
• Durability: Once you have received the order confirmation number, your order information is permanent, even if the site crashes

Transactions (Contd.)
A transaction will commit after completing all its actions, or it could abort (or be aborted by the DBMS) after executing some actions.
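
The commit-or-abort behavior described above can be sketched with Python's built-in sqlite3 module standing in for the DBMS. This is a minimal illustration and not part of the original slides: the `inventory` and `orders` tables and the `purchase` function are hypothetical names chosen for the example. The order row is written first on purpose, so that a later failure shows the abort undoing work that had already been done.

```python
import sqlite3

# In-memory database with two hypothetical tables for the online-store example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INTEGER)")
conn.execute("CREATE TABLE orders (item TEXT, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('book', 1)")
conn.commit()


def purchase(conn, item, qty):
    """Record an order and decrement stock as one atomic unit.

    Returns True if the transaction committed, False if it aborted.
    """
    try:
        # "with conn" commits if the block finishes normally and
        # rolls back if an exception escapes from it.
        with conn:
            conn.execute("INSERT INTO orders VALUES (?, ?)", (item, qty))
            cur = conn.execute(
                "UPDATE inventory SET stock = stock - ? "
                "WHERE item = ? AND stock >= ?",
                (qty, item, qty),
            )
            if cur.rowcount != 1:
                # Not enough stock: abort, which also undoes the INSERT above.
                raise ValueError("insufficient stock")
        return True
    except ValueError:
        return False


first = purchase(conn, "book", 1)   # enough stock: commits
second = purchase(conn, "book", 1)  # out of stock: aborts

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
stock_left = conn.execute(
    "SELECT stock FROM inventory WHERE item = 'book'"
).fetchone()[0]
print(first, second, order_count, stock_left)
```

After the second call aborts, the orders table still holds only the first order, which is the "either the complete purchase happens, or nothing" guarantee from the online store example.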
Example Transaction: ATM
You withdraw money from the ATM
• Atomicity
• Consistency
• Isolation
• Durability
Commit versus abort? What are reasons for commit or abort?

Transactions: Examples
Give examples of transactions in the following applications. Which of the ACID properties are needed?
• eBay (www.ebay.com)
• Barnes and Noble (www.bn.com)
• General Motors (www.gm.com)
• The Protein Data Bank (http://www.rcsb.org/pdb)
• Sprint (www.sprint.com)
• Your cell phone

What Makes Transaction Processing Hard
• Reliability - system should rarely fail
• Availability - system must be up all the time
• Response time - within 1-2 seconds
• Throughput - thousands of transactions/second
• Scalability - start small, ramp up to Internet-scale
• Security - for confidentiality and high finance
• Configurability - for above requirements + low cost
• Atomicity - no partial results
• Durability - a transaction is a legal contract
• Distribution - of users and data

Reliability and Availability
• Reliability - system should rarely fail
• Availability - system must be up all the time

Downtime            Availability
1 hour/day          95.8%
1 hour/week         99.41%
1 hour/month        99.86%
1 hour/year         99.9886%
1 second/day        99.9988%
1 hour/20 years     99.99942%
1 second/week       99.99983%

Performance
• Response time - within 1-2 seconds
• Throughput - thousands of transactions/second
• Scalability - start small, ramp up to Internet-scale

What Makes TP Important?
• It is at the core of electronic commerce
• Most medium-to-large businesses use TP for their production systems. The business can't operate without it.
• It is a huge slice of the computer system market, over $50B/year.
Probably the single largest application of computers.

TP System Infrastructure
• User's viewpoint
  • Enter a request from a browser or other display device
  • The system performs some application-specific work, which includes database accesses
  • Receive a reply (usually, but not always)
• The TP system ensures that each transaction
  • is an independent unit of work
  • executes exactly once, and
  • produces permanent results.
• TP system makes it easy to program transactions
• TP system has tools to make it easy to manage

TP System Infrastructure
[Figure: layered diagram, top to bottom]
End-User
Front-End (Client): Presentation Manager
  requests
Back-End (Server): Workflow Control (routes requests)
Transaction Program
Database System

System Characteristics
• Typically < 100 transaction types per application
• Transaction size has high variance. Typically,
  • 0-30 disk accesses
  • 10K - 1M instructions executed
  • 2-20 messages
• A large-scale example: airline reservations
  • 150,000 active display devices
  • plus indirect access via Internet travel agents
  • thousands of disk drives
  • 3000 transactions per second, peak

Exercise
• Reliability - system should rarely fail
• Availability - system must be up all the time
• Response time - within 1-2 seconds
• Throughput - thousands of transactions/second
• Scalability - start small, ramp up to Internet-scale
• Security - for confidentiality and high finance
• Configurability - for above requirements + low cost
• Atomicity - no partial results
• Durability - a transaction is a legal contract
• Distribution - of users and data
• Question: Think of a TP System that you know of, and discuss with your
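
The downtime figures on the Reliability and Availability slide follow from simple arithmetic: availability is the fraction of a period during which the system is up. The sketch below reproduces the hour-based rows. It is an illustration added here, not part of the original slides; the function names are hypothetical, and it assumes a 30-day month and a 365-day year, so the last digit may differ slightly from the rounding used on the slide.

```python
# Length of each period in hours (assumes a 30-day month and a 365-day year).
PERIOD_HOURS = {
    "day": 24,
    "week": 7 * 24,
    "month": 30 * 24,
    "year": 365 * 24,
    "20 years": 20 * 365 * 24,
}


def availability(downtime_hours, period_hours):
    """Fraction of the period during which the system is up."""
    return 1.0 - downtime_hours / period_hours


def yearly_downtime_hours(target_availability):
    """Hours of downtime per year that a given availability target allows."""
    return (1.0 - target_availability) * PERIOD_HOURS["year"]


# One hour of downtime in each period, expressed as a percentage.
for period, hours in PERIOD_HOURS.items():
    pct = 100 * availability(1, hours)
    print(f"1 hour/{period}: {pct:.5f}%")

# Going the other way: a 99.9% target leaves under nine hours of downtime a year.
print(f"99.9% allows {yearly_downtime_hours(0.999):.2f} hours/year")
```

Reading the table in the other direction is often more useful in practice: an availability target translates directly into a downtime budget that has to cover failures and planned maintenance alike.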
