NEBC Database Course 2008 – Bonus Material
Part I, Database Normalization

"I once had a client ask me to normalize a database that already was. When I asked her what she was really asking for, she said, 'You know, so normal people can read it.' She wasn't very techno-savvy."
– Quetzzz on digg.com, 31st May 2008

In the course, we have shown how to design a well-structured database for your own use. The formal process of structuring your data in this way is known as normalisation. Relational database theory, and the principles of normalisation, were first constructed by people with a strong mathematical background, who wrote about databases using terminology which was not easily understood outside those mathematical circles.

Data normalisation is a set of rules and techniques concerned with:

• Identifying relationships among attributes.
• Combining attributes to form relations.
• Combining relations to form a database.

It follows a set of rules worked out by E. F. Codd in 1970. A normalised relational database provides several benefits:

• Elimination of redundant data storage.
• Close modelling of real-world entities, processes, and their relationships.
• Structuring of data so that the model is flexible.

There are numerous steps in the normalisation process – 1st Normal Form (1NF), 2NF, 3NF, Boyce-Codd Normal Form (BCNF), 4NF, 5NF and DKNF. Database designers often find it unnecessary to go beyond 3rd Normal Form. This does not mean that the higher forms are unimportant, just that the circumstances for which they were designed often do not exist within a particular database.

1st Normal Form

A table is in first normal form (1NF) if all the key attributes have been defined and it contains no repeating groups. Chris Date expands this definition as follows:

1. There's no top-to-bottom ordering to the rows.
2. There's no left-to-right ordering to the columns.
3. There are no duplicate rows.
4. Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else).
5. All columns are regular [i.e. rows have no hidden components such as row IDs, object IDs, or hidden timestamps].

Point 4 is considered the key feature of 1NF. It essentially says that you can't squeeze in extra data. For example, if you have a PERSON table with an 'address' column, you can't just put the phone number on the end of the address without violating 1NF. Likewise, if you add a 'phone_number' column and then find that someone has two or three numbers, you can't put the multiple values into the one column, and you shouldn't add extra columns to deal with the repeating values. To satisfy 1NF in such circumstances a new relation must be created.

2nd Normal Form

A table is in second normal form (2NF) if and only if it is in 1NF and every non-key attribute is fully functionally dependent on the whole of the primary key (i.e. there are no partial dependencies). Anomalies can occur when attributes depend on only part of a multi-attribute (composite) key, as this indicates that redundant information is being stored.

3rd Normal Form

A table is in third normal form (3NF) if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key (i.e. there are no transitive dependencies). This is basically a more general case of 2NF: 2NF can always be satisfied by using a surrogate key, but the partial dependencies remain and violate 3NF.

Consider the following table of customer orders:

order(order_id, customer_id, cust_address, cust_phone, date, total)

Here, order_id and customer_id together form the candidate key, but the customer address and phone number depend only on the customer ID, not the order ID. To satisfy 2NF we could ensure that order IDs are all unique and use order_id alone as the primary key. However, this still violates 3NF, as cust_address and cust_phone remain tied to customer_id rather than being direct properties of the order. To satisfy 3NF we must split the customer and order details into separate relations, as sketched below.
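As an illustration only, here is roughly what that split might look like in PostgreSQL. The table and column names follow the example above, but the data types and constraints are assumptions:

-- Customer details move into their own relation, keyed by customer_id,
-- so the address and phone number are stored exactly once per customer.
CREATE TABLE customer (
    customer_id  serial PRIMARY KEY,
    cust_address text,
    cust_phone   text
);

-- Each order now records only a reference to its customer.
-- ("order" is a reserved word in SQL, hence the double quotes.)
CREATE TABLE "order" (
    order_id    serial PRIMARY KEY,
    customer_id integer NOT NULL REFERENCES customer (customer_id),
    date        date,
    total       numeric(10,2)
);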
Boyce-Codd Normal Form

A table is in Boyce-Codd normal form (BCNF) if and only if it is in 3NF and every determinant is a candidate key. If you really want to know what this means, along with the further normal forms, it's probably time to buy a proper book (or get out more)!

Mnemonic

The basic principle of normalisation can be summed up thus:

"All attributes must depend on the key, the whole key, and nothing but the key"

Adapted from the website of Tony Marston:
http://www.tonymarston.co.uk/php-mysql/database-design.html#normalisation
Additional info from Wikipedia.

Part II, Transactions

Transactions (in the computing sense) are not strictly limited to relational databases, but there is a strong historical association. Implementing them properly requires specific capabilities of the software, which is often referred to as ACID compliance. The following comes, with a few edits, from good old Wikipedia:

In computer science, ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. An example of a transaction is a transfer of funds from one account to another, even though it might consist of multiple individual operations (such as debiting one account and crediting another).

Atomicity

Atomicity states that sets of database modifications must follow an "all or nothing" rule. Each transaction is said to be "atomic". For example, the transfer of funds from one account to another can be completed or it can fail for a multitude of reasons, but atomicity guarantees that one account won't be debited if the other is not credited. If one part of the transaction fails, the entire transaction fails.
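In PostgreSQL, the funds transfer would be wrapped in an explicit transaction, along these lines (the account table and its columns are invented for illustration):

BEGIN;
-- Debit one account and credit the other. Neither change is visible
-- to other connections until the COMMIT, and if anything fails in
-- between, a ROLLBACK undoes both updates.
UPDATE account SET balance = balance - 100 WHERE account_id = 1;
UPDATE account SET balance = balance + 100 WHERE account_id = 2;
COMMIT;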
Consistency

The consistency property ensures that the database is in a consistent state before the transaction starts and after it is over (whether it was successful or not). Consistency states that only valid data will be written to the database. If, for some reason, a transaction is executed that violates the database's consistency rules, the entire transaction will be rolled back and the database will be restored to a state consistent with those rules. On the other hand, if a transaction executes successfully, it will take the database from one state that is consistent with the rules to another state that is also consistent with the rules.

Isolation

Isolation refers to the requirement that other operations cannot access or see the data in an intermediate state during a transaction. Since RDBMSs allow multiple simultaneous connections, this constraint is required to ensure that the consistency rule holds between simultaneous transactions in a database system.

Durability

Durability refers to the guarantee that once the user has been notified of success, the transaction will persist and not be undone. This means it will survive system failure (i.e. the data has been written out to disk, not just RAM), and that the database system has checked the integrity constraints and won't need to abort the transaction.

Part III, Bulk Loading Data

One typical requirement when implementing a database is to load a large batch of data from, say, a spreadsheet. Unfortunately PGAdmin3 does not yet support this feature, but there are a couple of approaches you can take that are reasonably simple. The following are specific to PostgreSQL, but MySQL etc. will have similar features.

Direct loading with COPY and psql

psql is the command-line client for PostgreSQL. You can tell it the host server (-h) and user name (-U) to connect with, amongst other things. Normally psql gives you an interactive prompt, but by passing a command with the -c flag we can execute a single command:

psql -h dbserver.nox.ac.uk -U student01 -c \
'COPY movie (title,genre,length,rating,year,rentalprice) '\
'FROM STDIN WITH CSV HEADER;' < moremovies.csv

Note that this is a single command to be entered at the shell prompt, but it has been wrapped to fit on the page. The backslash (\) characters tell the shell to continue the command onto the next line.

The COPY command is similar to INSERT, but takes its values from a file. The WITH CSV HEADER part tells the database to ignore the first (header) line of the file and to expect CSV data in the rest. If any data in the file does not fit the table definition (e.g. text in a numeric field, missing data, duplicate values in a column with a unique constraint) then the whole COPY will fail.

Indirect loading with the Quick Loader script

See: http://darwin.nerc-oxford.ac.uk/pgp-wiki/index.php/PostgreSQL_and_MySQL_scripts

This Perl script provides a convenient way to load a CSV file directly into the database. It makes a table structure and generates the correct COPY command to populate the table in one go. The script is not entirely bullet-proof, but it should work so long as you have a well-formed CSV file (e.g. exported from Excel or Calc) with a single line of column headings. You run, e.g.:

perl csvtopg.perl myfile.csv > myfile.sql

Now you can load the file into your database with psql and the -f flag:

psql -h myhost -d mydatabase -f myfile.sql

The resulting table is almost certainly not ready for use, as all columns will be text and no keys will be defined, but you can now easily load the data from here into your real table.
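That last step is usually an INSERT ... SELECT from the freshly loaded table into the real one. As a rough sketch, supposing the script created a staging table called myfile (a hypothetical name) and the target is the movie table from earlier:

-- Copy the rows across, casting the all-text staging columns to the
-- proper types of the target table.
INSERT INTO movie (title, genre, length, rating, year, rentalprice)
SELECT title,
       genre,
       length::integer,
       rating,
       year::integer,
       rentalprice::numeric
FROM myfile;

-- Once the data is safely across, the staging table can be dropped.
DROP TABLE myfile;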