NEBC Database Course 2008
Implementing a Relational Database
Tim Booth : [email protected] Implementing a relational database
establish requirements Data Requirements data analysis
Conceptual Data Model
Logical Schema Specification
implement
schema and database Our design so far: relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism relation Organism OrganismNo: Integer Species: String Strain: String GenomeSeq: Boolean CommonName: String primary key: OrganismNo relation Feature FeatID: String Name: String primary key: FeatID foreign key: SourceSequence references Sequence Implementing our database
• Most of hard work is already done • Create relations using SQL • Define the data types for our columns • Define primary and foreign keys
• Add constraints • Add any appropriate default values Our design so far:
relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName Create Table relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName
CREATE TABLE publicdatabase ( databasename , datatype , url ); Add data types
Reminder: • Numerical • integer,float,numeric • String/Text • varchar,text • Date/Time • timestamp,date • Boolean Add data types relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName
CREATE TABLE publicdatabase ( databasename varchar(50), datatype varchar(20), url varchar(200) ); Primary Keys relation PublicDatabase DatabaseName: String DataType: String URL: String primary key: DatabaseName
CREATE TABLE publicdatabase ( databasename varchar(50), datatype varchar(20), url varchar(200), primary key (databasename) ); Foreign Keys relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism
CREATE TABLE sequence ( accessionnumber varchar(50), id varchar(50), sourcedatabase varchar(50), sourceorganism integer, primary key (accessionnumber), foreign key (sourcedatabase) references publicdatabase, foreign key (sourceorganism) references organism ); Foreign Keys relation Sequence AccessionNumber: String ID: String primary key: AccessionNumber foreign key: SourceDatabase references PublicDatabase foreign key: SourceOrganism references Organism
CREATE TABLE sequence ( accessionnumber varchar(50), id varchar(50), sourcedatabase varchar(30), sourceorganism integer, primary key (accessionnumber), foreign key (sourcedatabase) references publicdatabase(databasename), foreign key (sourceorganism) references organism(organismnumber) ); feature table
CREATE TABLE feature ( featid varchar(50), name varchar(100), sourcesequence varchar(50), primary key (featid), foreign key (sourcesequence) references sequence(accessionnumber) ); organism table
CREATE TABLE organism ( organismnumber integer species varchar(100), strain varchar(100), genomeseq boolean, commonname varchar(100), primary key (organismnumber) ); Constraints
• Constraints restrict the values that can be inserted or updated in columns • Types of constraints • NOT NULL • UNIQUE • Simply add to column definition url varchar(100) NOT NULL or url varchar(100) UNIQUE • NOT NULL and UNIQUE implicit on primary key Constraints
• CHECK constraint numberoflegs integer CHECK (numberoflegs>2) publicdatabase table
CREATE TABLE publicdatabase ( databasename varchar(50), datatype varchar(20), url varchar(200) UNIQUE, primary key (databasename) ); Constraints
• To keep links between tables working you need to preserve the matching values – referential integrity • automatically set up when you declare the primary and foreign keys
• This will prevent you from deleting a record with a primary key before you have deleted all the child foreign key records Constraints
• Example from 'BigHit' database Constraints organism table
CREATE TABLE organism ( organismnumber integer, species varchar(100), strain varchar(100), genomeseq boolean, commonname varchar(100), primary key (organismnumber) );
How shall we create the unique primary key values for organismnumber? Sequences
• Sequence is a database object in PostgreSQL which gives you an automatically incrementing numeric value (equivalent to 'autonumber' in Access)
CREATE SEQUENCE my_seq (can specify increment,min and max)
SELECT NEXTVAL('my_seq') Default values
• Still don't want to have to select the value each time • Can set a default value for column which is automatically filled in every time a record is inserted
CREATE TABLE organism ( organismnumber integer DEFAULT NEXTVAL('my_seq'), species varchar(100), strain varchar(100), genomeseq boolean, commonname varchar(100), primary key (organismnumber) ); Create your database
• To create your database run your SQL table and other object creation statements in a single script • Example - demodatabase.sql • Be sure create tables in the right order • Can't create table that refers to a primary key in a table that doesn't exist yet • You also need data... Populate your database
• Insert data using INSERT sql statements
INSERT INTO organism (species,strain,genomeseq,commonname) VALUES ('Oryctolagus cuniculus',NULL,'false','rabbit');
• Default values will inserted automatically
CREATE TABLE organism ( organismnumber integer DEFAULT NEXTVAL('my_seq'), ... Populate your database
• Be sure to insert data in correct order
• Don't try and insert a foreign key value when the primary key value hasn't been inserted yet
• Run the demodatabase.sql script Querying your database
• Now that your database is set up and data has been inserted we can query it Database Sequence Type ID Accession Organism Name Number Swissprot Protein Phosphorylase B KPB1_Rabit Rabbit kinase alpha regulatory chain
UniProt Protein Troponin T Trt3_rabit
Swissprot Proten Glycogen PHS2_RABIT rabbit phosphorylase Swissprot protein Troponin I TRIC_RABIT EMBL Nucleotide rabbit muscle OCPHOS2 rabbit phosphorylase mrna dbEST Nucleotide CK829726 Rabbit TrEMBL protein pol protein Q8MJF7 Rabbit EMBL Nucleotide Rabbit OXPKA Rabbit phosphorylase PDB protein_structure Glycogen 1ABB rabit Phosphorylase Querying your database Querying your database Querying your database Querying your database
• What have we gained? • No data redundancy • Data is consistent • Enforced quality control – no missing data • Only have to change data once • Flexibility to run a variety of queries Views
• Views are queries that are saved in the database as objects • Appear much like a table which can be queried in the same way • Good if underlying query is very complex
CREATE VIEW viewname AS query Views Views
Indexes
• Searching data by scanning is slow • Indexes make this searching faster • Implicit indexes are set up for primary keys as these are used a lot for searching data Indexes
• An index can be created on any column
CREATE INDEX orgname_idx on organism (commonname)
• An index is helpful on a column that is regularly searched on (i.e. Used in the WHERE clause) Index worked example
• There are more movies in the file demodata/moremovies.csv • These are already loaded into the database table 'demodata.moremovies' • in PGAdmin3: • INSERT INTO bighit.movie (SELECT * FROM demodata.moremovies);
• SELECT * FROM movie where rating = 'U'; Index worked example
• Explain the query
• Now make an index: • CREATE INDEX myindex ON movie (rating); • Now explain the original query again • This works for very complex queries! Complex Query Analysis The “MART” Strategy
• Normalisation is the process of removing data redundancy from your database design • But it adds complexity • Views can make querying simpler • Indexes can make querying faster • But... Sometimes this is not enough. Maintaining a summary table for quick querying is known as 'denormalisation' • Many large databases (eg. EnsEMBL) resort to this More features...
• stored procedures • triggers • cascading updates • custom types • custom functions • extension modules • load balancing • replication • ...