
NEBC Course 2008

Implementing a relational database

Tim Booth : [email protected]

[Design process diagram: establish requirements -> Data Requirements -> data analysis -> Conceptual Data Model -> Logical Schema Specification -> implement -> schema and database]

Our design so far:

relation PublicDatabase
  DatabaseName: String
  DataType: String
  URL: String
  primary key: DatabaseName

relation Sequence
  AccessionNumber: String
  ID: String
  primary key: AccessionNumber
  foreign key: SourceDatabase references PublicDatabase
  foreign key: SourceOrganism references Organism

relation Organism
  OrganismNo: Integer
  Species: String
  Strain: String
  GenomeSeq: Boolean
  CommonName: String
  primary key: OrganismNo

relation Feature
  FeatID: String
  Name: String
  primary key: FeatID
  foreign key: SourceSequence references Sequence

Implementing our database

• Most of the hard work is already done
• Create relations using SQL
• Define the data types for our columns
• Define primary and foreign keys
• Add constraints
• Add any appropriate default values

Our design so far:

relation PublicDatabase
  DatabaseName: String
  DataType: String
  URL: String
  primary key: DatabaseName

Create

CREATE TABLE publicdatabase (
    databasename ,
    datatype ,
    url
);

Add data types

Reminder:
• Numerical
    • integer, float, numeric
• String/Text
    • varchar, text
• Date/Time
    • timestamp, date
• Boolean

CREATE TABLE publicdatabase (
    databasename varchar(50),
    datatype varchar(20),
    url varchar(200)
);

Primary Keys

CREATE TABLE publicdatabase (
    databasename varchar(50),
    datatype varchar(20),
    url varchar(200),
    primary key (databasename)
);

Foreign Keys

relation Sequence
  AccessionNumber: String
  ID: String
  primary key: AccessionNumber
  foreign key: SourceDatabase references PublicDatabase
  foreign key: SourceOrganism references Organism

CREATE TABLE sequence (
    accessionnumber varchar(50),
    id varchar(50),
    sourcedatabase varchar(50),
    sourceorganism integer,
    primary key (accessionnumber),
    foreign key (sourcedatabase) references publicdatabase,
    foreign key (sourceorganism) references organism
);

CREATE TABLE sequence (
    accessionnumber varchar(50),
    id varchar(50),
    sourcedatabase varchar(50),
    sourceorganism integer,
    primary key (accessionnumber),
    foreign key (sourcedatabase) references publicdatabase(databasename),
    foreign key (sourceorganism) references organism(organismnumber)
);

feature table

CREATE TABLE feature (
    featid varchar(50),
    name varchar(100),
    sourcesequence varchar(50),
    primary key (featid),
    foreign key (sourcesequence) references sequence(accessionnumber)
);

organism table

CREATE TABLE organism (
    organismnumber integer,
    species varchar(100),
    strain varchar(100),
    genomeseq boolean,
    commonname varchar(100),
    primary key (organismnumber)
);

Constraints

• Constraints restrict the values that can be inserted or updated in columns
• Types of constraints
    • NOT NULL
    • UNIQUE
• Simply add to the column definition:
    url varchar(200) NOT NULL
  or
    url varchar(200) UNIQUE
• NOT NULL and UNIQUE are implicit on a primary key
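
The constraints above can be tried out without a PostgreSQL server. What follows is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in, the example.org URLs are made up, and the helper name rejected is hypothetical. SQLite does not reliably treat a primary key as NOT NULL, so only the implied uniqueness of the key is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""
    CREATE TABLE publicdatabase (
        databasename varchar(50),
        datatype     varchar(20),
        url          varchar(200) NOT NULL UNIQUE,
        primary key (databasename)
    )
""")
# Made-up sample row; the URL is a placeholder.
conn.execute(
    "INSERT INTO publicdatabase VALUES ('EMBL', 'Nucleotide', 'http://example.org/embl')"
)

def rejected(sql):
    """Return True if the statement is refused because it breaks a constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

# Missing url: breaks NOT NULL.
null_url = rejected("INSERT INTO publicdatabase VALUES ('PDB', 'Structure', NULL)")
# Same url as the EMBL row: breaks UNIQUE.
duplicate_url = rejected(
    "INSERT INTO publicdatabase VALUES ('dbEST', 'Nucleotide', 'http://example.org/embl')"
)
# Same key as the EMBL row: uniqueness is implied by the primary key.
duplicate_key = rejected(
    "INSERT INTO publicdatabase VALUES ('EMBL', 'Nucleotide', 'http://example.org/other')"
)
print(null_url, duplicate_url, duplicate_key)
```

Each of the three inserts should be refused, leaving only the first EMBL row in the table.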

• CHECK constraint
    numberoflegs integer CHECK (numberoflegs>2)

publicdatabase table

CREATE TABLE publicdatabase (
    databasename varchar(50),
    datatype varchar(20),
    url varchar(200) UNIQUE,
    primary key (databasename)
);

Constraints

• To keep links between tables working you need to preserve the matching key values (referential integrity)
    • This is set up automatically when you declare the primary and foreign keys
• It prevents you from deleting a record whose primary key is still in use until you have deleted all the child records that hold it as a foreign key
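
A minimal sketch of this behaviour, assuming Python's built-in sqlite3 module and an in-memory SQLite database rather than PostgreSQL. Two points are specific to the stand-in: SQLite only enforces foreign keys after PRAGMA foreign_keys = ON, and the tables are cut down to the columns needed here. The sample rows and the helper name rejected are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE organism (
        organismnumber integer,
        commonname     varchar(100),
        primary key (organismnumber)
    );
    CREATE TABLE sequence (
        accessionnumber varchar(50),
        sourceorganism  integer,
        primary key (accessionnumber),
        foreign key (sourceorganism) references organism(organismnumber)
    );
    INSERT INTO organism VALUES (1, 'rabbit');
    INSERT INTO sequence VALUES ('OCPHOS2', 1);
""")

def rejected(sql):
    """Return True if the statement is refused because it breaks a constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

# A child row pointing at an organism that does not exist.
orphan_insert = rejected("INSERT INTO sequence VALUES ('Q8MJF7', 99)")
# A parent row that a sequence record still refers to.
parent_delete = rejected("DELETE FROM organism WHERE organismnumber = 1")

# Once the child record is gone, the parent can be deleted.
conn.execute("DELETE FROM sequence WHERE sourceorganism = 1")
parent_delete_after = rejected("DELETE FROM organism WHERE organismnumber = 1")
print(orphan_insert, parent_delete, parent_delete_after)
```

The first two statements should be refused and the last delete accepted, which is the child-before-parent order described above.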

• Example from 'BigHit' database

organism table

CREATE TABLE organism (
    organismnumber integer,
    species varchar(100),
    strain varchar(100),
    genomeseq boolean,
    commonname varchar(100),
    primary key (organismnumber)
);

How shall we create the unique primary key values for organismnumber?

Sequences

• A sequence is a database object in PostgreSQL which gives you an automatically incrementing numeric value (equivalent to '' in Access)

CREATE SEQUENCE my_seq;
(you can also specify the increment, min and max)

SELECT NEXTVAL('my_seq');

Default values

• We still don't want to have to select the value each time
• We can set a default value for a column, which is filled in automatically every time a record is inserted

CREATE TABLE organism (
    organismnumber integer DEFAULT NEXTVAL('my_seq'),
    species varchar(100),
    strain varchar(100),
    genomeseq boolean,
    commonname varchar(100),
    primary key (organismnumber)
);

Create your database

• To create your database, run your SQL table and other object creation statements in a single script
    • Example - demodatabase.
• Be sure to create tables in the right order
    • You can't create a table that refers to a primary key in a table that doesn't exist yet
• You also need data...

Populate your database

• Insert data using SQL INSERT statements

INSERT INTO organism (species, strain, genomeseq, commonname)
VALUES ('Oryctolagus cuniculus', NULL, 'false', 'rabbit');

• Default values will be inserted automatically

CREATE TABLE organism (
    organismnumber integer DEFAULT NEXTVAL('my_seq'),
    ...
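
The automatic numbering can be tried in a small sketch, with one loud assumption: it uses Python's built-in sqlite3 module, and SQLite has no CREATE SEQUENCE or NEXTVAL. An "integer primary key" column, which SQLite numbers by itself, stands in for DEFAULT NEXTVAL('my_seq') here, so this shows the effect rather than the PostgreSQL mechanism. The second row is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for DEFAULT NEXTVAL('my_seq'): SQLite fills an
# "integer primary key" column automatically when it is omitted.
conn.execute("""
    CREATE TABLE organism (
        organismnumber integer primary key,
        species        varchar(100),
        strain         varchar(100),
        genomeseq      boolean,
        commonname     varchar(100)
    )
""")

# organismnumber is left out of both column lists.
conn.execute(
    "INSERT INTO organism (species, strain, genomeseq, commonname) "
    "VALUES ('Oryctolagus cuniculus', NULL, 0, 'rabbit')"
)
conn.execute(
    "INSERT INTO organism (species, strain, genomeseq, commonname) "
    "VALUES ('Mus musculus', NULL, 1, 'mouse')"
)

rows = conn.execute(
    "SELECT organismnumber, commonname FROM organism ORDER BY organismnumber"
).fetchall()
print(rows)
```

Neither INSERT names organismnumber, yet each row ends up with its own key value.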

• Be sure to insert data in the correct order
    • Don't try to insert a foreign key value when the primary key value hasn't been inserted yet
• Run the demodatabase.sql script

Querying your database

• Now that your database is set up and data has been inserted we can query it

Database  | Sequence Type     | Name                                          | ID / Accession Number | Organism
Swissprot | Protein           | Phosphorylase B kinase alpha regulatory chain | KPB1_Rabit            | Rabbit
UniProt   | Protein           | Troponin T                                    | Trt3_rabit            |
Swissprot | Proten            | Glycogen phosphorylase                        | PHS2_RABIT            | rabbit
Swissprot | protein           | Troponin I                                    | TRIC_RABIT            |
EMBL      | Nucleotide        | rabbit muscle phosphorylase mrna              | OCPHOS2               | rabbit
dbEST     | Nucleotide        |                                               | CK829726              | Rabbit
TrEMBL    | protein           | pol protein                                   | Q8MJF7                | Rabbit
EMBL      | Nucleotide        | Rabbit phosphorylase                          | OXPKA                 | Rabbit
PDB       | protein_structure | Glycogen Phosphorylase                        | 1ABB                  | rabit
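
As a rough illustration of the kind of query the normalised design allows, here is a sketch that joins the three tables back into one listing. It assumes Python's built-in sqlite3 module with an in-memory SQLite database in place of PostgreSQL, tables cut down to the columns the query needs, and sample rows loosely based on the table above. The accession numbers A0001 to A0003 are placeholders, not real accessions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publicdatabase (
        databasename varchar(50),
        datatype     varchar(20),
        primary key (databasename)
    );
    CREATE TABLE organism (
        organismnumber integer,
        commonname     varchar(100),
        primary key (organismnumber)
    );
    CREATE TABLE sequence (
        accessionnumber varchar(50),
        id              varchar(50),
        sourcedatabase  varchar(50),
        sourceorganism  integer,
        primary key (accessionnumber),
        foreign key (sourcedatabase) references publicdatabase(databasename),
        foreign key (sourceorganism) references organism(organismnumber)
    );
    -- Each data type and organism name is stored exactly once.
    INSERT INTO publicdatabase VALUES ('Swissprot', 'Protein');
    INSERT INTO publicdatabase VALUES ('EMBL', 'Nucleotide');
    INSERT INTO organism VALUES (1, 'rabbit');
    -- Placeholder accession numbers.
    INSERT INTO sequence VALUES ('A0001', 'PHS2_RABIT', 'Swissprot', 1);
    INSERT INTO sequence VALUES ('A0002', 'TRIC_RABIT', 'Swissprot', 1);
    INSERT INTO sequence VALUES ('A0003', 'OCPHOS2', 'EMBL', 1);
""")

# Follow the foreign keys from sequence to its database and its organism.
rows = conn.execute("""
    SELECT s.sourcedatabase, p.datatype, s.id, o.commonname
    FROM sequence s
    JOIN publicdatabase p ON p.databasename = s.sourcedatabase
    JOIN organism o ON o.organismnumber = s.sourceorganism
    WHERE o.commonname = 'rabbit'
    ORDER BY s.id
""").fetchall()
for row in rows:
    print(row)
```

Because 'Protein' and 'rabbit' each exist in one place only, every row of the result spells them the same way.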

• What have we gained?
    • No data redundancy
    • Data is consistent
    • Enforced quality control: no missing data
    • Only have to change data once
    • Flexibility to run a variety of queries

Views

• Views are queries that are saved in the database as objects
• They appear much like a table and can be queried in the same way
• Good if the underlying query is very complex

CREATE VIEW viewname AS query
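
A small sketch of the CREATE VIEW pattern, assuming Python's built-in sqlite3 module and an in-memory SQLite database instead of PostgreSQL. The view name sequence_organism, the cut-down tables and the placeholder accession numbers are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organism (
        organismnumber integer,
        commonname     varchar(100),
        primary key (organismnumber)
    );
    CREATE TABLE sequence (
        accessionnumber varchar(50),
        id              varchar(50),
        sourceorganism  integer,
        primary key (accessionnumber),
        foreign key (sourceorganism) references organism(organismnumber)
    );
    INSERT INTO organism VALUES (1, 'rabbit');
    INSERT INTO sequence VALUES ('A0001', 'TRIC_RABIT', 1);
    INSERT INTO sequence VALUES ('A0002', 'OCPHOS2', 1);

    -- The join is written once and saved under a name.
    CREATE VIEW sequence_organism AS
        SELECT s.id, o.commonname
        FROM sequence s
        JOIN organism o ON o.organismnumber = s.sourceorganism;
""")

# The view is then queried just like a table.
rows = conn.execute(
    "SELECT id FROM sequence_organism WHERE commonname = 'rabbit' ORDER BY id"
).fetchall()
print(rows)
```

The query against the view never mentions the join, which is the point when the underlying query is complex.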

Indexes

• Searching data by scanning is slow
• Indexes make this searching faster
• Implicit indexes are set up for primary keys as these are used a lot for searching data
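
The difference between a scan and an index lookup can be seen by asking the database how it plans to run a query. This sketch assumes Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN, whose output is worded differently from PostgreSQL's EXPLAIN; the helper name plan is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE publicdatabase (
        databasename varchar(50),
        datatype     varchar(20),
        url          varchar(200),
        primary key (databasename)
    )
""")

def plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)  # last column holds the description

# Search on the primary key: the implicit index is available.
by_key = plan("SELECT * FROM publicdatabase WHERE databasename = 'EMBL'")
# Search on a column with no index: every row has to be read.
by_type = plan("SELECT * FROM publicdatabase WHERE datatype = 'Protein'")
print(by_key)
print(by_type)
```

The first plan should mention an index and the second a scan; the exact wording depends on the SQLite version.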

• An index can be created on any column

CREATE INDEX orgname_idx ON organism (commonname);

• An index is helpful on a column that is regularly searched on (i.e. used in the WHERE clause)

Index worked example

• There are more movies in the file demodata/moremovies.csv
• These are already loaded into the database table 'demodata.moremovies'
• In PGAdmin3:
    • INSERT INTO bighit.movie (SELECT * FROM demodata.moremovies);
    • SELECT * FROM movie WHERE rating = 'U';

• Explain the query
• Now make an index:
    • CREATE INDEX myindex ON movie (rating);
• Now explain the original query again
• This works for very complex queries!

Complex Query Analysis

The “MART” Strategy

• Normalisation is the process of removing data redundancy from your database design
    • But it adds complexity
• Views can make querying simpler
• Indexes can make querying faster
• But... sometimes this is not enough. Maintaining a summary table for quick querying is known as 'denormalisation'
• Many large databases (e.g. EnsEMBL) resort to this

More features...

• stored procedures
• triggers
• cascading updates
• custom types
• custom functions
• extension modules
• load balancing
• replication
• ...