Identifying Disease Genes
Genomics for today

• Cancer genomics
• Reproductive health
• Forensic genomics
• Agrigenomics
• Complex disease genomics
• Microbial genomics
• Genomics in drug development
• and more …omics

Data/File formats

• File format: a standardized way of encoding data for storage in a computer file
• Storage, access, sharing, interpretation, security, etc.

http://en.wikipedia.org/wiki/Data_format
Bioinformatics for dummies: http://www.dummies.com/how-to/content/bioinformatics-data-formats.html

Scientific data formats

• 23andMe
• AB1 (Chromatogram files used by DNA sequencing instruments from Applied Biosystems)
• ABCD (Access to Biological Collection Data)
• ABCDDNA (Access to Biological Collection Data DNA extension)
• ABCDEFG (Access to Biological Collection Data Extension For Geosciences)
• ACE (Sequence assembly format)
• Affymetrix Raw Intensity Format
• ARLEQUIN Project Format
• Axt Alignment Format
• BAM (Binary compressed SAM format)
• BED (Browser extensible display format describing genes and other features of DNA sequences)
• BEDgraph
• Big Browser Extensible Data Format
• Big Wiggle Format
• Binary Alignment Map Format
• Binary Probe Map Format
• Binary sequence information Format
• Biological Pathway eXchange
• BLAT alignment Format
• BRIX generated O Format
• CAF (Common Assembly Format for sequence assembly)
• CellML
• CHADO XML interchange Format
• Chain Format for pairwise alignment
• CHARMM Card File Format
• CLUSTAL-W Alignment Format
• CLUSTAL-W Dendrogram Guide File Format
• Clustered Data Table Format
• Complete Genomics
• DELTA (DEscription Language for TAxonomy)
• DAS (Distributed Sequence Annotation System)
• DBN (Dot Bracket Notation, Vienna Format)
• EMBL (Flatfile format used by the EMBL for nucleotide and peptide sequences)
• EML (Environmental Markup Language), not to be confused with EML (Ecological Metadata Language)
• ENCODE (Peak information Format)
• FASTA and FASTQ (File formats for sequence data; FASTQ adds quality scores)
• FuGEFlow
• FuGE-ML (Functional Genomics Experiment Markup Language)
• Gating-ML
• GCDML (Genomic Contextual Data Markup Language)
• GelML (Gel electrophoresis Markup Language)
• GenBank (Flatfile format used by NCBI for nucleotide and peptide sequences)
• Gene Feature File (Versions 1 and 3)
• GFF (General feature format for describing genes and other features of DNA, RNA and protein sequences)
• Gene Prediction File Format
• GenePattern GeneSet Table Format
• Genome Annotation File (versions 1 and 2)
• GTF (Gene transfer format; holds information about gene structure)
• HMMER
• ICB (ICM binary file Format)
• Image Cytometry Standard (ICS)
• imzML (imaging mz Markup Language)
• ISA-Tab (Investigation Study Assay Tabular)
• ISND sequence record XML
• KGML (KEGG Mark-up Language)
• MAGE-Tab (MicroArray Gene Expression Tabular)
• MCL (Microbiological Common Language)
• MIARE-TAB (Minimum Information About a RNAi Experiment Tabular)
• microarray track data Browser Extensible Data Format
• MINiML (MIAME Notation in Markup Language)
• mini Protein Data Bank Format
• MIQAS-TAB (Minimal Information for QTLs and Association Studies Tabular)
• MITAB
• mmCIF (macromolecular Crystallographic Information File)
• Multiple Alignment Format
• mzData (deprecated)
• mzIdentML
• mzML
• mzQuantML
• mzXML (deprecated)
• NCD (Natural Collections Descriptions)
• NDTF (Neurophysiology Data Translation Format)
• net alignment annotation Format
• NeuroML (Neuroscience eXtensible Markup Language)
• New Hampshire eXtended Format
• Newick tree Format
• NEXUS (Encodes mixed information about genetic sequence data in a block structured format)
• Nimblegen Design File Format
• Nimblegen Gene Data Format
• NMR-STAR (NMR Self-defining Text Archive and Retrieval format)
• nucleotide information binary Format
• ODM (Operational Data Model)
• Open Biomedical Ontology Flat File Format
• Personal Genome SNP Format
• PHD (Output from the basecalling software Phred)
• phyloXML (XML for evolutionary biology and comparative genomics)
• Pre-Clustering File Format
• Protein Data Bank (PDB; Structures of biomolecules deposited in Protein Data Bank)
• Protein Information Resource Format
• PRM (Protocol Representation Model (Medical Research))
• PSI-MI XML
• PSI-PAR
• RDML (Real-time PCR Data Markup Language)
• SAM (Sequence Alignment/Map format)
• SCF (Staden chromatogram files used to store data from DNA sequencing)
• SBML (Systems Biology Markup Language used to store biochemical network computational models)
• SDD (Structured Descriptive Data)
• SED-ML (Simulation Experiment Description Markup Language)
• Sequence Alignment Map Format
• SOFT (Simple Omnibus Format in Text)
• spML (Separation Markup Language)
• SRA-XML (Short Read Archive eXtensible Markup Language)
• Standard Flowgram Format
• Stockholm Multiple Alignment Format (Representing multiple sequence alignments)
• SBGN (Systems Biology Graphical Notation)
• SBRML (Systems Biology Results Markup Language)
• Swiss-Prot (Flatfile format used for protein sequences from the Swiss-Prot database)
• TAIR annotation data Format
• TAPIR (TDWG Access Protocol for Information Retrieval)
• TCS (Taxonomic Concept transfer Schema)
• TraML (Transition Markup Language)
• UniProtKB XML Format
• VCF (Variant Call Format)
• Wiggle Format

http://en.wikipedia.org/wiki/List_of_file_formats#Biology
http://fileformats.archiveteam.org/wiki/Scientific_Data_formats

FASTA format

    >Description line
    Sequence

    >gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
    MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGF

Sequence identifiers

    Database                         Identifier syntax
    GenBank                          gb|accession|locus
    EMBL Data Library                emb|accession|locus
    DDBJ, DNA Database of Japan      dbj|accession|locus
    NBRF PIR                         pir||entry
    Protein Research Foundation      prf||name
    SWISS-PROT                       sp|accession|entry name
    Brookhaven Protein Data Bank     pdb|entry|chain
    Patents                          pat|country|number
    GenInfo Backbone Id              bbs|number
    General database identifier      gnl|database|identifier
    NCBI Reference Sequence          ref|accession|locus
    Local Sequence identifier        lcl|identifier

File extension

    Extension      Meaning                           Notes
    faa            FASTA amino acid                  Contains amino acids. A multiple protein FASTA file can have the more specific extension mpfa.
    fasta (.fas)   generic FASTA                     Any generic FASTA file. Other extensions can be fa, seq, fsa.
    ffn            FASTA nucleotide coding regions   Contains coding regions for a genome.
    fna            FASTA nucleic acid                Used to generically specify nucleic acids.
    frn            FASTA non-coding RNA              Contains non-coding RNA regions for a genome, in DNA alphabet, e.g. tRNA, rRNA.

BioXSD: the common data-exchange format for everyday bioinformatics web services

• for basic bioinformatics data
• XML schema
• syntax for biological sequences, annotations, alignments, and references to resources

http://bioinformatics.oxfordjournals.org/content/26/18/i540.full

8. How to identify disease biomarkers
Balaji Rajashekar
20.11.14

Human

By mass, human cells consist of 65–90% water (H2O). Oxygen therefore contributes a majority of a human body's mass. Almost 99% of the mass of the human body is made up of six elements: oxygen, carbon, hydrogen, nitrogen, calcium, and phosphorus. About 0.85% is composed of another five elements: potassium, sulfur, sodium, chlorine, and magnesium. The remaining elements are trace elements. Note that not all elements found in the human body play a role in life.

Elemental composition

The average 70 kg adult human body contains approximately 6.7 × 10^27 atoms and is composed of 60 chemical elements. The elements needed for life are relatively common in the Earth's crust, and conversely most of the common elements are necessary for life. An exception is aluminium, which is the third most common element in the Earth's crust (after oxygen and silicon).

Contents: Flesh, Blood,
http://en.wikipedia.org/wiki/Composition_of_the_human_body

Composition

The composition can also be expressed in terms of chemicals, such as:

• Water
• Proteins – including those of hair, connective tissue, etc.
• Fats (or lipids)
• Apatite in bones
• Carbohydrates such as glycogen and glucose
• DNA
• Dissolved inorganic ions such as sodium, potassium, chloride, bicarbonate, phosphate
• Gases such as oxygen, carbon dioxide, nitrogen oxide, hydrogen, carbon monoxide, methanethiol. These may be dissolved or present in the gases in the lungs or intestines.
• Many other small molecules, such as amino acids, fatty acids, nucleobases, nucleosides, nucleotides, vitamins, cofactors.
• Free radicals such as superoxide, hydroxyl, and hydroperoxyl.

Materials

Body composition can also be expressed in terms of various types of material, such as:

• Muscle
• Fat
• Bone and teeth
• Brain and nerves
• Connective tissue
• Blood – 7% of body weight
• Lymph
• Contents of digestive tract, including intestinal gas
• Urine
• Air in lungs

Composition by cell type

There are many species of bacteria and other microorganisms that live on or inside the healthy human body. In fact, 90% of the cells in (or on) a human body are microbes, by number (much less by mass or volume). Some of these symbionts are necessary for our health. Those that neither help nor harm us are called commensal organisms.

Data and analysis formats: NetCDF

Population growth in cities in corresponding year

    City           1900    1940    1970    2000
    Los Angeles    0.102   1.504   2.812   3.695
    Washington     0.279   0.663   0.757   0.572
    New York       3.437   7.455   7.896   8.008
    Seattle        0.081   0.368   0.531   0.563
    London         6.528   8.197   7.452   7.322

The same table as a NetCDF text dump:

    netcdf out {
    dimensions:
        __string = 11 ;
        n = 4 ;
        m = 5 ;
    variables:
        char empty(__string) ;
        int year(n) ;
        char city(m, __string) ;
        float population(m, n) ;

    // global attributes:
        :__str_len = 11 ;
    data:

     empty = "" ;

     year = 1900, 1940, 1970, 2000 ;

     city =
      "Los Angeles",
      "Washington",