FR3 Primer v1 June 23, 2014

FR3 Bioinformatics Primer: Glossary of terms*

(*from various sources including Discovering , Proteomics, & Bioinformatics 2nd edition, Campbell and Heyer; The Internet and the New by Peruski and Peruski; and Bioinformatics for Dummies by Claverie and Notredame)

Accession number identification Antisense technology molecular number given to every DNA and method that uses a nucleic acid sequence submitted to NCBI sequence complementary to an or equivalent database. mRNA so that the two bind and the mRNA is effectively neutralized. Algorithm step-by-step procedure for solving a problem (e.g. aligning Array an orderly pattern of objects. two sequences) or computing a In genomic studies, there are quantity (e.g. %GC). Typically microarrays and macroarrays. written in Perl or another computer Microarrays are small spots of DNA language. or protein and the identity of the spotted material is known. Alignment representation of two or Macroarrays are bacterial, or more protein or similar colonies on plates used to sequences where identical amino determine functional consequences acids or are in the same of genomic manipulations. columns while missing amino acids or nucleotides are replaced with Anonymous FTP A file transfer gaps. protocol (FTP) that allows the retrieval of files from public sites. Allele prevalence of a variant in a population. Archive a collection of data, text, programs or other electronic Annotate a is annotated information stored for other parties to once it has been analyzed for gene access or retrieve, typically without content. A gene is considered charge. annotated if it has been assigned information pertaining to a cellular ASCII stands for American Standard role. Code for Information Interchange, sets a unique numeric definition for Antibody array An antibody each of the most commonly used microarray is a specific form of characters in Western language, protein microarrays, a collection of including numerals, letters and some capture antibodies are spotted and symbols. An ASCII set is these fixed on a solid surface such as characters and their numbers. In the glass, plastic or silicon chip, for the context of a file, an ASCII file is one purpose of detecting antigens. that contains only text characters:

1 FR3 Bioinformatics Primer v1 June 23, 2014 numbers, letters and standard punctuation. Bootstrap analysis a statistical technique used to assess the Bacterial artificial reliability of the branching structure (BAC) cloning vector replicated in in phylogenetic trees. bacteria that can hold an insert of about 150,000 base pairs. Bowtie software that maps short DNA sequences to a reference BAM file a binary version of a SAM genome. file, the format is .bam. Browser client software that reads BankIt a computer program HTML-formatted documents and developed by the NCBI for displays them on your computer submitting your own sequences to screen. GenBank. cDNA, see complementary DNA Binomial a probability model for counting the number of ‘successes’ CDS acronym for ‘coding sequences’ in an experiment with two possible outcomes. Cellular component a term coined by Gene Ontology to describe Bioinformatics a field of study that subcellular structures, locations and extracts biological information from macromolecular complexes such as large data sets such as sequences, nucleus, telomere, and mitotic protein interactions, microarrays, etc. spindles. This field also includes the area of data visualization. ChIP-Seq, chromatin immunoprecipitation sequencing Biological process coined by Gene combination of chromatin Ontology to describe broad cellular immunoprecipitation and massively outcomes, such as mitosis or energy parallel sequencing used to survey production, that are accomplished by interactions between protein, DNA ordered assemblies of molecules. and RNA.

BLAST the protein and nucleic acid Chromatogram a four-colored graph sequence search engine developed produced from nonradioactive at NCBI that allows you to search dideoxy sequencing methods sequence databases. BLASTn including cycle sequencing. searches for nucleotide sequences, BLASTp searches for Clade group of related species and sequences; BLAST2 compares 2 their common ancestors. sequences. Cladogram phylogenetic tree BLOSUM, block scoring matrix, a showing the relationship between the popular substitution matrix for species. aligning protein sequences.

2 FR3 Bioinformatics Primer v1 June 23, 2014

Client In its simplest form a client is Conserved domain (CD) a domain a software program that an individual that has been retained during uses to send requests to a server. In presumably due to its a broader definition it can mean a essential role within the protein’s computer or computer program that structure. Conserved domain requests a service of a host searches are a part of the BLAST computer or program. search.

Clone noun or verb. A clone is any Constitutive always active; molecule// present in constitutive promoters or are more than one identical copy. To active at all times. clone something means to produce more than one copy of the original Contiguous (contigs) overlapping molecule/cell/organism. DNA segments that as a collection form a longer and gapless segment ClustalW the most popular program of DNA. for making multiple sequence alignments. Correlation coefficient a measure of the correspondence between two Clustering action of grouping sets of data. sequences according to their similarity Covariance analysis the study of a sequence that follows the evolution Clusters of orthologous groups of two different positions and looks COG NCBI compilations of for some correlation. evolutionarily related gene sequences from several microbial Coverage based on the number of . This site allows you to bases sequenced in a genome, the search by gene or cellular role and coverage represents how many produce dendrograms to show times an average base was sequence similarities. sequenced; finished genomes frequently have 8X coverage. Comparative genomics analysis based on comparisons of entire Cufflinks software that accepts genomes. aligned RNAseq reads and estimates the relative abundances of Complementary DNA abbreviation transcripts. Also tests for differential used at NCBI to indicate which expression between samples. bases constitute the open . Cy3 and Cy5 fluorescent green and red dyes, respectively, that are Consensus a pseudo sequence commonly used for microarray where each residue summarizes experiments. (using the majority rule) the content of a column in a multiple sequence Curate oversee/annotate/manipulate alignment. gene data (often manually). Includes

3 FR3 Bioinformatics Primer v1 June 23, 2014 fine annotation of genes, filtering of Draft sequence a description of the genes according to certain degree of confidence in a DNA parameters and construction and sequence. When the DNA has been manipulation of gene models. sequenced only 4 times it is described as a ‘draft sequence’ as Dendrogram another name for opposed to a ‘finished sequence’ phylogenetic tree. The ClustalW which has been sequenced 8 times. guide tree is often referred to as a With a draft sequence about 95% of dendrogram, it is a file with a .dnd the genes should be identifiable. extension. dsDNA double stranded DNA, Distance matrix matrix that contains formed when two complementary all the pairwise distances between DNA sequences bind each other. sequences in a data set. Distance matrices afe often u sed to compute dsRNA double stranded RNA, phylogenetic trees. Not a formed when two complementary substitution matrix. RNA sequences bind to each other.

DNA microarray or DNA chip E value (Expect value) when synonyms for gene sequences performing a BLAST search, you will spotted on glass slides used to obtain an E-value for each sequence measure simultaneously the level of that is retrieved. An E-value can be transcription of many genes. thought of as the probability that two sequences are similar to each other Domain (computational) a network by chance. Therefore E-values are or portion thereof that operates best when they are small (e.g. 1 x under a single administrative 10-12). umbrella such as an institution, a group of institutions, a geographical EC number, Enzyme Commission region, or a country. number a numerical classification scheme for enzymes based on the Domain (molecular) a protein unit chemical reactions they catalyze. of at least 50 amino acids that can fold on its own, can also be used for EBI, European Bioinformatics protein sequences that occur in Institute the European counterpart various contexts. to the NCBI in the U.S.

Dot matrix dot plot, a method for EMBL, European Molecular representing the similarity between Biology Laboratory, this acronym two sequences without using an refers to the nucleotide database alignment. that the laboratory maintains.

Downstream a relative direction for Entrez the NCBI database querying a DNA sequence; towards the 3’ system that is similar to SRS at the end. EBL.

4 FR3 Bioinformatics Primer v1 June 23, 2014

Epitope a part of an antigen that is commend method of transferring recognized by the immune system. files across networks and between remote computers. EST, expressed sequence tag a short subset of a cDNA sequence. GA II Illumina sequencing platform that generates 40 Mb sequence per Expansionist a person who takes a run with paired read lengths of 70 systems approach to understanding bp. complex, interconnected parts working as a whole; opposite of a Gap in sequencing a segment of reductionist. DNA that has not been sequenced but is flanked by sequenced DNA. ExPASy a server maintained by the Also used as a synonym for insertion Swiss Institute of Bioinformatics. It is or deletion in sequence alignment. the home of SWISS-PROT, the annotated protein database. GenBank developed and housed at NCBI, GenBank is the U.S. FASTA simple text format for DNA repository for all DNA and protein or protein. sequences.

Field in the context of databases, a Gene ID a unique alphanumeric well-delimited part of an entry identifier assigned to a gene. containing information of a precise nature (author, date, etc.). You can Gene ontology a collaborative effort normally search databases by of investigators to unify and explicitly targeting specific fields. standardize terms associated with the role a gene or protein plays in an Fileserver a computer that provides organism. Represented model files to other computers via a include Drosophila network. melanogaster, Saccharomyces cervisiae, Schizosaccharomyces Finished sequence DNA has been pombe, Mus musculus, Arabidopsis sequenced at least eight times and thaliana, Caenorhabditis elegans, there are no gaps, and contains no Rattus norvegicus and Dictyostelium more than 1 error in 10,000 base discoideum. pairs. Genome the complete set of genes FPKM Fragments Per Kilobase Of or genetic material present in a cell Per Million Fragments Mapped; or organism. approximates the relative abundance of transcripts in terms of fragments Genome browser is a graphical observed from an RNAseq interface for display of information experiment. from a biological database for genomic data. FTP, file transfer protocol, an internet protocol that defines a

5 FR3 Bioinformatics Primer v1 June 23, 2014

Genomic colocation comparison of genes, tumors or other objects into genome structure between two or dendrograms. more genomes. High-throughput methods that Genomics a vague term that produce large volumes of data and encompasses the study of reference can process many samples quickly. genome sequences, variations within Robots and computerized data a species’ genome, DNA collection are common themes in microarrays, circuits and systems high-throughput methods. biology. Some people include wider areas such as proteomics, HiSeq Illumina sequencing platform , etc., under the that generates 120 Gb sequence per genomics umbrella. run with paired read lengths of 125 bp. GFF, generic feature format a data format for identifying the features of HiSeq2500 Illumina sequencing a sequence that involves data platform that generates 1,000 Gb featured in a tab-delimited file with sequence per run with paired read one feature per line to enable text lengths of 125 bp. manipulation and data analysis tools that work with tabular data. Hits shorthand for sequences returned when searching a database Global alignment an alignment of such as NCBI. two sequences where no amino acid or nucleotide is discarded. They are HMM, Hidden Markov Model, a all either aligned with other amino mathematical formulation of a acids/nucleotides or aligned with succession of hidden, mutually gaps. exclusive properties (such as or exon) associated with one Haplotype a collection of alleles in sequence or a multiple sequence one individual that are located on alignment. In the context of one chromosome. Alleles within a HMM is used interchangeably with haplotype are often inherited as a the words ‘motifs’ or ‘profiles’. In the single unit from one generation to context of DNA, HMMs are used to the next. In SNP studies, haplotypes predict the location and structure of refer to a group of genomic genes. variations found repeatedly in many individuals within a population. Homolog/ – two sequences (DNA or amino acid) that Heat map graphical representation are similar due to evolutionary of data where the individual values relatedness. contained in a matrix are represented as colors. Horizontal transfer the movement of DNA from one species to another Hierarchical clustering a method without sexual transmission; for organizing large numbers of mechanism uncertain.

6 FR3 Bioinformatics Primer v1 June 23, 2014

Host A computer that allows users to Indels collective noun that refers to communicate with other computers insertions or deletions of bases in or hosts on a network. DNA sequences.

HTML hypertext markup language, Induced a gene with increased the standard language used for transcription creating hypermedia documents within the World Wide Web. Intergenic sequence DNA sequence between two genes, HTTP hypertext transfer protocol, the sometime referred to as ‘junk DNA’. standard language that World Wide Web clients and servers use to Internet a global collective of communicate. computer networks running TCP/IP.

Hydropathy plot (Kyte-Doolittle) a InterPro a federative database of acomputer-generated graph that profiles, patterns and HMMs for uses the amino acid sequence to detecting protein domains. predict whether or not the protein will span a membrane. Intron portion of the gene and initial RNA transcript that will be excised Hypermedia Hypertext that includes and not included in the mRNA; links to other forms of media. usually begins with the sequence GT and ends with AG. Hypertext The linking of one document to another document or to Ion Proton II sequencing platform another location within the same available through Gibco document. Technologies, generates 60 Gb sequence per run, with paired read Identity percentage of identical lengths of 150 bp. amino acids or nucleotides in the alignment of two sequences. Ion Torrent sequencing platform Usually the identity is measured on available through Gibco Life the aligned residues with gaps Technologies, generates 30-120 Mb ignored. sequence per run, with paired read lengths of 400 bp. Illumina a company located in San Diego, CA that provides sequencing Isoforms/isozymes two versions of platforms such as MiSeq, HiSeq, highly similar proteins, isoforms refer HiSeq2500, NextSeq and GA II. to any proteins, isozymes is used for enzymes. In silico experimental process performed on a computer and not by Isoelectric point (pI) when the net bench research. charge of a protein is zero the pH of the local environment will be equivalent to the isoelectric point of the protein.

7 FR3 Bioinformatics Primer v1 June 23, 2014

Iteration when a process is repeated Low-complexity filtering removal of in an attempt to reach the ideal low-complexity sequences when outcome. Each iteration is slightly doing a database search. different from the previous one since we learn from the first and improve Low complexity sequence a the second iteration. sequence that contains a few elements repeated many times. Jalview a popular tool for editing and analyzing multiple sequence Macroarrays/Microarrays, see alignments. Arrays

Java applet a small piece of Mapping graphical representation of software that your browser installs where genes are found on automatically and that runs on your in relation to each computer. other.

Kb kilobase, 1,000 base pairs of , MS a DNA sequence. technique that allows investigators to separate proteins based on their KEGG, Kyoto Encyclopedia of mass to charge ration (m/z). The Genes and Genomes, a world- m/z for each protein allows them to famous Japanese database on be identified and quantified from genomes and biochemical pathways. complex mixtures. This proteomics tool is often used in pairs and called Kyte-Doolittle plot, see Hydropathy tandem mass spectrometry plot. (MS/MS). Protein samples are first ionized then inserted into MS/MS by LAN local area network, two or more either laser-based methods (MALDI, computers connected together via SELDI) or an electrospray method cabling. (ESI).

Lateral transfer, see Horizontal Megabase (Mb) 1,000 kilobases or 1 transfer, a gene acquired from million bases of DNA. another organism as opposed to inheriting it from an ancestor. Metabolic pathway the sequential reactions that taken together lead Linkage when two genes are from one substance to another. located near each other on the same chromosome. Metabolome term coined to encompass the entire metabolic Local alignment an alignment of the content of a cell or organism. most similar segments between sequences. Dissimilar sequences Metadata Data concerning or are not considered and are removed describing some core data, as in the final output. BLAST does local distinct from the primary data that is alignments. being described. This includes

8 FR3 Bioinformatics Primer v1 June 23, 2014 metadata on the origin, source, Multiplex when a series of reagents history, ownership or location of are mixed in a single tube so more some thing. than one outcome will be produced simultaneously; multiplex PCR Microarray technology used for produces several different sized parallel and DNA bands of DNA that can be detected. homology analysis using an orderly arrangement of thousands of NCBI National Center for identified sequenced genes printed Information a on a solid chip. federally funded part of the U.S. National Library of Medicine. NCBI Microsatellite a short segment of is the home of GenBank, BLAST, DNA (2-50 bases) repeated multiple COG, and many other genomic times. Microsatellites vary in length databases and computational tools. and base composition which makes them useful tools for distinguishing Neighbor joining, NJ the most members of a population. popular method for reconstructing phylogenetic trees. MiSeq Illumina sequencing platform that generates 15 Gb sequence per NextSeq Illumina sequencing run with paired read lengths of 300 platform that generates 120 Gb bp. sequence per run with paired read lengths of 150 bp. Molecular function coined by Gene Ontology, describes tasks performed Normal distribution data have an by individual gene products such as average value around which all the transcription factor or calcium individuals are clustered. A graph of transportation. normal distribution results in the classic ‘bell curve’. Motif short recurring patterns in DNA that are thought to have a biological Normalized data that has been function. corrected or standardized, for example, by subtracting the sample Multidimensional scaling the mean from each observation, and purpose of multidimensional scaling dividing the result by the sample (MDS) is to provide a visual standard deviation. representation of the pattern of proximities (i.e., similarities or NR the non-redundant protein distances) among a set of objects. database that contains all the The relationship between input putative protein sequences proximities and distances among contained in the nucleotide points on the map is positive: the databases. Its European equivalent smaller the input proximity, the is TrEMBL. closer (smaller) the distance between points, and vice versa.

9 FR3 Bioinformatics Primer v1 June 23, 2014

Offline describes actions or tasks Parsimony a technique for performed when not connected to reconstructing phylogenetic trees. another computer via a network. Patterns conserved residues that Oligonucleotide (oligos) ssDNA one can use as a functional polymers of unspecified length. The signature. oligo sequence is determined by the investigator and synthesized . Perl script a program written in Perl, Oligos are used to probe blots, prime which is optimized for manipulating sequencing reactions, and PCR, as and finding patterns in sequences. well as for spotting on DNA microarrays. Pfam a collection of profiles for detecting domains and protein Online describes actions or tasks families. performed when connected to another computer via a network. PIR protein information resource, an annotated protein database similar to ORF, Open reading frame a portion SWISS-PROT. PIR is also the name of a cDNA or gene that begins with a of a sequence format that is similar and ends with the stop to FASTA. codon. Synonym for coding sequence (CDS) on GenBank Polymorophism alternative forms of results. genes and other sequences.

Overexpress when genes are Postgebnomic era the time in bioengineered to produce excessive biology after entire genomes have amounts of protein, they overexpress been sequenced routinely. The their encoded proteins. postgenomic era began around the year 2000. Orthologue two genes in different species that are evolutionarily Postranscriptional control after a related. gene has been transcribed the RNA can be modified and regulated, PacBio RSII sequencing platform processes that constitute available through Pacific Biosciences posttranscriptional control. in Menlo Park, CA that generates 250 Mb sequence per run with an Protein data bank (PDB) database average read length of 4,500 bp. of every protein for which the 3D structure is known, it also contains a Pairwise alignment alignment of few non-protein structures. two sequences. Protein mass fingerprinting (PMF) Paralogue two genes within the identification of proteins by matching same species are called paralogs if the masses of contained in they evolved from the other (evolved the protein with those of known from one ancestral gene). proteins in a database.

10 FR3 Bioinformatics Primer v1 June 23, 2014

Protein microarrays proteomic environment and produces new methods similar to DNA microarrays proteins and at different amounts. in size and scale’ proteins are spotted onto glass and are used to Proteomics the study of determine protein interaction or to that includes determining the 3D identify and quantify molecules found shapes of proteins, their roles inside in various solutions. cells, the molecules with which they interact, and defining which proteins Protein profile a tool for visualizing are present and how much of each is a particular property (hydrophathy, present at a given time. charge, etc.) along a particular sequence by using a sliding window segments of DNA technique. that resemble genes by their sequence of bases but are Protein export domains sequence nonfunctional. Pseudogenes often of nucleic acids in a transcript that have transposons inserted in them, facilitate its transport through nuclear or they may have other pores. that led to their inability to encode a functional protein. Protein microarray a high- throughput method used to track the Public domain software that can be interactions and activities of proteins, freely use, copied, distributed and and to determine their function, and modified (freeware, shareware). determining function on a large scale. Its main advantage lies in the Pubmed an extensive database of fact that large numbers of proteins biomedical literature hosted by NCBI can be tracked in parallel. The chip that is searchable. You can consists of a support surface such as subscribe to PubCrawler and a glass slide, nitrocellulose automatically search PubMed and membrane, bead, or microtitre plate, receive e mail results on a schedule to which an array of capture proteins of your dhoosing. is bound. Probe molecules, typically labeled with a fluorescent dye, are p-value probability associated with a added to the array. Any reaction statistical test of the difference between the probe and the between populations. Populations immobilised protein emits a are considered significantly different fluorescent signal that is read by a if the associated p-value is small laser scanner. (typically 0.1 or smaller).

Proteome the complete collection of Query the sequence used to scan a proteins in a cell//organism at a database for sequence similarity. particular time. Unlike genomes that are stable over the lifetime of the qPCR quantitative PCR, synonym organism, proteomes change rapidly for real-time PCR. as each cell responds to its changing

11 FR3 Bioinformatics Primer v1 June 23, 2014 real time PCR (RT-PCR) a (miRNA) and short inhibitory RNA molecular method that is sometimes (siRNA). confused with reverse transcriptase PCR. Real time PCR uses the RNAseq use of deep sequencing specificity of PCR to measure the technologies for number of template molecules in profiling. your starting material. Router a device that directs or Reductionist a person who dissects routes data between networks or a complex system into increasingly different parts of a network. smaller parts in order to understand it. RPKM , Reads per kilobase per million reads, a measure of relative Reference a reference genome was molar RNA concentration from an sequenced first for a species and RNAseq sample. thus represents a standard but not necessarily ‘normal’ example. The SAM file a text format for storing term ‘reference’ implies that sequence data in a series of tab variations exist within the population, delimited ASCII columns. but the reference is used as a common point for comparison. Sanger sequencing dideoxy DNA sequencing method. Reiterative a process can be described as reiterative if you keep Scaffold a collection of contigs trying to improve the quality with lumped together into one larger each successive attempt. For contig. example, models that describe how genes are regulated are reiterative Secretome all the secreted proteins because each version of the model of a cell/tissue/organism. is built upon more information so the model gradually approaches the Sequence-tagged site (STS) truth. unique locus in a genome defined by PCR primers to amplify a single Repressed when the level of gene locus, used as markers to define transcription is reduced. chromosomal positions and to map genomes. beginning with a gene sequence and deducing its Server a combination of both function afterwards; the opposite of computer hardware and software traditional genetics. that supplies a service to requests submitted by client computers. The RNAi RNA interference, short server processes the requests and dsRNA capable of inactivating genes then returns the results to the client. by blocking the production of the In short, it is a computer that makes encoded proteins via microRNA services available on a network. For

12 FR3 Bioinformatics Primer v1 June 23, 2014 example a fileserver makes files Singleton any object that is not available. included in a collection of objects of its type, usually because it is not Shotgun sequencing a strategy for similar to any of the objects under sequencing whole genomes consideration. pioneered by the company Celera. Genomes are cut into very small Small nuclear (snRNAs) pieces, cloned into plasmids, example of a noncoding RNA. sequenced, and then assembled into whole chromosomes or genomes. Small nucleolar RNAs (snoRNA) This method is faster than example of a noncoding RNA. hierarchical shotgun sequencing but is more prone to assembly errors. SMTP simple mail transfer protocol, defines the manner by which e-mail Signal or signal sequence is transferred between computers on hydrophobic in nature, the first 20 the Internet. amino acids of proteins synthesized that pause until the SNP, single nucleotide docks with the rough polymorphism DNA sequence endoplasmic reticulum. variations that occurs when a single nucleotide in the genome sequence Signal transduction conveyance of is altered. information from the outside to the inside of the cell. When a ligand Splice site junction intron/exon binds to its receptor, the information boundaries determined by is conveyed to the rest of the cell transcriptome analysis of genes. through a complex pathway of signal tranasduction that involves second Stochastic defines a program that messengers. uses chance to compute its results, common in genetic algorithms. Similarity percent of similar amino acids in the alignment of two Submission the data you send to a sequences. Two amino acids are server. similar if they have similar physicochemical properties. Substitution matrix matrix that contains a numerical score Single nucleotide associates with the cost or reward of polymorphisms/SNPs very similar each possible or to point mutations except SNPs are conservation. Unlikely mutations are considered to represent the genetic penalized with a negative score. variation present in wild type genotypes. By definition, SNPs SWISS-PROT one of the most differ from the reference sequence of extensive annotated protein a species. databases available.

13 FR3 Bioinformatics Primer v1 June 23, 2014

Synteny multiple generic loci from Transcriptome coined to describe different species located on a the complete RNA content of a chromosomal region of common cell/tissue/organism and often evolutionary ancestry. measured by DNA microarrays or RNAseq. Synthetic biology the use of engineering principles to construct Transmembrane domain the small biological devises by portion of a protein that spans a assembling new DNA circuits inside phospholipid bilayer, typically about cells. 20 amino acids long and predominately hydrophobic. /systems approach coined to denote the new Transposons sometimes referred to perspective for research in the as ‘jumping genes’, segments of postgenomic era. Systems biology DNA that can move from one place studies whole in a genome to another. cells/tissues/organisms not by a traditional reductionist approach but TrEMBL translated EMBL, European by holistic means in a reiterative equivalent to NR. attempt to model the complete cell/tissue/organism. t-test statistical method for determining whether a mean or the tBLASTx compares the six-frame difference between two means is translations of a nucleotide query significantly different than the sequence against the six-frame hypothesized value. translations of a nucleotide sequence database. URL, uniform resource locator, a standardized way of representing TCP/IP transmission control different documents, media, and protocol/internet protocol, network services on the World Wide The software communication Web. It gives every document of the protocol of the Internet. One Internet its own unique address. computer communicates with URLs are most commonly another computer through the associated with Web sites but also Internet using TCP/IP format. apply to e-mail servers, Gopher servers and any other computer with The Institute for Genomic an Internet address. Research (TIGR) Maryland-based genomics and proteomics nonprofit (UTR) portion organization that produced public of mRNA that is not translated’ found domain genome sequences and on 5’ and 3’ ends of mRNA. analysis software. Upstream a relative direction for Transcript a sequence of RNA nucleic acids often used to describe produced by transcription. the location of a promoter relative to the start transcription site. For

14 FR3 Bioinformatics Primer v1 June 23, 2014 example, the start codon is upstream information and resources on the of the . Internet.

Venn diagram a diagram Yeast artificial chromosome (YAC) representing mathematical or logical a cloning vector that replicates in sets pictorially as circles or closed yeast and can contain inserts about curves within an enclosing rectangle 1 Mb in size. (the universal set), common elements of the sets being represented by the areas of overlap among the circles.

WAIS, wide area information server, a service that allows users to intelligently search for information among databases distributed throughout the Internet.

WAN, wide area network, a group of geographically separated computers connected via dedicated lines or satellite links.

Webmaster the administrator responsible for the management and often design of a World Wide Web site.

Whole genome shotgun (WGS) a genome sequencing strategy that skips the mapping stage and uses computers to reassemble huge numbers of independent sequencing runs that were generated from many random fragments from a genome.

Wild-type (wt), an allele, genotype or phenotype that is considered to be the standard for a given strain or species. Wild type alleles encode functional proteins and produce typical phenotypes.

WWW, World Wide Web, the initiative to create a universal hypermedia-based method to access

15