

EMBO COURSE

Practical Course on Genetic and Molecular Analysis of Arabidopsis

Module 3

Genome analysis and in silico functional predictions

Barbet J.C., Chiapello H., Cooke R., Lecharny A., Ollivier E. and Rouzé P.

3 - 1 Genome analysis and predictions

3.1 Introduction

3.2 Sequence databases

3.3 Arabidopsis resources

3.4 How to access and search the biological data

3.5 Sequence similarity searches

3.6 Rapidity, selectivity, specificity and exhaustivity

3.7 How to find genes in a genomic sequence and what are the annotations?

3.8 What may be inferred about an uncharacterized protein?

3.9 A new database, Indigo

3.10 Readings

3.11 Bibliography


3.1 Introduction

The size of the nuclear genome of Arabidopsis is estimated to be 100-140 Mb, organized in 5 chromosomes. http://www.cbs.dtu.dk/databases/DOGS/index.html Started in the early 1990s as a worldwide coordinated effort (1), the sequencing of this model genome is progressing very rapidly. At the time of this course, May 1999, more than 50% of the genome will be available in public databases. Progress of the sequencing project can be monitored through the Arabidopsis thaliana database: http://genome-www3.stanford.edu/cgi-bin/Webdriver?MIval=atdb_agi_total or the MIPS (Munich Information Center for Protein Sequences, Max-Planck-Institut für Biochemie, Martinsried, Germany) WWW server: http://www.mips.biochem.mpg.de/proj/thal/ The complete sequence is announced for the year 2001 by the steering committee of the Arabidopsis Genome Initiative. The complete mitochondrial DNA sequence (57 identified genes in 366,924 nucleotides) is also available (2): http://megasun.bch.umontreal.ca/gobase/gobase.html

This places Arabidopsis workers in a privileged but uncomfortable situation. It is a great advantage to know so much of the sequence of the organism you are working on, but the daily release of raw sequences by a number of sequencing facilities makes it extremely difficult for unskilled or unassisted biologists to keep up with the available data.

Nowadays, bioinformatics, in the instrumental meaning of analysis and interpretation of sequence data, should play a key role in the design of most if not all biological experiments. Thus, for a biologist, bioinformatics is no longer an optional subject but needs to be part of fundamental training. The aim of this part of the course is to give insight into the biocomputing tools which are necessary in the many studies where a sequence is a step in the protocol. This is evident in (ordered DNA libraries, SAGE) or approaches but it is also essential in a number of methods like screening of T-DNA mutants, in silico cDNA cloning, walking, specific PCR amplifications etc. The use of such tools is of great help in function identification prior to experimental verification.

A number of Web sites in France and neighbouring countries offer biocomputing tools and links to servers all around the world:

GIS-INFOBIOGEN (INFOrmatics for BIOmolecules and GENomes), a national academic centre, Villejuif, France: http://www.infobiogen.fr/services/deambulum/fr/

The Pasteur Institute, Paris, France: http://www.pasteur.fr/outils-uk.html

The “Pôle Bio-Informatique Lyonnais”, Lyon, France, run by the Laboratory of Biometry and Evolutionary Biology and the Institute of Biology and Chemistry of Proteins: http://pbil.univ-lyon1.fr/pbil.html

The “Atelier Bioinformatique de Marseille”, France, provides a comprehensive and up-to-date list of sites offering a wide variety of biocomputing tools: http://www-biol.univ-mrs.fr/english/logligne.html/

The “Institut de Génétique et de Biologie Moléculaire et Cellulaire de Strasbourg”, France, gives access to SRS, a Sequence Retrieval System on the World Wide Web. The following programmes can also be obtained there:

SeqCleaner, a programme that removes sequence ends containing too many N’s or vector sequences.

DBWatcher, a programme handling periodic BLAST searches and reporting novel similarities.

ClustalX, a windows interface for the ClustalW multiple alignment programme (3).

http://www-igbmc.u-strasbg.fr/PagedeGarde.html

A clickable map and list give access to the WWW servers of the EMBnet members. EMBnet, the European Molecular Biology network, is a science-based group of 26 collaborating nodes throughout Europe that provide data and software accessibility to the European molecular biology community. http://biomaster.uio.no/embnet-www.html?86,29 http://bigben.vub.ac.be/embnet.news/vol5_2/nodelist.html

The Department of , Gent, Belgium: http://www.plantgenetics.rug.ac.be/

The ExPASy server of the Swiss Institute of Bioinformatics, Genève, Switzerland. Protein sequences and structures. http://expasy.hcuge.ch/

The Pedro server provides a collection of WWW links to information and services useful to molecular biologists. http://www.fmi.ch/biology/research_tools.html http://www.biophys.uni-duesseldorf.de/bionet/research_tools.html

As in wet-lab procedures, to be efficient, we have to make the best choice at each step of an analysis and this requires both an overview of the possibilities and a basic understanding of the method being applied. Most sites proposing on-line analysis have built-in, hypertext-linked help pages which provide information on the programme(s), databases available and the parameters which can be used in analysis. Due to the shortage of time, only the analysis of the primary structure, i.e. the sequence, will be considered even if it is obvious that, when possible, sequence analysis should proceed further by gaining information on the structures at the 3D level.

Below we present short introductions to what will be discussed and experimented. The aim is to introduce various terms and notions frequently used in biocomputing. Deliberately, only the tools and resources freely available through the INTERNET are considered.

3.2 Sequence databases

Every year, the first issue (January 1999, 27-1) of Nucleic Acids Research is especially dedicated to biological databases.

DNA: All publicly available DNA sequences (>3,000,000) are collected in three sequence databases, DDBJ, EMBL and GenBank, which exchange data on a daily basis. DNA sequences come primarily from direct submission of sequence data from individual laboratories and large-scale sequencing projects. From France these three databases may be efficiently searched at INFOBIOGEN. http://www.infobiogen.fr

EMBL: European Bioinformatics Institute, Hinxton, UK. http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/nar/gkc070_gml.html

GenBank: National Center for Biotechnology Information, Bethesda, MD, USA. http://www.ncbi.nlm.nih.gov/

DDBJ: DNA DataBank of Japan, Mishima, Japan. http://www.ddbj.nig.ac.jp/

Proteins: SWISS-PROT (4): http://www.expasy.ch/ SWISS-PROT is a curated protein sequence database which provides a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.).

TREMBL (4) (TRanslated from EMBL) and GenPept: CDS features from the EMBL and GenBank nucleotide sequence databases, provided as translated peptide sequences. Useful when these sequences are not yet integrated in SWISS-PROT.

The Protein Information Resource (PIR): http://www.mips.biochem.mpg.de PIR International is a collaboration established in 1988 between the National Biomedical Research Foundation (NBRF), the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID). It is a curated database divided into three sections depending on the degree of verification of the sequences (from unverified to fully classified). Sequences are organized according to structural, functional and evolutionary relationships.

3.3 Arabidopsis resources.

The principal Arabidopsis thaliana resources are available from or linked to the AtDB at Stanford University School of Medicine, Stanford, CA, USA. http://genome-www.stanford.edu/Arabidopsis/ AtDB collects all the Arabidopsis sequences and also provides a number of tools for sequence analysis. AtDB also includes interactive physical and genetic maps, the latest AGI sequencing information, colleague details, literature, clone and locus data and important information relevant to the Arabidopsis community. AtDB has recently released a new version of the Unified Display of Physical Maps. http://genome-www.stanford.edu/Arabidopsis/maps.html The display contains BAC and YAC tiling paths and a number of physical maps from many research groups. It is very useful for map-based cloning.


In general databases, about 40,000 Arabidopsis ESTs (expressed sequence tags, 250-500 bases) are available in dbEST (5). An EST is obtained by single-pass sequencing of a cDNA (5). It is a rapid means of obtaining information on coding sequences. http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html TIGR has assembled Arabidopsis ESTs with Arabidopsis transcripts into Tentative Consensus (TC) sequences and provides the results as a service to the community. http://www.tigr.org/tdb/agi/

These partial cDNA sequences, which may contain errors (up to 5%), allow the detection of similarities with known sequences in databases and the design of gene-specific probes to be used in expression studies and in mapping experiments. They are also an easy way to obtain a (full-length or partial) cDNA. Once you have found the reference (either clone name or accession number) of a cDNA tagged by an EST, you may order the cDNA clone from the Arabidopsis Biological Resource Centre (ABRC) at Ohio State University, USA. http://aims.cps.msu.edu/aims Another Biological Resource Centre for Arabidopsis is the Nottingham Arabidopsis Stock Centre (NASC). http://nasc.nott.ac.uk/ Resources available at NASC include mutants and T-DNA lines. NASC is also the curator of the “Lister and Dean” Recombinant Inbred map.

BAC-end sequences are the sequences of the extremities (< 1 kb) of Bacterial Artificial Chromosomes. They can be consulted at several sites, including:

TIGR, Rockville, MD, USA http://www.tigr.org/tdb/at/abe/bac_end_search.html

the CNS, Paris, France http://www.genoscope.cns.fr/

An increasing number of databases are dedicated to a given gene family, with a classification by species. For instance:

Secretory peroxidase genes in A. thaliana http://biobase.dk/~welinder/prx.html

Cytochrome P450 family in A. thaliana (>350 genes) http://drnelson.utmem.edu/Arablinks.html

3.4 How to access and search the biological data.

In databases, a sequence entry consists of the sequence itself and a number of items of information either describing the sequence (features) or associated with it (references, comments, etc.). A short comparison of the field organization of typical entries from EMBL and GenBank is given further below. Note that some parts of an entry may be repeated a number of times.

There are two kinds of searches: text-based and sequence-based. In a text-based search, we use words that we hope were used in the description of the sequence. Three retrieval systems (SRS, Entrez and ACNUC) allow sequences to be selected on many criteria and then extracted. The three logical operators OR, AND and BUTNOT can be used to combine search words in an index search. http://www.infobiogen.fr/srs/ http://www.ncbi.nlm.nih.gov/Entrez/index.html http://pbil.univ-lyon1.fr/databases/acnuc.html
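The way the three logical operators combine index terms can be sketched with ordinary set operations. The small index below and its entry identifiers are invented for illustration; real retrieval systems have their own query syntax and index structure.

```python
# Toy inverted index: keyword -> set of entry identifiers (all hypothetical).
index = {
    "arabidopsis": {"E1", "E2", "E3", "E4"},
    "peroxidase": {"E2", "E3", "E5"},
    "mitochondrial": {"E3", "E6"},
}

def lookup(word):
    """Return the set of entries indexed under a word (empty set if absent)."""
    return index.get(word.lower(), set())

# AND keeps entries found under both words (set intersection).
both = lookup("Arabidopsis") & lookup("peroxidase")

# OR keeps entries found under either word (set union).
either = lookup("peroxidase") | lookup("mitochondrial")

# BUTNOT removes the entries found under the second word (set difference).
without = both - lookup("mitochondrial")

print(sorted(both), sorted(either), sorted(without))
```

Combining a broad term with AND and then excluding unwanted entries with BUTNOT is usually the quickest way to narrow a text-based search.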

As an increasing percentage of the Arabidopsis entries in databases consists of large sequences registered by the sequencing facilities involved in the AGI, very often your query nucleotide sequence will be similar to only a small part of the subject sequence. For example, Arabidopsis BAC or P1 sequences are around 100 kb long, enough to contain about 20 genes. The features, which contain the positions of each of the (predicted) genes, help you to locate the region of interest in large annotated sequences.

Example: http://www.infobiogen.fr/srs/

start new SRS session
select embl and GenBank
continue
select AccNumber Y11187
Do Query
>EMBL:AT23KB
>GenBank:AT23KB

The features table contains information about potential gene products, regions of biological significance and cross-references to other data collections. http://www.ebi.ac.uk/ebi_docs/embl_db/ft/feature_table.html http://www.ncbi.nlm.nih.gov/collab/FT/index.html

Data in the feature tables are of very uneven quality since they are produced by different methods (from computer predictions and annotations to experimental evidence) and provided by different authors (6). Expert-curated databases exist (SWISS-PROT) but they are not the rule.
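As an illustration of how feature positions help to locate a region of interest, the sketch below lists the annotated features that overlap the region matched by a query. The `Feature` type, the gene names and the coordinates are hypothetical, and simple 1-based inclusive intervals are assumed (no `join(...)` locations).

```python
from typing import List, NamedTuple

class Feature(NamedTuple):
    key: str    # feature key, e.g. "CDS"
    gene: str   # hypothetical gene label
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

def features_overlapping(features: List[Feature], hit_start: int, hit_end: int) -> List[Feature]:
    """Return the features whose interval overlaps [hit_start, hit_end]."""
    return [f for f in features if f.start <= hit_end and f.end >= hit_start]

# Invented annotation of a large clone and an invented hit region.
features = [
    Feature("CDS", "geneA", 1200, 3400),
    Feature("CDS", "geneB", 5100, 7800),
    Feature("CDS", "geneC", 9000, 12500),
]
hits = features_overlapping(features, 7000, 9500)
print([f.gene for f in hits])
```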


GenBank       EMBL     Example content

LOCUS         ID+DT    ATXXXXDNA nnnn bp DNA PLN 01-01-2000
DEFINITION    DE       A.thaliana XXXX gene
ACCESSION     AC       Annnnn
KEYWORDS      KW       XXXX gene
SOURCE                 thale cress
ORGANISM      OS+OC    Arabidopsis thaliana
                       Eukaryotae; mitochondrial eukaryotes…..Arabidopsis
REFERENCE     RN+RP    1 (bases 1 to nnnn)
AUTHORS                Dupont et al.
TITLE                  Regulation of …..
JOURNAL       RL       Unpublished
COMMENT                NCBI gi: nnnnnn
FEATURES      FH       key Location/Qualifiers
SOURCE        FT       source 1..xxxx
                       /organism="Arabidopsis thaliana"
                       /strain="Columbia"
                       /map="4-xx"
CDS           FT       CDS join(xxxx..xxxx, …….)
                       /translation="MAPTETTGS…….
EXON          FT       exon xxxx….xxxx
                       /gene="XXXX"
                       /number=1
INTRON        FT       intron xxxx….xxxx
                       /number=1
BASE COUNT             1521 a 939 c 1049 g 1946 t
ORIGIN        SQ       1 gtcaagtggtaaccggtcaacgtagccat…………………..
                       2 ……………………………………….…………..
                       3 …………………………………………………...
//


In a sequence-based search, we use a nucleotide or a protein sequence (the query) to search a sequence database. Because the descriptions of sequences are not standardized, and because efficient links have been established between a number of databases, a sequence-based search is, when possible, often the best way to collect the desired information. The query sequence may be given either as the accession number it has received in the EMBL/GenBank/DDBJ databanks or as the sequence itself in a defined format. Two formats are widely used:

the FASTA format,

>sequence_name or anything you want
ATGCATGCATGCATGCATGCATGCATGCATGCA…..
………….

and the plain (raw) text format,

ATGCATGCATGCATGCATGCATGCATGCATGCA….. ………….
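The difference between the two formats is easy to see in a small reader. The function below is a minimal sketch, not a full parser: it accepts FASTA text (one or more records, each introduced by a `>` line) and falls back to treating input without any `>` line as a single raw sequence. The function name is an assumption made for this example.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA or raw text.

    Raw text (no '>' line) yields a single record with an empty header.
    Sequence lines are joined and upper-cased; blank lines are ignored.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    elif chunks:
        records.append(("", "".join(chunks)))
    return records

print(parse_fasta(">seq1 test\nATGC\natgc\n"))
```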

As a first step in sequence analysis it is best to use a programme performing local alignment, which will only look for similar blocks of sequence between your query and data base sequences (subject sequences) rather than attempting to align both sequences from end to end (global alignment).
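The idea of local alignment can be illustrated with a minimal sketch of the classical Smith-Waterman dynamic programming scheme, which is exhaustive and much slower than heuristic database search programmes. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for the example. Only the best local score is returned, not the alignment itself.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, linear gaps)."""
    cols = len(b) + 1
    prev = [0] * cols  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: this is what makes the alignment local.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

# The shared block CGTACGT is found despite unrelated flanking sequence.
print(local_alignment_score("AAAACGTACGTAAAA", "GGGGCGTACGTGGGG"))
```

A global alignment would be penalized by the unrelated flanks, whereas the local score depends only on the similar block.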

3.5 Sequence similarity searches.

For local alignment, the BLAST (Basic Local Alignment Search Tool, ref 7) suite of programmes has become a worldwide reference and is almost always used to begin sequence comparison. http://www.ncbi.nlm.nih.gov/BLAST/ http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html BLAST searches for locally maximal alignments using a scoring method described in the BLAST “help”. http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html

The five BLAST programs perform the following tasks:

blastp compares an amino acid query sequence against a protein sequence database;

blastn compares a nucleotide query sequence against a nucleotide sequence database;

blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database;

tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands);

tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
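The six-frame conceptual translation used by blastx, tblastn and tblastx can be sketched as follows. The standard genetic code is assumed, codons containing ambiguous bases are rendered as X, and the frame labels (+1 to +3, -1 to -3) are a convention chosen for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons from position 0; unknown codons become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Return the six conceptual translations keyed by frame label."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames

for label, protein in six_frames("ATGGCCTAA").items():
    print(label, protein)
```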

The result of a sequence-based search is (by default) ordered by the level of similarity between your query sequence and a number of sequences of the searched databases (subject sequences). Depending on the BLAST version you use, the statistical significance of the alignment between a query sequence and a subject sequence is indicated by:

- the HS or High Score: the score of the best alignment. By itself the HS is not useful for evaluating an alignment unless the specific scoring matrix employed is also given.

- the P(N) value (old, ungapped BLAST 1.4.11): the statistical significance based on all the N alignments, with a HS above a given value, between the query and a subject. P(N) is given in scientific notation, thus 3.5e-25 = 3.5 × 10^-25. P(N) varies from 0 to 1. Lower P(N) values correspond to a higher statistical significance of the similarity between the two aligned sequences.

- the BLAST E value (new, gapped BLAST 2.0): an E value of 1 assigned to a hit between a query sequence and a subject sequence can be interpreted as meaning that, in the searched database, one might expect to see 1 match with a similar score simply by chance.

Both P(N) and the E value depend on the scoring matrix, the length of the query sequence and the total length of the database. The size of the database searched is indicated in the introductory BLAST output. Always remember that P(N) and E values are nothing but statistics. The biological significance of an alignment cannot simply be derived from these values; it needs validation.
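The dependence of the E value on query length and database size can be made concrete with the usual Karlin-Altschul form E = K·m·n·exp(-λS), together with the relation P = 1 - exp(-E) between the expected number of chance hits and the probability of at least one. The sketch below is illustrative only: the values of K and λ depend on the scoring system and are arbitrary assumptions here, and real programmes apply further corrections (for example to the effective lengths).

```python
import math

def expected_hits(score, query_len, db_len, k, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)

def p_from_e(e_value):
    """Probability of at least one chance hit, given an expected count E."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), accurate for small E

# Illustrative parameters only (K and lambda are assumptions).
small_db = expected_hits(score=50, query_len=100, db_len=2_000_000, k=0.1, lam=0.3)
large_db = expected_hits(score=50, query_len=100, db_len=4_000_000, k=0.1, lam=0.3)
print(small_db, large_db, p_from_e(small_db))
```

Doubling the database doubles the E value of the same alignment, which is one reason why searching a restricted, relevant subdivision can make a weak similarity stand out. For small values, E and P are practically identical.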

More specialized forms of BLAST are:

BLAST-PSI (Position Specific Iterated)(8): Proteins only.

query sequence -> gapped BLAST search -> position specific score matrix built from the previous alignments -> new database search with this matrix (iteration)
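What a position specific score matrix contains can be shown with a deliberately simplified sketch: log-odds scores computed column by column from a set of aligned, ungapped segments. A uniform background and a flat pseudocount are assumed here. This is not the sequence weighting or pseudocount scheme actually used by the programme, only an illustration of the principle.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Build a list of {residue: log2 odds} dictionaries, one per column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to equal length")
    background = 1.0 / len(ALPHABET)  # uniform background (assumption)
    total = len(aligned) + pseudocount * len(ALPHABET)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm

def score(pssm, segment):
    """Sum the column scores of a segment of the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

pssm = build_pssm(["MKV", "MRV", "MKI"])
print(round(score(pssm, "MKV"), 2), round(score(pssm, "WWW"), 2))
```

Residues that are conserved in a column score positively there, so the matrix rewards a subject sequence for matching the family rather than the single query.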

BLAST-PHI (Pattern Hit Initiated)(9): Proteins only.

Search for protein motifs and for similarity in the vicinity of this motif in subject sequences. It is an efficient method to find homologous proteins since it is linked to BLAST-PSI. The motif sequence should be in PROSITE format: http://expasy.hcuge.ch/sprot/prosite.html The motifs may come from the PROSITE database or from the BLOCKS database. http://www.blocks.fhcrc.org It may also be built from your own family of sequences with BLOCKS MAKER.

http://www.blocks.fhcrc.org
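To give an idea of what a motif in PROSITE format expresses, the sketch below converts a pattern into a Python regular expression. It covers only the common elements (`x`, `[...]`, `{...}`, repetition counts, and the `<` and `>` anchors at the ends of the pattern); rarer constructs are not handled, and the function name is an assumption of this example.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a regular expression string."""
    pattern = pattern.strip().rstrip(".")
    anchored_start = pattern.startswith("<")
    anchored_end = pattern.endswith(">")
    pattern = pattern.strip("<>")
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            token = "."                       # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            token = core                      # single residue or [alternatives]
        if low:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return ("^" if anchored_start else "") + "".join(parts) + ("$" if anchored_end else "")

# N-glycosylation site pattern, searched in an invented peptide.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex, re.search(regex, "AANGSAA").group())
```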

BLAST2sequences: to compare 2 sequences only.

3.6 Rapidity, selectivity, specificity and exhaustivity.


The rapidity, sensitivity and specificity of a sequence-based search are all markedly improved if coding DNA is conceptually translated to protein before performing the search and if regions of low complexity are not considered (see the Filter item in the BLAST help). A similar gain in the quality of the result of a search may also be obtained by restricting the database size. The rapid increase in the number of entries in sequence databases, as well as the submission of large amounts of lower-quality, single-pass or unfinished sequence, has led to the creation of independent subdivisions of the data. ESTs are in the separate dbEST, whereas high throughput genome sequences are initially presented in the HTG(S) or GSS (Genome Survey Sequences) sections. In addition, HTGS sequences are divided into three subsections:

HTGS_PHASE1: Sequence consists of an unordered set of sequence pieces (typically 7-20), unoriented, unannotated and containing gaps.

HTGS_PHASE2: Sequence consists of sequence pieces (typically 2 or 3) for which order and orientation have been established, while gaps remain.

HTGS_PHASE3: Sequence is considered to be completed and might contain (some) annotation.

GenBank divisions: These change relatively frequently and currently include non-redundant sequences, ESTs, HTGS, GSS and, more recently, organism-specific searches (by name, group, etc.).

NCBI-What’s New 10/19/98 http://www.ncbi.nlm.nih.gov/Web/Whats_New/index.html Organism-specific BLAST is now available at the NCBI. Users may limit their BLAST search to a specific organism selected from a pull-down menu of common organisms or by entering an organism name (genus species) or a taxonomic group (e.g., "Eukaryota"). Arabidopsis thaliana lineage (short): Eukaryota; Magnoliophyta (flowering plants); Eudicotyledons; Rosidae; Capparales; Brassicaceae; Arabidopsis.

Another subdivision which it may be wise to consider is the month section (all new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days).

EMBL divisions: The EMBL database has always been divided into subsections (e.g. BAC, mammals MAM, plants PLN, etc.) and new ones have been added, such as the EST or HTG subsections.

nrDB: The non-redundant (nr) databases built by some sequence data resource centres contain all classically-sequenced or high quality systematic sequencing data from GenBank+EMBL+DDBJ+PDB after suppression of redundant entries, but generally without the EST, STS (Sequence Tagged Site), GSS or HTGS sequences. These latter sections contain a large number of sequences (37,000 A. thaliana sequences in dbEST and 29,000 in GSS in February 1999). This means that, for an exhaustive analysis, separate alignments must be carried out against all the relevant sections since, depending on the server you use, the EST, BAC-end sequence and HTG(S) databases may need separate searches.
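The suppression of redundant entries in a non-redundant database can be pictured with the sketch below, which merges entries whose sequences are strictly identical and keeps all their identifiers. Exact identity is an assumption made for simplicity; the criteria actually applied by a given centre may differ. The identifiers are invented.

```python
def non_redundant(records):
    """Merge (identifier, sequence) records that share an identical sequence.

    Returns a list of (identifiers, sequence) pairs in order of first appearance.
    """
    merged = {}
    for identifier, sequence in records:
        merged.setdefault(sequence.upper(), []).append(identifier)
    return [(identifiers, sequence) for sequence, identifiers in merged.items()]

entries = [("embl|X1", "ATGGCC"), ("gb|Y2", "atggcc"), ("dbj|Z3", "TTGACA")]
print(non_redundant(entries))
```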

3.7 How to find genes in a genomic sequence and what are the annotations?

Finding genes in a genomic sequence from a higher organism is not trivial. It is a process related to a number of biological issues such as the mechanism of splicing, biased codon usage and homology. http://www.cbs.dtu.dk/services/NetGene2/ http://genemark.biology.gatech.edu/genemark/

Annotations are rapid and automatic functional assignments. They are possible only if similar sequences with known function exist, which is the case for around 50% of the predicted genes in A. thaliana. Annotations based on similarity searches should be considered cautiously, since a significant fraction of those in databases are wrong. Annotations and functional data that relate to a sequence are stored in the features table of the sequence.

3.8 What may be inferred about an uncharacterized protein?

From similarity to homology and to function. Much can be inferred about an uncharacterized protein when significant sequence similarity is detected with a well studied protein. Protein families are a powerful tool in functional predictions. Clustering of proteins by (super)family is the object of the PIR-PROTFAM database: http://www.mips.biochem.mpg.de Each family groups homeomorphic proteins with 50% sequence identity. More than 10,000 families are recognized. Superfamilies (~2,000) cluster sequences with 30% identity. When a sequence similarity is observed in a search against the uncurated nr databases, always verify that the function of the subject sequence has been established experimentally and not only by similarity (error propagation). Do not rely directly on the highest score in a sequence comparison: biological significance may differ from statistical significance. It should also be borne in mind that functions may diverge as sequences diverge, and thus the biological role may be altered even when the biochemical function is retained.
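The identity levels quoted above can be turned into a small worked example. The sketch computes the percent identity of two already aligned sequences and compares it with the 50% and 30% levels. Two assumptions are made: the levels are read as lower bounds, and identity is computed over the aligned positions where neither sequence has a gap (other denominators are in use).

```python
def percent_identity(a, b):
    """Percent identity over aligned columns that contain no gap ('-')."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def grouping_level(identity):
    """Map an identity percentage to the grouping levels quoted in the text."""
    if identity >= 50.0:
        return "family"
    if identity >= 30.0:
        return "superfamily"
    return "unassigned"

identity = percent_identity("MKVLAAGIT", "MKILSAG-T")
print(identity, grouping_level(identity))
```

Such a classification is only a first indication: as stressed above, a high identity to an entry whose function was itself assigned by similarity proves nothing.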

Homologs: genes descending with modifications from a common ancestral gene.

Orthologs: homologs with function conserved in different species. Generated by speciation events.

Paralogs: homologs with different functions in the same species. Generated by duplication events and divergence.

Convergence: sequence similarities that have arisen without a common evolutionary history. Convergence is generally suspected only in small regions or domains of genes.

Proteins are very often made of more than one recognizable domain. When the function of a protein cannot be found through an overall similarity with a well known protein, we may proceed domain by domain, looking for conserved, diagnostic “signatures” which are found in particular proteins or protein families.

PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. http://www.expasy.ch/

BLOCK Searcher compares a protein or a DNA sequence to a database of protein blocks. Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. http://www.blocks.fhcrc.org/

In general, homologues share more than one Block and the distances between two Blocks are frequently conserved. BLOCK Searcher output takes all these data into account and delivers fingerprints, i.e. groups of conserved motifs used to characterize a protein family. The Blocks may be retrieved with “Get Blocks”. You may create your own blocks from a set of homologous sequences with “Block Maker”.

PRINTS is another compendium of protein fingerprints. http://www.biochem.ucl.ac.uk/bsm/dbbrowser/

PRODOM : The ProDom protein domain database consists of an automatic compilation of homologous domains. ProDom was built using recursive PSI-BLAST searches. http://protein.toulouse.inra.fr/prodom.html

Some WWW sites offer tools that help in the prediction of signal peptides or transmembrane segments:

ChloroP, for chloroplast transit peptides and their cleavage sites in plant proteins.

SignalP, for signal peptides and their cleavage sites.

TMHMM, for transmembrane helices.

http://www.cbs.dtu.dk/services/

TMpred, for prediction of membrane-spanning regions and their orientation. http://www.isrec.isb-sib.ch/software/TMPRED_form.html

PSORT, for the prediction of protein localization sites in cells. http://psort.nibb.ac.jp:8800/

3.9 A new database, Indigo.

The database Indigo (10) is accessible at http://indigo.genetique.uvsq.fr The concept used for organising the data is that of neighbourhood. The main idea underlying this work is that the biological objects that make a cell alive cannot be isolated from each other: biology must be described more as a science of relationships between objects than as a science describing objects. Knowledge of whole genome sequences is a unique opportunity to study the relationships between genes and gene products. In most cases, we do not know what relationships are involved, but we know that they exist. To study them, we investigated the concept of neighbourhood in order to organise the disparate knowledge we have on a particular genome. This concept is very wide. Because we study the genomic text, we chose genes as the core items. For a given gene, we constructed lists of neighbours based on links of several possible categories. The first and most intuitive relationship between two genes is their proximity on a chromosome. A second possibility, often used in classical studies, is to relate genes or gene products because they have evolved from a common ancestor. We can also consider that genes coding for proteins involved in the same metabolic pathways or using the same substrate are related. These constitute the metabolic neighbourhoods. More complex relationships have been described, such as relationships based on the use of the genetic code or on common presence in bibliographical references: two genes can be related because they use synonymous codons with the same frequency; they can also be linked because they are cited in the same bibliographical source. Indigo creates an interactive environment in which the knowledge about gene neighbours can be retrieved and exploited for model organisms (at present: E. coli and B. subtilis, and a preliminary compendium of A. thaliana genes).
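The neighbourhood based on the use of the genetic code can be illustrated with a sketch that compares the codon frequencies of two coding sequences. The Euclidean distance used here is an arbitrary choice made for the example, not the measure implemented in Indigo, and the sequences are invented.

```python
import math
from collections import Counter

def codon_frequencies(cds):
    """Relative frequency of each codon in a coding sequence read in frame."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def codon_distance(cds_a, cds_b):
    """Euclidean distance between two codon frequency profiles (0 = identical)."""
    fa, fb = codon_frequencies(cds_a), codon_frequencies(cds_b)
    return math.sqrt(sum((fa.get(c, 0.0) - fb.get(c, 0.0)) ** 2 for c in set(fa) | set(fb)))

# Two invented genes encoding Met-Ala-Ala with different alanine codons.
print(codon_distance("ATGGCCGCC", "ATGGCTGCT"))
```

Genes at a small distance from a given gene would then be listed as its neighbours for this category of link.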

3.10 Readings.

Science, 1998 Oct 23, 282 (5389):651-688.

Trends guide to bioinformatics : Trends Supplement 1998 Genetwork, in various issues of Trends in Genetics.

Genome analysis: A laboratory manual. Vol 1 Chapter 7 Computational analysis and annotation of sequence data. Baxevanis A.D. et al. http://www.cshl.org/books/g_a/bk1ch7/

BIOSCI is a set of free electronic communication forums used by biological scientists worldwide. It contains a specialized section for Arabidopsis. http://www.bio.net/hypermail/ARABIDOPSIS/

3.11 Bibliography:

To load the following references and their abstracts directly into your own bibliography database, connect to the MEDLINE servers and use the MUI or PMID codes. http://www.infotrieve.com/freemedline/ http://www4.ncbi.nlm.nih.gov/PubMed/ References may also be searched at INIST by words or by shelf number. http://www.inist.fr

1) Bevan M. et al., 1997. Objective: the complete sequence of a plant genome. Plant Cell 9, 476-478.
Bevan M. et al., 1998. Analysis of 1.9 Mb of contiguous sequence from chromosome 4 of Arabidopsis thaliana. Nature 391, 485-488. (MUI: 98121113).
Bevan M. et al., 1999. Clearing a path through the jungle: progress in Arabidopsis genomics. Bioessays 21, 110-120.
Sato S. et al., 1997-98. Structural analysis of Arabidopsis thaliana chromosome 5. I (MUI: 97471969); II (MUI: 98069011); III (MUI: 98162728); IV (MUI: 98290546). DNA Res.

2) Unseld M. et al., 1997. The mitochondrial genome of Arabidopsis thaliana contains 57 genes in 366,924 nucleotides. Nat Genet 15:57-61. (MUI: 97141919; PMID:8988169) (shelf number:22883).

3) Thompson J.D., Gibson T.J., Plewniak F., Jeanmougin F. and Higgins D.G., 1997. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25, 4876-4882.

4) Bairoch A. and Apweiler R., 1998. The SWISS-PROT protein sequence data bank and its supplement TREMBL. Nucleic Acids Res. 25, 31-36. (MUI 98062389).

5) Cooke R. et al., 1996. Further progress towards a catalogue of all Arabidopsis genes: analysis of a set of 5,000 non-redundant ESTs. The Plant Journal 9, 101-124. (MUI: 96158348).

6) Rouzé P. et al., 1999. Genome annotation: which tools do we have for it? Curr. Opin. Plant Biol. 2, 90-95.

7) Altschul S.F. et al., 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410. (MUI 91039304). Altschul S.F. et al., 1994. Issues in searching molecular sequence databases. Nature Genetics 6, 119-129. (MUI 94214490).

8) Altschul S.F. et al., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402. (UI: 97402527).

9) Zhang et al.,1998. Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res. 26:3986-3990. (UI: 98371237).

10) Nitschké P. et al., 1998. Indigo: a World-Wide-Web review of genomes and gene functions. FEMS Microbiology Reviews 22, 207-227. (UI: 99079114).
