
Comparative Genomics
Comparative Gene Prediction in the Human Genome
Maribel Hernandez Rosales

What is Comparative Genomics?

Comparative genomics is the analysis and comparison of genomes from different species. The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and noncoding regions of the genome. Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse.

Genome researchers look at many different features when comparing genomes: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of noncoding DNA in each genome, and highly conserved regions maintained in organisms as simple as bacteria and as complex as humans. Comparative genomics involves the use of computer programs that can line up multiple genomes and look for regions of similarity among them.

What are the comparative genome sizes of humans and other organisms being studied?

| Organism | Estimated size (million bases) | Estimated gene number | Average gene density | Chromosome number |
|---|---|---|---|---|
| Homo sapiens (human) | 2,900 | ~30,000 | 1 gene per 100,000 bases | 46 |
| Rattus norvegicus (rat) | 2,750 | ~30,000 | 1 gene per 100,000 bases | 42 |
| Mus musculus (mouse) | 2,500 | ~30,000 | 1 gene per 100,000 bases | 40 |
| Drosophila melanogaster (fruit fly) | 180 | 13,600 | 1 gene per 9,000 bases | 8 |
| Arabidopsis thaliana (plant) | 125 | 25,500 | 1 gene per 4,000 bases | 5 |
| Caenorhabditis elegans (roundworm) | 97 | 19,100 | 1 gene per 5,000 bases | 6 |
| Saccharomyces cerevisiae (yeast) | 12 | 6,300 | 1 gene per 2,000 bases | 16 |
| Escherichia coli (bacteria) | 4.7 | 3,200 | 1 gene per 1,400 bases | 1 |
| H. influenzae (bacteria) | 1.8 | 1,700 | 1 gene per 1,000 bases | 1 |

Eukaryotic Gene Finding

Comparative Gene Prediction

- GenScan: ab initio gene prediction.
- GeneWise, Procrustes: homology guided.
- Rosetta, SGP-1 (Syntenic Gene Prediction), CEM (Conserved Exon Method): gene prediction and sequence alignment are clearly separated.
- GenomeScan: ab initio modified by BLAST homologies.
- SGP-2, TwinScan, SLAM, DoubleScan: modification of the GenScan scoring schema to incorporate similarity to known proteins.

GenScan

- A general probabilistic model for the gene structure of human genomic sequences.
- Gene identification by identifying complete exon/intron structures of genes in genomic DNA.
- Includes the capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent sets of genes occurring on either or both DNA strands.

Markov model of coding regions: predictions do not depend on the presence of a similar gene in the protein sequence databases, and they complement the information provided by homology-based gene identification methods (BLASTX).
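To make the idea of a Markov model of coding regions concrete, here is a minimal sketch. It is not GenScan's actual coding model, which is more elaborate. It assumes a first-order model with no reading-frame awareness, and the training sequences and function names (`train_markov`, `log_odds`) are hypothetical, chosen only for illustration. A candidate region is scored by the log-odds of a coding model against a noncoding model, so no protein database lookup is involved.

```python
import math

ALPHABET = "ACGT"

def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | previous base)."""
    # Start every transition at the pseudocount so no probability is zero.
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in seqs:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in ALPHABET}
    return probs

def log_odds(seq, coding, noncoding):
    """Sum of log2 ratios of coding vs. noncoding transition probabilities."""
    score = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        score += math.log2(coding[prev][nxt] / noncoding[prev][nxt])
    return score

# Toy training data (invented for this sketch): GC-rich "coding", AT-rich "noncoding".
coding_model = train_markov(["ATGGCCGCGGCCGGCGCC", "ATGGCGGCCGCGGGC"])
noncoding_model = train_markov(["ATATTTAATATAAATTTA", "TTTAAATATATTAAAT"])

gc_score = log_odds("GCCGCG", coding_model, noncoding_model)
at_score = log_odds("ATATTA", coding_model, noncoding_model)
print(gc_score > 0, at_score < 0)
```

A positive score means the region looks more like the coding training set than the noncoding one. A realistic model would be trained on annotated genes and would condition on codon position.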
Maximal Dependence Decomposition (MDD): a new statistical model of donor and acceptor splice sites that captures important dependencies between signal positions.

[Figure: pre-mRNA splicing (assembly of the spliceosome, catalysis). Labeled elements: 5' splice signal (U1 snRNP), branch signal (U2 snRNP), polyY tract (U2AF65), 3' splice signal (U2AF35), SR proteins, exonic and intronic enhancers and repressors; intron definition and exon definition.]

Hidden semi-Markov Model (HMM)

GenScan HMM states:

- N: intergenic region
- P: promoter
- F: 5' untranslated region
- Esngl: single exon (intronless) (translation start -> stop codon)
- Einit: initial exon (translation start -> donor splice site)
- Ek: phase k internal exon (acceptor splice site -> donor splice site)
- Eterm: terminal exon (acceptor splice site -> stop codon)
- Ik: phase k intron: 0, between codons; 1, after the first base of a codon; 2, after the second base of a codon

GenScan Features

- Models both strands at once
- Each state may output a string of symbols (according to some probability distribution)
- Explicit intron/exon length modeling
- Advanced splice site modeling
- Parameters learned from annotated genes
- Prediction of multiple genes in a sequence (partial or complete)

GenomeScan

We can enhance our gene prediction by using external information: DNA regions with homology to known proteins are more likely to be coding exons.
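The intuition that homology makes a candidate exon more likely to be coding can be sketched as a simple rescoring step. This is an illustrative toy, not GenomeScan's actual probabilistic formulation. The `Exon` class, the `bonus` parameter, and the additive scoring rule are all assumptions made for this example. Each candidate exon's ab initio score is increased in proportion to the fraction of its bases covered by protein-homology hits.

```python
from dataclasses import dataclass

@dataclass
class Exon:
    start: int    # first base, inclusive
    end: int      # last base, inclusive
    score: float  # hypothetical ab initio log-odds score

def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def covered_fraction(exon, hits):
    """Fraction of the exon's bases that fall inside any homology hit."""
    covered = 0
    for h_start, h_end in merge_intervals(hits):
        lo = max(exon.start, h_start)
        hi = min(exon.end, h_end)
        if lo <= hi:
            covered += hi - lo + 1
    return covered / (exon.end - exon.start + 1)

def rescore(exons, hits, bonus=2.0):
    """Add a bonus proportional to homology coverage (assumed scoring rule)."""
    return [Exon(e.start, e.end, e.score + bonus * covered_fraction(e, hits))
            for e in exons]

# Two candidate exons with equal ab initio scores; only the first overlaps hits.
candidates = [Exon(100, 199, 1.5), Exon(400, 499, 1.5)]
blast_hits = [(90, 149), (140, 210)]
rescored = rescore(candidates, blast_hits)
print([e.score for e in rescored])
```

Merging the hits first prevents overlapping hits from counting the same bases twice.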
- Combine probabilistic 'extrinsic' information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan).
- Focus on the 'typical case' when homologous but not identical proteins are available.

[Figure: ab initio prediction modified by BLAST homologies.]

GeneWise

- Motivation: use a good database of the protein world (PFAM) to help us annotate genomic DNA.
- The GeneWise algorithm aligns a profile HMM directly to the DNA.

GeneWise construction:

- Start with a PFAM domain HMM
- Replace amino acid (AA) emissions with codon emissions
- Allow for sequencing errors (deletions/insertions)
- Add a 3-state intron model

[Figure: GeneWise model.]

[Figure: GeneWise intron model: 5' site, central spacer, PY tract, 3' site.]

GeneWise Features & Problems

- "Best" alignment of DNA to protein domain
- Alignment gives exact exon-intron boundaries
- Parameters learned from species-specific statistics
- Only provides partial prediction, and only where the homology lies
- Does not find "more" genes
- Pseudogenes, retrotransposons picked up
- CPU intensive. Solution: pre-filter with BLAST

Rosetta

Gene prediction is separated from sequence alignment. First, an alignment is obtained between two homologous genomic sequences using the global alignment program GLASS. Then, gene structures (splice sites, exon number and length, etc.) are predicted that are compatible with this alignment, meaning that predicted exons fall in the aligned regions.

Syntenic Gene Prediction

This approach does not require the comparison of two homologous genomic sequences. A query sequence from a target genome is compared against a collection of sequences from a second (informant, reference) genome, and the results of the comparison are used to modify the scores of the exons produced by underlying ab initio gene prediction algorithms. Gene prediction and sequence alignment are separated.

SGP-2

[Figure: SGP-2 pipeline: query sequence, tblastx HSPs, HSP projections, geneid exons, SGP exons.]

Gene prediction programs predict a large number of genes

Almost every mouse gene has a human orthologue counterpart.

| Predictions | TwinScan | SGP |
|---|---|---|
| total | 48,462 | 47,055 |
| novel | 17,562 | 21,942 |
| multiexonic | 10,987 | 12,158 |
| long, no low complexity | 3,171 | 4,543 |

[Figure: overlap of the "human ts", "human sgp", and "intron aligned" sets, with counts 954, 2983, 317, 637, 1931, 1052.]

Orthologous human and mouse genes have conserved exonic structure.
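Conservation of exonic structure can be checked per orthologous pair by comparing exon count, exon lengths, and intron phases. The sketch below is a hypothetical illustration, not the procedure used to obtain the figures in this talk. It assumes each gene is given as a list of coding-exon intervals (inclusive coordinates, starting at the start codon), and the coordinates are invented. Intron phase follows the definition given for the GenScan intron states: the number of coding bases before the intron, modulo 3.

```python
def exon_lengths(cds_exons):
    """Length of each coding exon given inclusive (start, end) intervals."""
    return [end - start + 1 for start, end in cds_exons]

def intron_phases(cds_exons):
    """Phase of each intron: coding bases upstream of it, modulo 3.

    0 = between codons, 1 = after the first base of a codon,
    2 = after the second base of a codon.
    """
    phases = []
    total = 0
    for length in exon_lengths(cds_exons)[:-1]:
        total += length
        phases.append(total % 3)
    return phases

def compare_structures(gene_a, gene_b):
    """Compare exon count, per-exon lengths, and intron phases of two genes."""
    la, lb = exon_lengths(gene_a), exon_lengths(gene_b)
    same_count = len(la) == len(lb)
    return {
        "same_exon_count": same_count,
        "same_lengths": [x == y for x, y in zip(la, lb)] if same_count else None,
        "same_phases": (intron_phases(gene_a) == intron_phases(gene_b))
                       if same_count else None,
    }

# Invented coordinates for one orthologous pair (three coding exons each).
human_gene = [(1000, 1099), (2000, 2149), (3000, 3091)]
mouse_gene = [(500, 599), (1300, 1449), (2100, 2197)]
result = compare_structures(human_gene, mouse_gene)
print(result)
```

In this toy pair the exon count and intron phases agree while the last exon differs in length by six bases, so the reading frame is preserved.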
- 85% of the orthologous pairs have an identical number of exons
- 91% of the orthologous exons have identical length
- 99.5% of the orthologous exons have identical phase
- there are a few cases of intron insertion/deletion (22)

Summary

- Genes are complex structures which are difficult to predict with the required level of accuracy/confidence.
- Different approaches to gene finding improve the accuracy/confidence of the predictions:
  - Ab initio: GenScan
  - Ab initio modified by BLAST homologies: GenomeScan
  - Homology guided: GeneWise
  - Gene prediction and sequence alignment separately: Rosetta
  - Ab initio with similarity to known proteins: SGP-2

Thank you for your attention!