Identification of Bacillus subtilis RNA genes using Tiling Arrays

Cyprien Guérin, BaSysBio

To cite this version:

Cyprien Guérin, BaSysBio. Identification of Bacillus subtilis RNA genes using Tiling Arrays. Bioinformatique des ARN, Feb 2012, Toulouse, France. hal-02804688

HAL Id: hal-02804688 https://hal.inrae.fr/hal-02804688 Submitted on 5 Jun 2020

Identification of Bacillus subtilis RNA genes using Tiling Arrays

Cyprien GUÉRIN, BaSysBio Consortium

Summary

High-resolution transcriptome

Analysis of Tiling Array signals

Examples of new features discovered with TA

Promoter and terminator predictions

Perspectives

High-resolution transcriptome: Systematic exploration of the B. subtilis transcriptional landscape

New genes/features discovery in Bacillus subtilis.

Explore most of the bacterium's lifestyles:
1 wild-type strain (perhaps better called a prototype strain);
1 array design (BaSysBio tiling array, NimbleGen technology): strand-specific expression signal with a 22-bp step;
269 hybridizations sampling a maximum variety of lifestyles: 104 different biological conditions, most with 2-3 biological replicates (experiments).

Conditions include growth on various media (rich/poor, solid/liquid, aerobic/anaerobic), sporulation, germination, competence, and a variety of stresses (including ethanol, salt, temperature, oxidative).
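As a quick consistency check on the design figures above, the average number of hybridizations per condition can be computed from the two counts quoted on the slide. The variable names are illustrative only.

```python
# Counts quoted on the slide
hybridizations = 269
conditions = 104

# Average number of hybridizations per biological condition
mean_replicates = hybridizations / conditions
print(f"{mean_replicates:.2f} hybridizations per condition on average")
```

The result falls between 2 and 3, in line with the statement that most conditions have 2-3 biological replicates.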

High-resolution transcriptome: Tiling array

[Figure: tiling probe layout; labels: 22 bp, 22 bp.]

≈ 380,000 probes tiling the 4.2 Mbp Bacillus subtilis genome. Long probes (45-65 nt), with lengths adjusted to achieve relatively homogeneous affinity.
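The probe count can be roughly recovered from the genome size and the tiling step. The sketch below assumes that each strand is tiled independently with the same 22-bp step and uses the rounded 4.2 Mbp figure from the slide, so it is only an order-of-magnitude check.

```python
# Rounded figures from the slides; tiling both strands is an assumption
GENOME_LENGTH_BP = 4_200_000
STEP_BP = 22
STRANDS = 2

probes_per_strand = GENOME_LENGTH_BP // STEP_BP
total_probes = probes_per_strand * STRANDS
print(probes_per_strand, total_probes)
```

Under these assumptions the total comes out slightly above 380,000, consistent with the approximate number quoted above.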

Analysis of Tiling Array signals: Principles

Automatic detection of transcription units with a hidden Markov model (HMM) [1], taking into account:
normalization (with genomic DNA hybridizations): 1. probes are not isothermal, 2. the response is not linear, 3. outliers are discarded;
continuous variation of the signal.

[1] Nicolas P., et al. (2009). Bioinformatics.
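To illustrate the general idea of HMM-based segmentation, here is a minimal sketch that is not the model of [1]: a toy Viterbi decoder with two fixed expression levels, Gaussian noise and a small switching probability. The function name, the levels and every parameter value are hypothetical, and the sketch ignores drift, probe effects and outliers.

```python
import math

def viterbi_two_level(signal, levels=(8.0, 13.0), sigma=1.0, p_switch=0.01):
    """Return the most likely level index for each probe (toy model)."""
    n_states = len(levels)
    log_stay = math.log(1.0 - p_switch)
    log_move = math.log(p_switch / (n_states - 1))

    def log_emit(x, mu):
        # Gaussian log-density, up to a constant shared by all states
        return -0.5 * ((x - mu) / sigma) ** 2

    def log_trans(i, j):
        return log_stay if i == j else log_move

    # Uniform prior on the first state
    score = [log_emit(signal[0], mu) for mu in levels]
    back = []
    for x in signal[1:]:
        new_score, pointers = [], []
        for j, mu in enumerate(levels):
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans(i, j))
            new_score.append(score[best_i] + log_trans(best_i, j) + log_emit(x, mu))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

signal = [8.1, 7.9, 8.3, 12.8, 13.1, 12.9, 13.2, 8.0, 8.2]
path = viterbi_two_level(signal)
print(path)
```

In this toy run the decoded path changes state where the signal jumps, which is the kind of breakpoint that the slides call an upshift or a downshift.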

Analysis of Tiling Array signals: Normalisation using chromosomal DNA

[Figure: three signal tracks: log(genomic DNA) from ×4 pooled data; log(mRNA); log(mRNA) − log(genomic DNA).]

Probe affinity is variable, despite the adjustment of probe lengths.
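The third track of the figure is a per-probe difference of logs. The sketch below shows only that subtraction, assuming base-2 logarithms and assuming that the four genomic DNA hybridizations are pooled by averaging their log intensities; both choices are assumptions, and the actual normalization also has to handle the non-linear response and outliers mentioned earlier.

```python
import math

def normalize(mrna, gdna_replicates):
    """Per-probe log2(mRNA) minus the mean log2(genomic DNA) over replicates."""
    pooled = [
        sum(math.log2(v) for v in probe_values) / len(probe_values)
        for probe_values in zip(*gdna_replicates)
    ]
    return [math.log2(m) - g for m, g in zip(mrna, pooled)]

# Made-up intensities for three probes and four gDNA hybridizations
mrna = [1024, 4096, 256]
gdna = [
    [256, 2048, 512],
    [1024, 2048, 512],
    [512, 2048, 512],
    [512, 2048, 512],
]
normalized = normalize(mrna, gdna)
print(normalized)
```

Dividing out the genomic DNA response removes the probe-specific part of the signal, so the remaining differences between probes reflect transcript abundance rather than probe affinity.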

Analysis of Tiling Array signals: Shift and drift

[Figure: signal level (y-axis, 6 to 16) against position on chromosome (bp) (x-axis, 1,100,000 to 1,110,000); labels: CDSs, moves.]

Examples of new features discovered with TA: RNA genes

[Figure: genome region 1,228,001 to 1,238,000; annotated genes: mecA yjbF yjbG yjbL yjbM yjbN yjbH yjbI yjbJ yjbK yjbO yjbE; displayed values: 2.946, 2.845.]

Examples of new features discovered with TA: Coding sequences

[Figure: genome region 1,070,001 to 1,080,000. Tracks: sequence features annotation (yhaI ecsA ecsB ecsC prsA yhaK hpr yhaH yhaG serC hit yhaA yhaJ); transcriptome, forward strand (log2 ratio Fwd) and backward strand (log2 ratio Bwd); displayed values: 5.667, 2.409.]

Examples of new features discovered with TA: Antisense related to stress

[Figure: genome region 3,567,001 to 3,577,000; annotated genes: yvcN crh yvcL yvcK yvcJ yvcI trxB yvcE yvcD; displayed values: 3.008, 3.393, 7.624, 2.639, 6.296, 3.111, 2.848, 4.37.]

Examples of new features discovered with TA: A few numbers

In B. subtilis annotation v3:
4,256 CDSs,
5 RNA genes,
30 rRNAs,
86 tRNAs,
57 (-1) 5' cis-acting regions.

New features discovered with TA:
44 new CDSs,
136 new RNA genes,
423 antisense signals (including 4 CDSs and 87 RNA genes),
92 5' cis-acting regions (confirmed for 56),
676 long 5'UTR regions and 125 long 3'UTR regions.

Examples of new features discovered with TA: Combining with ChIP/chip (CcpN)

[Figure: genome region 2,962,001 to 2,972,000. Tracks: sequence features annotation (ytbD ytbE dnaI dnaB ytcG speD gapB ytcD ytaG ytaF mutM); transcriptome, forward and backward strands, for "Glucose to Malate" and for "Malate to Glucose"; CcpN DNA binding (ChIP/chip).]

Examples of new features discovered with TA: Combining RNA gene expression with ChIP/chip (CcpN)

[Figure: genome region 1,528,001 to 1,538,000; annotated genes: SR1 pdhA pdhB pdhC pdhD yktA ykzI yktC ykzC slp speA yktB.]

Promoter and terminator predictions: From upshifts to promoters

TSS positions estimated using TA, compared with RNA-Seq data [1].

[Figure: histogram of the distance between upshifts and TSSs (x-axis: −100 to 100; y-axis: frequency, 0 to 150).]

[1] Irnov I., et al. (2010). Nucleic Acids Res.
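A comparison between upshift positions and RNA-Seq TSSs can be sketched as follows: for each upshift, take the signed distance to the nearest TSS and bin the distances. The function name, the coordinates and the 10-bp bin width are hypothetical, and strand is ignored for brevity.

```python
from bisect import bisect_left
from collections import Counter

def signed_distance_to_nearest(position, sorted_tss):
    """Signed distance (position minus nearest TSS); sorted_tss must be non-empty."""
    i = bisect_left(sorted_tss, position)
    # Only the neighbours on either side of the insertion point can be nearest
    candidates = sorted_tss[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda t: abs(t - position))
    return position - nearest

# Made-up coordinates for illustration
upshifts = [95, 510, 880, 1000]
tss = sorted([900, 100, 500])

distances = [signed_distance_to_nearest(u, tss) for u in upshifts]
# Histogram with 10-bp bins, keyed by the lower bound of each bin
histogram = Counter((d // 10) * 10 for d in distances)
print(distances, dict(histogram))
```

A histogram sharply peaked around zero would indicate that upshifts locate transcription start sites precisely.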

Summarizing correlations between promoter activities.

[Figure: cluster dendrogram (height axis: 0.0 to 0.5).]

A 'promoter tree' is built by hierarchical clustering using average linkage on the dissimilarity matrix $d_{i,j} = (1 - r_{i,j})/2 \in [0, 1]$, where $r_{i,j}$ is the correlation between the activities of promoters $i$ and $j$.
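The construction of such a tree can be sketched in plain Python as below. This is a naive illustration under stated assumptions: Pearson correlation is assumed for r, the promoter names and activity profiles are made up, and a real analysis would more likely rely on a dedicated clustering library.

```python
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def dissimilarity(x, y):
    # d = (1 - r) / 2 maps r in [-1, 1] onto [0, 1]
    return (1.0 - pearson(x, y)) / 2.0

def average_linkage(profiles):
    """Return the merges as (cluster_a, cluster_b, height), in merge order."""
    names = list(profiles)
    d = {frozenset(p): dissimilarity(profiles[p[0]], profiles[p[1]])
         for p in combinations(names, 2)}

    def dist(a, b):
        # Average linkage: mean pairwise dissimilarity between members
        return sum(d[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: dist(*ab))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Made-up activity profiles across four conditions
activities = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.5],
    "P3": [4.0, 3.0, 2.0, 1.0],
}
merges = average_linkage(activities)
for a, b, height in merges:
    print(a, b, round(height, 4))
```

Promoters with nearly identical activity profiles join at a height close to 0, while anti-correlated promoters join close to 1.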

[Figure: schematic of the bipartite promoter model; labels: TSS, −35 box, spacer, −10 box; background, PWM2, PWM1; l2, S, l1, D.]

Promoter prediction: an unsupervised algorithm for modeling bipartite degenerate motifs [1], clustering the sequences from the 3,242 transcription upshifts.

[1] Nicolas P., et al. (2012). Science.
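To make the notion of a bipartite motif concrete, the sketch below scores a sequence with two position weight matrices separated by a variable spacer. It is not the unsupervised algorithm of [1]: the PWMs are toy matrices built from the well-known SigA consensus boxes (TTGACA and TATAAT), and the match probability, the spacer range and all function names are assumptions.

```python
import math

BACKGROUND = 0.25  # uniform base composition, assumed for simplicity

def consensus_pwm(consensus, p_match=0.7):
    """Toy log-odds PWM favouring the consensus base at each position."""
    p_other = (1.0 - p_match) / 3.0
    return [
        {b: math.log2((p_match if b == c else p_other) / BACKGROUND) for b in "ACGT"}
        for c in consensus
    ]

def pwm_score(pwm, site):
    return sum(column[base] for column, base in zip(pwm, site))

def best_bipartite_hit(seq, pwm35, pwm10, spacers=range(16, 19)):
    """Return (score, start, spacer) of the best-scoring bipartite match."""
    l35, l10 = len(pwm35), len(pwm10)
    best = None
    for start in range(len(seq)):
        for s in spacers:
            end = start + l35 + s + l10
            if end > len(seq):
                continue
            score = (pwm_score(pwm35, seq[start:start + l35])
                     + pwm_score(pwm10, seq[start + l35 + s:end]))
            if best is None or score > best[0]:
                best = (score, start, s)
    return best

pwm35 = consensus_pwm("TTGACA")
pwm10 = consensus_pwm("TATAAT")
seq = "GC" + "TTGACA" + "CGCGCGCGCGCGCGCGC" + "TATAAT" + "GCGCGC"
score, start, spacer = best_bipartite_hit(seq, pwm35, pwm10)
print(round(score, 3), start, spacer)
```

Scanning over a small set of spacer lengths is what distinguishes a bipartite model from a single fixed-width motif.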


Behavior of the MCMC algorithm, with K = 20 motifs


Comparison with known Sigma factor binding sites

DBTBS: a database of transcriptional regulation in Bacillus subtilis

DBTBS M19 M14 M4 M3 M7 M5 M16 M8 M11 M13 M17 M9 M1 M15 M10 - M2 M18 M20 M6 M12
- 401 369 349 213 218 170 170 134 127 113 80 43 63 72 48 44 16 11 12 4 5
SigA 59 90 49 1 33 1 22 0 1 0 19 0 1 0 1 1 0 0 0 7 0
SigB 0000000044000000000000
SigD 0000100000100023000000
SigE 0015404010000100000000
SigF 0008000101000010000000
SigG 0000000420000000000000
SigH 0001001100011200000000
SigI 000000000000000010000
SigK 1001038000000010000000
SigL 000000000000000006000
SigM 000000000001000000000
SigW 0010000000033000000000
SigX 000000000002000000000
SigY 000000000002000000000

Sequence logos to represent motifs


Predicted promoters:
758 promoters in DBTBS,
2,935 promoters predicted using the algorithm above,
580 promoters in common,
2,355 new promoters discovered.

46% of genes have multiple promoters.
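The promoter counts above are related by simple arithmetic, which the following lines check. The recovered fraction of DBTBS promoters is derived here from the quoted counts and is not stated on the slide.

```python
# Counts quoted on the slide
dbtbs = 758
predicted = 2935
common = 580

new = predicted - common           # predicted promoters absent from DBTBS
recovered_fraction = common / dbtbs  # share of DBTBS promoters recovered
print(new, f"{recovered_fraction:.1%}")
```

So roughly three quarters of the DBTBS promoters are recovered by the prediction, and the new promoters are the predictions left after removing those shared with DBTBS.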

Promoter and terminator predictions: Terminators and downshifts

Terminator predictions:

3,510 putative sites from a genome scan with the Petrin software [1]; identification of 2,126 high-confidence downshift sites; 1,501 putative terminators confirmed by downshifts (≈ 70% of downshifts).

Three types of terminations: sharp, partial, missed termination.

[1] d'Aubenton-Carafa Y., et al. (1990). J. Mol. Biol..
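Confirming predicted terminators with downshifts amounts to matching two lists of genomic positions. A minimal sketch follows, assuming a fixed distance tolerance; the function name, the 50-bp window and the coordinates are hypothetical, and the matching rule actually used in the study is not given on the slide.

```python
from bisect import bisect_left

def confirmed_downshifts(downshifts, terminators, max_dist=50):
    """Downshifts with at least one predicted terminator within max_dist bp."""
    terminators = sorted(terminators)
    confirmed = []
    for pos in downshifts:
        # First terminator at or beyond the left edge of the window
        i = bisect_left(terminators, pos - max_dist)
        if i < len(terminators) and terminators[i] <= pos + max_dist:
            confirmed.append(pos)
    return confirmed

# Made-up coordinates for illustration
downshifts = [1000, 5000, 9000]
terminators = [980, 5200, 9050]
matched = confirmed_downshifts(downshifts, terminators)
print(matched, len(matched) / len(downshifts))
```

With the numbers quoted above, the analogous ratio is 1,501 / 2,126 ≈ 0.71.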

Promoter and terminator predictions: A few examples

[Figure: panels a and b covering genome regions 1,070,001 to 1,078,000 and 1,230,001 to 1,234,000, with annotated genes (yhaL coiA yhaI yhzF pepF yhzE yhaJ yhaH serC ecsA prsA hinT scoC trpP yizD) and labelled features (U792 to U807 and U930 to U935, D535 to D545 and D627 to D629, S347 to S357, S414 and S415).]

Promoter and terminator predictions: A few more examples

[Figure: panels a, b and c covering genome regions 2,839,001 to 2,843,000, 1,297,001 to 1,301,000 and 694,001 to 698,000, with annotated genes (yrzT ndh yebD yrzF yrbD yjlB uxaC pbuG yrbE yrzH yjlA rex yebC), a "Purine" label, and labelled features (U493, U494, U1005, U1006, U1008, U2147 to U2152; D313, D314, D316, D317, D684, D1421 to D1423; S228, S230, S449 to S451, S1051 to S1053).]

Perspectives

A huge set of expression data (104 conditions) on the gene repertoire of B. subtilis: functional annotation (CDSs, RNA genes, etc.).

Antisense and transcription accuracy in bacteria: biological function, bias with alternative promoters, majority of signals with missed termination, promoters for antisense less conserved than promoters for CDSs.

Thank you
