Title Detection of tissue-specific genes and computational analysis of testis-specific gene expression regulatory regions
Author(s) 山下, 明史
Citation
Issue Date
Text Version ETD
URL http://hdl.handle.net/11094/2599
DOI
rights
Note
Osaka University Knowledge Archive : OUKA
https://ir.library.osaka-u.ac.jp/
Osaka University
Detection of tissue-specific genes and computational analysis
of testis-specific gene expression regulatory regions
(組織特異的遺伝子の検出と精巣特異的遺伝子の
発現制御領域のコンピュータを使った解析)
by
Akifumi Yamashita
Department of Genome Informatics,
Genome Information Research Center,
Research Institute for Microbial Diseases,
Osaka University, 3-1 Yamadaoka, Suita,
Osaka 565-0871, Japan
February, 2009
Contents
LIST of ABBREVIATIONS 5
ABSTRACT 6
1 INTRODUCTION 7
2 METHODS 10
2.1 Detection of tissue-specific genes 10
2.2 Annotation of tissue-specific genes 10
2.3 Extracting the 5′-regulatory regions 11
2.4 Selection of over-represented 8-mers 11
2.5 Comparison of the 8-mer frequency within regulatory regions of testis-specific genes with those of non-testis-specific genes 13
3 RESULTS 14
3.1 Detection of tissue-specific genes 14
3.2 Selection of testis-specific genes which can be used for the following analysis 14
3.3 Classification of the over-represented 8-mers 15
3.4 Conservations of the flanking region of highlighted 8-mers 16
4 DISCUSSION 18
ACKNOWLEDGEMENTS 23
REFERENCES 24
TABLES 27
Table 1 - Number of tissue-specific genes 27
Table 2 - Classification of 117 representative 8-mers on the basis of their testis association level 28
Table 3 - 8-mers which appeared near by the other 8-mers 31
FIGURES 32
Fig. 1 - Number of tissue-specific genes 32
Fig. 2 - Identification of testis-specific genes and their 5′-regulatory regions 33
Fig. 3 - Profiling of 634 testis-specific genes 35
Fig. 4 - Conservations of flanking region of the highlighted 8-mers in testis-specific and non-testis-specific regulatory regions 36
APPENDIX TABLES 40
Appendix Table 1 - List of tissue specific genes 40
Appendix Table 2a - Testis-specificity of the genes, representative 8-mer and annotation 75
Appendix Table 2b - Testis-specificity and annotation of the genes whose regulatory regions were not completely sequenced or overlapped with other regulatory regions 90
Appendix Table 2c - Testis-specificity and annotation of the genes which were not supported to cover the intact 5′-ends 92
Appendix Table 2d - Testis-specificity and annotation of 18 testis-specific genes whose clone IDs were duplicated with other genes shown in Appendix Table 2a, b, and c 95
Appendix Table 2e - Testis-specificity of the genes whose clone IDs were not available 96
PUBLICATIONS 97
LIST of ABBREVIATIONS
bp base pair(s)
cAMP cyclic adenosine 3′, 5′-monophosphate
cDNA DNA complementary to RNA
CRE cAMP response element
CSS cardiac specific sequence
GNF Genomics Institute of the Novartis Research Foundation
S.D. standard deviation
TPA 12-O-tetradecanoylphorbol-13-acetate
TRE TPA-responsive element
TSS transcription start site
UCSC University of California, Santa Cruz, USA
A Adenine
T Thymine
G Guanine
C Cytosine
R Guanine or Adenine (puRine)
Y Thymine or Cytosine (pYrimidine)
H Adenine or Thymine or Cytosine (not Guanine, H follows G)
V Guanine or Cytosine or Adenine (not Thymine (Uracil), V follows U)
B Guanine or Cytosine or Thymine (not Adenine, B follows A)
N Any nucleotide (aNy)
ABSTRACT
The accumulation of microarray data has enabled the computational analysis of gene expression in various tissues. We collected genes showing tissue-, organ-, or cell-specific expression using the GNF mouse gene expression database. We then searched for features of the regulatory regions of testis-specific genes, because genes showing testis-specific expression are the most abundant among the genes exhibiting tissue-specific expression, and because no systematic study has been conducted on over-represented motifs within their regulatory regions. We identified 117 over-represented 8-nucleotide sequences (8-mers) that appeared 2,648 times within the regulatory regions of 634 testis-specific genes. Of these, 64 over-represented 8-mers were significantly more frequent in the regulatory regions of testis-specific genes than in those of non-testis-specific genes. In this group, four 8-mers differed from the canonical cAMP response element (CRE) 8-mer by 1 nucleotide, whereas the frequency of the canonical CRE in the regulatory regions of testis-specific genes did not differ significantly from that in the regulatory regions of non-testis-specific genes. We consider that these CRE-like 8-mers participate in the regulation of testis-specific gene expression to a greater extent than the canonical CRE 8-mer.
1. INTRODUCTION
The differences among cell types in a multicellular organism arise from differences in gene expression. Gene expression can be regulated at many steps in the pathway from DNA to protein, including transcriptional control, RNA processing control, RNA transport and localization control, translational control, mRNA degradation control, and protein activity control. For most genes, transcriptional control is the major point of regulation. The transcriptional control of each gene, whether complex or simple, is carried out by interactions between gene regulatory proteins and short stretches of DNA of defined sequence in a regulatory region relatively near the site where transcription begins (Molecular Biology of the Cell, fourth edition, chapter 7).
We can now search for such DNA sequences in regulatory regions using statistical methods, because the genomic sequences of more than 48 eukaryotes (http://www.ensembl.org/, as of Nov. 21, 2008) and the transcription start sites (TSSs) of genes have been determined. FitzGerald et al. (2004) used the database of TSSs (dbTSS; Suzuki et al., 2004) to collect human promoter sequences and investigated the distribution of 8-nucleotide sequences (8-mers). They found that 156 8-mers that cluster very significantly near the TSS could be placed into 9 groups of related sequences, 7 of which were found to be known transcription factor binding sites.
Furthermore, microarray technology has made it possible to measure the expression levels of many genes in many different tissues. This technology can be used to find genes with unique expression patterns, such as tissue-specific genes. For example, Bono et al. (2003) identified 7,206 separate mRNA clones that satisfied stringent criteria for tissue-specific expression based on the expression profiles of 57,931 clones in 20 mouse tissues using cDNA microarrays.
In this way, we are able to collect tissue-specific genes and their regulatory regions. We began our research by detecting tissue-specific genes. We then used testis-specific genes to investigate the mechanism regulating tissue-specific gene expression, because spermatogenesis is an excellent model for studying the regulation of gene expression during differentiation (Wolgemuth and Watrin, 1991), and because no systematic study has searched for frequently appearing sequences within these regulatory regions as candidate transcription factor binding sites.
To find these motifs, we initiated the systematic isolation of testis-specific genes based on the method of Bono et al. (2003) with more stringent parameters, using the latest gene expression database of the Genomics Institute of the Novartis Research Foundation (GNF), which provides gene expression data for 36,182 mouse genes obtained from 2 independent experiments with 61 tissues, organs, and cells (Su et al., 2004). The functions of these testis-specific genes were checked, and the regulatory sequences were then extracted based on the positional information of the TSSs (Suzuki et al., 2004). We then attempted to identify the regulatory motifs participating in testis-specific gene expression on the basis of the following working hypotheses: (i) most of the regulatory motifs are localized within the 5′-regulatory regions between –300 bp and +50 bp ([–300 to +50]) relative to the TSSs (FitzGerald et al., 2004), (ii) a length of 8 nucleotides (8-mer) is adequate for searching for regulatory motifs (FitzGerald et al., 2004), and (iii) most of the significantly over-represented motifs are regulatory motifs.
To find significantly over-represented motifs within these regulatory regions, we conducted the following two-step analysis. First, to find significantly over-represented 8-mers, we estimated the expected number of appearances of each 8-mer from the nucleotide content of these regions. If an 8-mer appeared significantly more frequently than expected, it was regarded as an over-represented 8-mer. After collecting the 8-mers showing the greatest significance in each regulatory region, we proceeded to the second step: we compared the appearance frequency of these 8-mers per regulatory region among testis-specific genes with that among non-testis-specific genes. If an 8-mer appears significantly more frequently in the regulatory regions of testis-specific genes than in those of non-testis-specific genes, the 8-mer is likely to be related to the expression of testis-specific genes.
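The first step described above can be illustrated with a minimal Python sketch. It is not the program used in this study. It assumes that the expected count of each 8-mer is estimated by resampling the nucleotides of each region with replacement, and that over-representation is scored as a Z value (observed count minus the mean count in the resampled sets, divided by their standard deviation). The function names `count_kmers`, `shuffled_like`, and `kmer_z_scores`, and the default of 100 iterations, are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def count_kmers(seqs, k=8):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def shuffled_like(seq, rng):
    """Assumption: build a background sequence of the same length by
    sampling the region's own nucleotides with replacement."""
    return "".join(rng.choices(seq, k=len(seq)))


def kmer_z_scores(seqs, k=8, n_iter=100, seed=0):
    """Z value of each observed k-mer against resampled sequence sets."""
    rng = random.Random(seed)  # Python's generator is a Mersenne Twister
    observed = count_kmers(seqs, k)
    total = Counter()     # sum of counts over iterations
    total_sq = Counter()  # sum of squared counts over iterations
    for _ in range(n_iter):
        background = count_kmers([shuffled_like(s, rng) for s in seqs], k)
        for kmer in observed:
            c = background[kmer]
            total[kmer] += c
            total_sq[kmer] += c * c
    z = {}
    for kmer, obs in observed.items():
        mean = total[kmer] / n_iter
        var = total_sq[kmer] / n_iter - mean * mean
        if var > 0:  # skip k-mers never seen in the background
            z[kmer] = (obs - mean) / var ** 0.5
    return z
```

An 8-mer that never occurs in any resampled set has zero variance and receives no Z value in this sketch; a larger number of iterations reduces the number of such cases.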
2. METHODS
2.1. Detection of tissue-specific genes
Tissue-specific genes were selected from the Mouse GNF1M (MAS5-condensed) expression database, which was downloaded from the GNF website (http://wombat.gnf.org/index.html). This database contains the gene expression data of 36,182 mouse genes obtained from 2 independent experiments with 61 tissues, organs, and cells (hereafter referred to simply as tissues). The gene expression intensities of the 2 experiments were averaged and then log-transformed. Since each gene has expression data from 61 tissues, these data were normalized for each gene.
Bono et al. (2003) denoted a gene as tissue-specific if the normalized value exceeded the mean + 3 standard deviations (S.D.) for their cDNA microarray data and the mean + 2 S.D. for Affy chips. For a more stringent evaluation, however, we denoted a gene as “tissue-specific” if the normalized value in the testis exceeded the mean + 4 S.D. We will refer to this value as the tissue specificity of the gene.
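The selection rule described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the program used in this study: the logarithm base (2), the use of the population standard deviation, and the interpretation of "normalized" as (value − mean) / S.D. across the tissues of one gene are all assumptions, and the function names `tissue_specificity` and `specific_tissues` are hypothetical.

```python
import math
import statistics


def tissue_specificity(rep1, rep2):
    """Per-tissue normalized scores for one gene.

    rep1, rep2: positive expression intensities from the 2 experiments,
    one value per tissue, in the same tissue order.
    """
    # Average the 2 experiments, then log-transform (base 2 is an assumption).
    logged = [math.log2((a + b) / 2.0) for a, b in zip(rep1, rep2)]
    mean = statistics.fmean(logged)
    sd = statistics.pstdev(logged)
    if sd == 0:
        return [0.0] * len(logged)  # flat profile: no tissue stands out
    # Assumed normalization: number of S.D. above the gene's own mean.
    return [(x - mean) / sd for x in logged]


def specific_tissues(scores, tissues, threshold=4.0):
    """Tissues whose score exceeds mean + threshold S.D."""
    return [t for t, s in zip(tissues, scores) if s > threshold]
```

With 61 tissues, a score above 4 can only be reached when one tissue stands far apart from nearly all the others, which is why the threshold acts as a stringent filter.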
2.2. Annotation of tissue-specific genes
We downloaded the Chip Annotation Mouse GNF1M from the abovementioned GNF site and searched for either the FANTOM cDNA clone ID (Carninci et al., 2005) or the reference mRNA sequence (RefSeq) ID corresponding to each of the selected tissue-specific genes. Subsequently, the gene annotations were checked in the GNF annotation data and in the FANTOM 3 and RefSeq databases.
In particular, for testis-specific genes with no available functional annotations, we performed a BLAST search (Altschul et al., 1997) against the National Center for Biotechnology Information (NCBI) non-redundant (nr) and/or RefSeq database and attempted to identify homologous genes having functional annotations. We classified the genes with no functional annotations into 2 groups: the “hypothetical protein” group, which comprised genes encoding unidentified proteins with more than 100 amino acid residues, and the “unclassifiable gene” group, which comprised genes either encoding unidentified proteins with fewer than 100 amino acid residues or lacking any reading frame.
2.3. Extracting the 5′-regulatory regions
Next, we checked whether the selected cDNA clones had experimental evidence of covering the intact 5′-ends in either the RIKEN FANTOM3 database (Carninci et al., 2005) or the database of transcriptional start sites (dbTSS; http://dbtss.hgc.jp/; Suzuki et al., 2004). FitzGerald et al. (2004) searched for regulatory motifs within the [–2500 to +500] region, and their figure shows that these motifs mainly appear within the genomic [–300 to +50] region relative to the TSS; therefore, we refer to the [–300 to +50] regions as the regulatory regions in our study. These regulatory regions were extracted from the mouse mm5 genomic sequence provided by the University of California, Santa Cruz, USA (UCSC) (Karolchik et al., 2003). When the regulatory regions of 2 testis-specific genes were found to overlap on the same strand, the region with the higher testis specificity was selected as the representative region. All regulatory regions without a complete sequence were excluded from further analysis.
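The extraction of a [–300 to +50] region can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used: it assumes that the TSS is given as a 0-based index on the forward strand and counted as position +1 (so the region is 350 nucleotides long), that regions on the minus strand are reverse-complemented, and that "without a complete sequence" means running off the chromosome end or containing unsequenced bases (N). The names `reverse_complement` and `regulatory_region` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def regulatory_region(chrom_seq, tss, strand, upstream=300, downstream=50):
    """Return the [-upstream to +downstream] region around a TSS, or None.

    chrom_seq: chromosome sequence (forward strand) as a string.
    tss: 0-based forward-strand index of the TSS (treated as position +1).
    strand: "+" or "-".
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # Upstream of a minus-strand gene lies at higher coordinates.
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    if start < 0 or end > len(chrom_seq):
        return None  # region runs off the chromosome
    region = chrom_seq[start:end]
    if strand == "-":
        region = reverse_complement(region)
    if "N" in region.upper():
        return None  # assumed meaning of an incomplete sequence
    return region
```

In the returned string, the TSS is always at index 300, regardless of strand, so motif positions can be compared directly between genes.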
2.4. Selection of over-represented 8-mers
Over-represented 8-mers within the extracted regulatory regions were selected as follows: (i) the appearances of each possible 8-mer (4^8 = 65,536 8-mers) within the extracted regulatory regions of testis-specific genes were counted; (ii) a set of shuffled sequences corresponding to each extracted regulatory region was generated from the original regulatory regions by random sampling with replacement (Matsumoto and Nishimura, 1998); (iii) the number of appearances of each possible 8-mer in the shuffled sequence set was counted; (iv) after repeating steps (ii) and (iii) 10,000 times, the average appearance and variance were calculated for each possible 8-mer; and (v) to estimate the statistical significance of the over-representation of each 8-mer, the Z value was calculated using the following formula:
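$$Z = \frac{N_{\text{observed}} - \overline{N}_{\text{shuffled}}}{\sqrt{\mathrm{Var}(N_{\text{shuffled}})}},$$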