Molecular Vision 2002; 8:161-3 © 2002 Molecular Vision Received 22 April 2002 | Accepted 5 June 2002 | Published 15 June 2002

A project for ocular bioinformatics: NEIBank

Graeme Wistow

Section on Molecular Structure and Function, National Eye Institute, National Institutes of Health, Bethesda, MD

Correspondence to: Graeme Wistow, Ph.D., Chief, Section on Molecular Structure and Function, National Eye Institute, Building 6, Room 331, National Institutes of Health, Bethesda, MD, 20892-2740; Phone: (301) 402-3452; FAX: (301) 496-0078; email: [email protected]

Background: NEIBank is a project aimed at helping to integrate different kinds of data for eye research into a molecular encyclopedia of the eye. This should eventually include a wide range of input from genomics, genetics, structure/function and expression studies. As a starting point, the project has begun with efforts to assemble a catalog of the genes expressed in different parts of the eye.

A large proportion of all known (but usually not understood) human genes have been identified through expressed sequence tag (EST) analyses and major efforts, such as the Cancer Genome Anatomy Project (CGAP), have been instituted to produce EST resources for different tissue and disease systems. This type of genome project uncovers new genes but, equally importantly, it also helps to define the repertoire of expressed genes, novel or known, for different tissues and cell types. Overall, public EST projects have contributed over 9 million partial cDNA sequences from many species and resources such as UniGene have been employed to group these sequences into clusters that (potentially) represent specific human genes. However, with the notable exception of , most human eye tissues have, until quite recently, been rather poorly represented in most of the analyses available in the public domain.

The eye is a complex system of highly differentiated tissues of various developmental origins. Many genes essential for eye function are tissue-specific or highly tissue-preferred and, not surprisingly, many of these have proved to be associated with genetically based eye diseases [1-4]. At the same time, many genes with more diverse patterns of expression are also essential for normal eye function, while for both novel and known genes there may be alternative transcripts that can have important consequences for the role of the gene in the eye. For example, the transcription factor Pax6 is expressed in different parts of the eye, brain and pancreas [5-8] but it seems to use different patterns of alternative splicing in different tissues to fine-tune its function and to select different families of target genes [9-11].

Strategies for NEIBank: For NEIBank a key strategy has been to produce high quality cDNA libraries from dissected tissues, to sequence the clones primarily from the 5' end to increase the access to full-length clones and functionally significant splice variants, and to develop informatics tools to analyze and display the results. The aim is to use cDNA libraries that represent as closely as possible the transcript profile of the specialized tissues of the eye. While cultured cells can be powerful experimental tools, it is clear that they differ significantly from the tissue from which they were derived. As just one of many known examples (both published and unpublished), MHC expression seems to be absent in intact lens but is present in cultured cells that are used as models for lens [12].

Various steps in library synthesis and analysis can also alter the abundance and quality of clones. For example, cDNA libraries are commonly amplified to increase the available resource. This can drastically reduce the frequency of some clones that grow poorly, perhaps because of size, G+C content or other factors. In many studies aimed primarily at gene discovery, cDNA libraries are normalized [13]. This is a powerful methodology that essentially employs a process of self-subtraction to enrich for tissue-preferred or rare transcripts and to remove the more abundant clones. However, it necessarily eliminates any information on transcript abundance and also tends to increase the abundance of cloning artifacts (since these are rare). So, as far as possible, NEIBank uses un-amplified, un-normalized libraries to obtain as close as possible a representation of normal transcript abundance, to maximize clone length and to allow the discovery of “difficult” clones that might disappear during library manipulation. As described below and in the accompanying papers, this seems to have been successful [14-17]. In some cases, libraries have later been amplified and normalized to reduce the content of highly abundant clones (such as crystallins in the lens) for “deeper” sequencing of the library, or in procedures to reduce the content of empty vector [14,15], but in these cases, low temperature, semi-solid methods have been used during library expansion in order to minimize any growth bias [18,19].

Sequencing Strategy: Typical EST analyses employ 3' sequence reads in order to anchor clusters of clones to “unique” 3' ends. This has been useful, but as perusal of UniGene will reveal, it has not been completely successful. Many genes have multiple 3' ends and there may be mis-priming at internal A-rich regions. Furthermore, 3' sequencing often encounters problems in reading through the highly repetitive polyA tail. For these reasons 5' sequence reads have been emphasized for NEIBank, although some sets of clones have been sequenced in both directions to confirm that the libraries are generally complete at the 3' end and to gain more sequence for apparently novel clones. This strategy gives a high return in numbers of “quality” sequences, as judged by the program PHRED [20], and also increases the chances of observing novel protein coding regions and alternative splicing events. This is because 3' untranslated regions (UTR), which naturally contain no open reading frame (ORF), may be long and are rarely interrupted by introns in the genome.

The un-normalized libraries subjected to extensive sequencing so far show a high fraction of “full-length” cDNAs, inasmuch as over 50% of cDNAs corresponding to known sequences contain the initiator codon start site of the open reading frame. Indeed, the un-normalized lens library [14], which contains a high content of relatively short transcripts due to its crystallin content, contains approximately 75% “full-length” cDNAs. For the libraries representing the four tissues described in the accompanying papers, human lens, iris, retina and RPE/choroid [14-17], the content of bacterial, mitochondrial and other contaminant sequences is typically no more than 5% and ribosomal RNA content is low, ranging from 0.4% in the normalized iris library to 3% in the un-normalized lens library. The typical read length for quality EST sequences is 500 bp and is often longer.

Informatics: A major effort has been made to use bioinformatics to assemble, organize and present the NEIBank EST data, to identify and group the high quality sequences and to remove the various classes of poor quality, non-mRNA contaminants and chimeric clones. This has evolved into a rules-based procedure named GRIST (GRouping and Identification of Sequence Tags) [21] that uses sequence matches generated by BLAST programs [22] and extracts information from GenBank, UniGene, and other databases. The collated information on grouped and identified cDNAs from the EST analyses is displayed at the NEIBank web site. In addition to data derived from libraries made specifically for NEIBank, the same procedures are used to extract, organize and display EST data for eye tissues from parallel sequencing efforts. Many keywords for topics such as functional class and chromosomal location are incorporated as well as links to many related sites, including a direct link for each group or cluster of related sequences to the human and mouse genome builds at the Human Genome Project.

However, the real purpose of all this work is to produce insight into the molecular mechanisms of the eye. The four accompanying papers, which describe cDNA libraries for human eye tissues, give examples of some of the classes of biologically interesting information that can be mined from the accumulating data. Here are some highlights.

Gene Discovery: In addition to a number of apparently novel genes whose transcripts are represented in the sequence datasets at single copy levels, several newly recognized genes were found amongst the most abundant cDNA clones in the libraries. These include lengsin, a novel glutamine-synthetase superfamily member in the lens [14]; oculoglycan/opticin, a novel member of the small leucine rich proteoglycans found in iris and in RPE [15,17,23]; and retbindin, an abundant transcript in the retina library that appears to encode a secreted binding protein [16]. Some of the novel genes seen in these libraries may have escaped detection previously because of their high G+C content. This is exemplified by oculospanin, which is found in the iris and RPE/choroid libraries, and whose cDNA is 70% G+C rich [15,17], and IEGF/PDGFD, a new member of the PDGF/VEGF family of growth factors that is expressed in human iris [15].

Variant Transcripts: Detailed inspection of many of the groups of clones revealed novel splice variants with potential biological significance. These include a major new splice form of the lens protein MP19/Lim2 that encodes a larger version of the protein [14]; a variant of the retinal transcription factor Nrl that makes use of an exon in what would otherwise be an intron of the major transcript [16]; alternative versions of Bestrophin that are actually the dominant forms of transcript for this gene detected among cDNAs from the RPE/choroid library [17]; and a splice variant of oculoglycan/opticin that deletes a conserved motif without disrupting the rest of the open reading frame [17]. Many other splice variants can be seen. Some are unlikely to produce functional proteins but could still have biomedical importance. In very long-lived cells such as those in lens, retina and RPE, the accumulation of mis-spliced transcripts and their protein products could contribute to declining cellular function, particularly with age. A possible example of this sort of splice “accident”, involving γS-crystallin in the lens, has been described previously [24].

Normal Repertoire: In addition to the discovery of new genes, the EST libraries also make a useful contribution to cataloging the genes that are normally expressed in eye tissues. These data provide a baseline for other studies, such as SAGE analyses [25] in which short tag sequences derived from the 3' ends of cDNAs are identified. Novel SAGE tags may represent new genes. However, they are typically not long enough to allow unambiguous identification of new genes and they could also arise through cloning or amplification artifacts. Thus it is useful to have a reference set of cDNA clones for comparison. A recent SAGE study has suggested that many of the most abundant genes expressed in RPE are tissue-specific and novel [26]. In contrast, cDNA EST sequencing, which allows for clearer identification, finds that the most abundant genes expressed in RPE/choroid are not tissue specific [17] and raises the possibility that novel SAGE tags may not necessarily represent novel genes.

Information on the transcriptional repertoire of eye tissues is also of importance for micro-array studies. The EST expression data provide a view of what should be detectable in array hybridizations of RNA from particular tissues, helping to validate baseline studies. More directly, sequence verified EST clones provide the critical resource for the construction of cDNA micro-arrays. Indeed, clones from the NEIBank collection are now being used to construct cDNA arrays of eye-expressed genes. The expression data also help in the evaluation of commercially produced arrays, giving some basis for judging how many of the major genes expressed in eye are actually represented.

Future Developments: Since the NEIBank web site has been available, several other research groups have joined the effort to expand the EST representation of eye tissues for humans and for other species. As a result of this, and of continuing sequencing of several NEIBank libraries, the database continues to expand. Collaborations are already underway to add resources for structural biology and proteomics to the NEIBank web site. Suggestions, additions and links to other relevant resources are welcomed.

ACKNOWLEDGEMENTS

I thank the colleagues and collaborators who have made this effort possible. Their contributions are recognized in the accompanying papers. I also particularly thank Dr. Robert Nussenblatt of NEI for his support and encouragement.

REFERENCES

1. Hejtmancik JF. The genetics of cataract: our vision becomes clearer. Am J Hum Genet 1998; 62:520-5.
2. He W, Li S. Congenital cataracts: gene mapping. Hum Genet 2000; 106:1-13.
3. Phelan JK, Bok D. A brief review of retinitis pigmentosa and the identified retinitis pigmentosa genes. Mol Vis 2000; 6:116-24.
4. Clarke G, Heon E, McInnes RR. Recent advances in the molecular basis of inherited photoreceptor degeneration. Clin Genet 2000; 57:313-29.
5. Li HS, Yang JM, Jacobson RD, Pasko D, Sundin O. Pax-6 is first expressed in a region of ectoderm anterior to the early neural plate: implications for stepwise determination of the lens. Dev Biol 1994; 162:181-94.
6. Walther C, Gruss P. Pax-6, a murine paired box gene, is expressed in the developing CNS. Development 1991; 113:1435-49.
7. St-Onge L, Sosa-Pineda B, Chowdhury K, Mansouri A, Gruss P. Pax6 is required for differentiation of glucagon-producing alpha-cells in mouse pancreas. Nature 1997; 387:406-9.
8. Turque N, Plaza S, Radvanyi F, Carriere C, Saule S. Pax-QNR/Pax-6, a paired box- and homeobox-containing gene expressed in neurons, is also expressed in pancreatic endocrine cells. Mol Endocrinol 1994; 8:929-38.
9. Epstein JA, Glaser T, Cai J, Jepeal L, Walton DS, Maas RL. Two independent and interactive DNA-binding subdomains of the Pax6 paired domain are regulated by alternative splicing. Genes Dev 1994; 8:2022-34.
10. Richardson J, Cvekl A, Wistow G. Pax-6 is essential for lens-specific expression of zeta-crystallin. Proc Natl Acad Sci U S A 1995; 92:4676-80.
11. Jaworski C, Sperbeck S, Graham C, Wistow G. Alternative splicing of Pax6 in bovine eye and evolutionary conservation of intron sequences. Biochem Biophys Res Commun 1997; 240:196-202.
12. Shaughnessy M, Wistow G. Absence of MHC gene expression in lens and cloning of dbpB/YB-1, a DNA-binding protein expressed in mouse lens. Curr Eye Res 1992; 11:175-81.
13. Bonaldo MF, Lennon G, Soares MB. Normalization and subtraction: two approaches to facilitate gene discovery. Genome Res 1996; 6:791-806.
14. Wistow G, Bernstein SL, Wyatt MK, Behal A, Touchman JW, Bouffard G, Smith D, Peterson K. Expressed sequence tag analysis of adult human lens for the NEIBank Project: Over 2000 non-redundant transcripts, novel genes and splice variants. Mol Vis 2002; 8:171-84.
15. Wistow G, Bernstein SL, Ray S, Wyatt MK, Behal A, Touchman JW, Bouffard G, Smith D, Peterson K. Expressed sequence tag analysis of adult human iris for the NEIBank Project: Steroid-response factors and similarities with retinal pigment epithelium. Mol Vis 2002; 8:185-95.
16. Wistow G, Bernstein SL, Wyatt MK, Ray S, Behal A, Touchman JW, Bouffard G, Smith D, Peterson K. Expressed sequence tag analysis of human retina for the NEIBank Project: Retbindin, an abundant, novel retinal cDNA and alternative splicing of other retina-preferred gene transcripts. Mol Vis 2002; 8:196-204.
17. Wistow G, Bernstein SL, Wyatt MK, Fariss RN, Behal A, Touchman JW, Bouffard G, Smith D, Peterson K. Expressed sequence tag analysis of human RPE/choroid for the NEIBank Project: Over 6000 non-redundant transcripts, novel genes and splice variants. Mol Vis 2002; 8:205-20.
18. Hanahan D, Jessee J, Bloom FR. Plasmid transformation of Escherichia coli and other bacteria. Methods Enzymol 1991; 204:63-113.
19. Kriegler M. Gene transfer and expression: a laboratory manual. New York: Stockton Press; 1990.
20. Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998; 8:186-94.
21. Wistow G, Bernstein SL, Touchman JW, Bouffard G, Wyatt MK, Peterson K, Gao J, Buchoff P, Smith D. Grouping and identification of sequence tags (GRIST): Bioinformatics tools for the NEIBank database. Mol Vis 2002; 8:164-70.
22. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215:403-10.
23. Hobby P, Ward FJ, Denbury AN, Williams DG, Staines NA, Sutton BJ. Molecular modeling of an anti-DNA autoantibody (V-88) and mapping of its V region epitopes recognized by heterologous and autoimmune antibodies. J Immunol 1998; 161:2944-52.
24. Wistow G, Sardarian L, Gan W, Wyatt MK. The human gene for gammaS-crystallin: alternative transcripts and expressed sequences from the first intron. Mol Vis 2000; 6:79-84.
25. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science 1995; 270:484-7.
26. Sharon D, Blackshaw S, Cepko CL, Dryja TP. Profile of the genes expressed in the human peripheral retina, macula, and retinal pigment epithelium determined through serial analysis of gene expression (SAGE). Proc Natl Acad Sci U S A 2002; 99:315-20.

The print version of this article was created on 11 July 2002. This reflects all typographical corrections and errata to the article through that date. Details of any changes may be found in the online version of the article.
