Figure 1. Overview of BioThesaurus construction.

[Figure 1 diagram. Recoverable labels. Source databases: UniProtKB, PIR-PSD, EntrezGene, NCBI RefSeq, GenPept, genome databases (FlyBase, WormBase, MGD, SGD, RGD), HUGO, EC, UMLS, OMIM, and others. Processing labels: iProClass, Name Extraction, Raw Thesaurus, Name Filtering (Highly Ambiguous and Nonsensical Terms), Semantic Typing, BioThesaurus (UniProtKB Entries: Names & Synonyms).]

Figure 2. BioThesaurus search and report (Examples: UniProtKB P48032 and TIMP3).

A. The BioThesaurus interface provides options to search by text (1, e.g., TIMP3) or by UniProtKB identifier (2, e.g., P48032) (http://pir.georgetown.edu/pirwww/iprolink/biothesaurus.shtml).

B. Text search results for TIMP3. Additional searches with Boolean operations are provided (3); the number of entries sharing the same term is reported, and the entries are listed along with their names, organisms and protein classifications (4); among the entries is the rat UniProtKB entry P48032 (TIMP-3) (5).

C. BioThesaurus report for UniProtKB entry P48032. General protein information is provided for each protein (6). The detailed report of synonyms and their sources gives the total numbers of synonym groups, text variants and distinct data sources (7), followed by the names of each synonym group (8), frequency counts (9), and text variants and their sources (10) (http://pir.georgetown.edu:60888/cgi-bin/biothesaurus.pl?id=P48032).

Table 1. Source databases for BioThesaurus construction.

Database    #Records Mapped to UniProtKB    Annotation Fields Extracted                                        # Names in BioThesaurus
UniProtKB   2,083,713                       DE, GN                                                             1,070,596
PIR-PSD     243,796                         ALTERNATE_NAMES, GENETICS, TITLE                                   340,397
Gene        990,757                         DESCRIPTION, SYMBOL, SYNONYMS                                      1,200,792
RefSeq      1,314,514                       DEFINITION                                                         380,770
GenPept     2,429,669                       FEATURES_CDS, FEATURES_gene, FEATURES_mRNA                         603,950
HUGO        11,739                          ALIASES, ENZYME, NAME, PREVIOUSGENENAME, PREVIOUSSYMBOL, SYMBOL    44,166
FlyBase     16,463                          NAME, SYMBOL, SYN                                                  52,346
RGD         6,165                           NAME, SSLP, SYMBOL                                                 12,221
MGD         11,585                          ALIASES, NAME, SYMBOL, SYNONYM                                     33,696
SGD         4,486                           GENEPRODUCT, LOCUSNAME, ORFNAME, OTHERNAME                         11,742
WormBase    20,282                          LOCI, WORMPEP                                                      24,401
EC          2,394                           ECNUM:AN, ECNUM:DE                                                 6,966
OMIM        8,857                           TI                                                                 35,029

Total number of database records: 7,144,420
Total number of distinct names: 2,869,972

This table is based on BioThesaurus Release 1.0, August 25, 2005. The updated table is available at http://pir.georgetown.edu/iprolink/biothesaurus/statistics/.


Table 2. BioThesaurus name filter (regular expressions to filter out a list of highly ambiguous and nonsensical terms)

# The meaning of the patterns:
#   [0-9\.]: allowable characters are 0, 1, 2, …, 9, and .
#   +: one or more
#   ?: zero or one
#   *: zero or more
#   \s: blank space
#   (a|b): a or b

[0-9\.]+k?aa\s*long hypothetical ? #Example: 202aa long hypothetical protein

[0-9\.]+k proteins?

[0-9]+\-[0-9]+

[0-9\.]+k?

hypothetical protein precursors?

unnamed protein products?

(conserved|conserved hypothetical|expressed|hypothetical|hypothetical conserved|novel|predicted|putative|putative exported|unknown)( polyproteins?| proteins?| orfs?)?
#Examples: conserved proteins
#          putative exported proteins
#          expressed protein
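
The following is a minimal sketch of how a filter of this kind could be applied to candidate names. It is not the BioThesaurus implementation. It uses a subset of the patterns listed in Table 2, and it assumes that a name is rejected only when the whole name matches a pattern, that matching is case-insensitive, and that runs of whitespace are collapsed first. The names `FILTER_PATTERNS`, `is_filtered` and `filter_names` are hypothetical.

```python
import re

# Subset of the Table 2 patterns. Stray spaces after "|" in the last pattern
# are treated as line-wrapping artifacts (assumption).
FILTER_PATTERNS = [
    r"[0-9\.]+k proteins?",
    r"[0-9]+\-[0-9]+",
    r"[0-9\.]+k?",
    r"hypothetical protein precursors?",
    r"unnamed protein products?",
    r"(conserved|conserved hypothetical|expressed|hypothetical|"
    r"hypothetical conserved|novel|predicted|putative|putative exported|"
    r"unknown)( polyproteins?| proteins?| orfs?)?",
]

# Assumption: case-insensitive matching.
COMPILED = [re.compile(p, re.IGNORECASE) for p in FILTER_PATTERNS]


def is_filtered(name: str) -> bool:
    """Return True if the whole name matches one of the filter patterns."""
    normalized = " ".join(name.split())  # collapse runs of whitespace
    return any(p.fullmatch(normalized) for p in COMPILED)


def filter_names(names):
    """Keep only the names that no filter pattern rejects."""
    return [n for n in names if not is_filtered(n)]
```

Under these assumptions, `filter_names(["TIMP3", "hypothetical protein", "25k protein"])` keeps only `TIMP3`, and a name such as `putative kinase` is kept because the match must cover the whole name.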

Table 3. Some examples of UMLS semantic categories that are related to BioThesaurus. For the complete listing, please refer to http://pir.georgetown.edu/iprolink/biothesaurus/data/UMLSSem.txt

Semantic Category ID    Semantic Type Definition
T116                    Amino Acid, Peptide, or Protein
T028                    Gene or Genome
T126                    Enzyme
T123                    Biologically Active Substance
T192                    Receptor
T129                    Immunologic Factor
T121                    Pharmacologic Substance
T109                    Organic Chemical
T044                    Molecular Function
T047                    Disease or Syndrome
T125                    Hormone
T131                    Hazardous or Poisonous Substance
T124                    Neuroreactive Substance or Biogenic Amine
T118                    Carbohydrate
T114                    Nucleic Acid, Nucleoside, or Nucleotide
T119                    Lipid
T059                    Laboratory Procedure
T045                    Genetic Function
