EHCO: Encyclopedia of Hepatocellular Carcinoma Genes Online

<p>EHCO: Encyclopedia of Hepatocellular Carcinoma Genes Online Abstract</p><p>1. Introduction Hepatocellular carcinoma (HCC), a malignant tumor of the liver, is one of the most frequent malignant neoplasms. Its incidence varies greatly with geographical location, sex, and ethnic background, and it is especially prevalent among Asian populations. Chronic infection with hepatitis B virus (HBV) or hepatitis C virus (HCV), ingestion of food contaminated with chemical carcinogens, and consumption of alcoholic beverages are major risk factors. HCC is a leading cause of cancer death worldwide, and research on its causes, diagnosis, and treatment continues into the post-genomic era. The emergence of genomic technologies has sharply increased the number of potential targets for HCC diagnosis and treatment. Microarrays allow HCC research to succeed where traditional methods have faltered. Because they can measure large numbers of mRNA expression levels simultaneously, microarrays are used to discover potential markers of HCC development, predict disease recurrence, and identify genes specifically related to HCC. Yet the research environment is far from perfect. Information on HCC-related gene sets is scattered across separate labs, websites, databases, and published literature; the same gene may appear under different IDs, names, or aliases, and it may be presented with only part of its complete annotation. To obtain the whole picture of a gene, one must visit many sources to collect its information: genomic location, sequence, homologs, pathways, protein-protein interactions, related diseases, microarray studies, etc. 
There have been several attempts to integrate cross-site information, but the results lack rich annotation and are therefore insufficient for supporting HCC research. GeneWebEx (http://www.medinfopoli.polimi.it/GeneWebEx), a software package for mining web-based biomolecular databanks that queries its collected data, lacks the user-friendliness of a web-based tool; GeneAround (http://db.aist-nara.ac.jp/genearound/), a GO-based gene annotation databank, presents results only in text format; and GENA (http://gena.ontology.ims.u-tokyo.ac.jp/), which extracts information from articles by natural language processing (NLP) and supports automatic gene extraction and lookup of gene full names, symbols, and synonyms, has little annotation related to HCC. All this calls for an infrastructure that can collect the scattered annotations, present them in a user-friendly way, and allow viewers to participate actively in its making. In this study, we join the shift toward providing web-based services and publishing organized information via the Semantic Web. Softbots, or software agents, are implemented to collect scattered gene annotations either by mining data sources directly or by querying publicly accessible databases. The focus is the design of an information-harvesting infrastructure with a flexible storage and presentation system that can develop into a content management environment supporting both human-human and human-computer interactions. The result is EHCO, an integrated biological information portal for efficient information sharing and extensive aggregation of research-related topics. EHCO demonstrates how HCC-related research can be gathered and shared among collaborators. </p><p>In the following sections, we describe the methods used to integrate different types of data into EHCO.</p><p>2. 
Materials and Methods</p><p>2.1 Collecting HCC-related gene sets The foundation of an HCC-related information database is the gene sets that have been reported to be related to HCC. To provide a comprehensive web service, EHCO aims to return structured information on the relationships between these genes and HCC reported in the literature when users query a candidate gene. Because the amount of biomedical literature available on the web is increasing rapidly, manual information extraction alone is not always feasible. Different collection methods were therefore applied, as described in Sections 2.1.1 to 2.1.3.</p><p>2.1.1 Importing HCC-related gene sets from published large-scale studies The current EHCO project contains gene sets from the Stanford Microarray Database (SMD HCC 1648, Chen et al., 2002), Neo et al. (2004), and SAGE. The SMD HCC data characterized genomic expression patterns in more than 200 samples, including 102 primary HCC and 74 nontumor liver tissue samples, and yielded 1648 genes that were differentially expressed between HCC and nontumor liver samples (p < 0.01 by Student’s t test with Bonferroni correction). Neo et al. (2004) identified the 218 genes that were most differentially expressed (p < 1 × 10⁻⁶ and at least a 1.5-fold change in expression) in tissue samples from 37 patients with HBV-associated HCC. The SAGE dataset was collected from the CGAP library using a 2-fold difference between tumor and nontumor samples as the criterion, which resulted in a gene set of 589 genes.</p><p>2.1.2 Text-mining for HCC-related genes in the literature The text-mining method consists of the following steps: (1) Acquire HCC-related literature from PubMed using “hepatocellular carcinoma” as the keyword. This study used the latest approved human gene nomenclature from the HUGO Gene Nomenclature Committee (http://www.gene.ucl.ac.uk/nomenclature/). A Perl program was written to look for HUGO-approved gene names, symbols, and aliases in the titles and abstracts of the HCC-related literature. 
In this way, a list of genes possibly related to HCC was identified. (2) The candidate gene list was verified by experts in biotechnology.</p><p>2.1.3 Gene collection by reading the published literature A total of 450 HCC-related genes were identified by manually reading the published literature. The resulting gene set is denoted in EHCO as the TableX gene set (which contains two subsets: mRNA and protein).</p><p>2.2 Handling the annotation Once the related gene sets had been collected into EHCO, the annotation handler took over. First, softbots, or intelligent software agents, were implemented to harvest gene annotation from various web resources. Then, weblinks were established, and gene-disease relationships were identified through natural language processing. Protein-protein interaction networks were then predicted. Finally, a presentation engine integrated all this information into a single user-friendly page view. </p><p>2.2.1 Harvesting annotation through softbots A softbot is an intelligent software robot that acts on behalf of the user to achieve certain goals. Given a set of resources (online websites, databases, or documents), a softbot extracts the information requested of it. In our study, individual softbots are used to mine different targets. Most of the softbots in the current EHCO were written in Perl. These programs periodically check for updates from NCBI, Gene Ontology, HUGO, etc., and download them into our database.</p><p>2.2.2 Establishing weblinks Hyperlinks to UniGene, SwissProt, OMIM, GeneCard, GO, PubMed, and other important bioinformatics websites were collected into the EHCO database.</p><p>2.2.3 Information retrieval by natural language processing To elucidate the relationship between the collected genes and HCC, this study applied a natural language processing (NLP) technique to extract information from the literature. 
To begin with, here is a sample text from PubMed ID 10632334 that contains relations of interest and illustrates the idea of automatic information extraction: “…Using semiquantitative reverse transcription - PCR for alpha - fetoprotein (afp) and albumin (alb) mRNAs, we measured the mass of malignant and nontumor hepatocytes in 53 peripheral blood samples collected preoperatively, intraoperatively, and postoperatively from 13 HCC patients … In 100% (23 of 23) of HCC and adenoma patients, alb mRNA levels increased 10 - 10(6) - fold intraoperatively and then markedly declined within 8 weeks after operation …” From the above text one can see that, to extract information related to a certain topic, the extraction tool must accomplish two tasks: (1) Named Entity Recognition (NER): recognizing biomedical named entities (NEs), e.g., afp, alb, mRNA, and HCC; and (2) Named Entity Relation Recognition (NERR): recognizing relations of interest between NEs, e.g., between HCC and mRNA level. Most biomedical named entities have no standardized nomenclature; they may appear as long compound terms (e.g., hepatocellular carcinoma) or as short abbreviations (e.g., HCC), and symbols and spellings may also vary. To handle this NER problem, an NE list was defined. The list contained the following NEs: gene, protein, mRNA, serum, hepatitis B virus (HBV), hepatitis C virus (HCV), methylation, liver regeneration, HCC, cirrhosis, fibrosis, and necrosis. Once the NER problem was tackled, we proceeded to investigate the relations between NEs. Relations are usually expressed in various verbal forms, including active voice, passive voice, nominalization, and gerund forms; some relations appear in adjective or adverb forms. The complex sentence structure of the published literature complicates the situation further. This study used a natural language parser and template-based methods to solve this problem. Our gene-HCC knowledge base system consisted of the following steps (see Fig. 
1): (1) Document retrieval/filtering (DR/DF): Documents addressing gene-HCC relations were automatically retrieved from PubMed by searching for the keyword combination (gene symbol OR aliases) AND (hepatocellular carcinoma). The documents were then downloaded. (2) Biological information extraction (BioIE): The NER and NERR systems described above were run in this step. The NER system detects whether any of the NEs in the NE list appear in the paper abstracts. If so, the NE is marked “” in the result table. The NERR system then detects whether increased or decreased expression of genes, proteins, mRNAs, or serum was reported. Table 1 shows the result table for the sample text from PubMed.</p><p>2.2.4 Identifying pathways, ontology, homology, and functional annotations Pathway information is important for understanding the functions of genes and proteins. EHCO integrates two well-known pathway databases, KEGG and BioCarta, and displays the associated pathways on each gene information page. Ontological terms for each gene are retrieved from a local copy of the Gene Ontology database, which is updated periodically by softbots. As shown in figure xxx, each associated ontological term is colored brown and traced all the way to its root term (biological process, cellular component, or molecular function). Homologies are provided from local copies of NCBI HomoloGene and euGenes. Possible functional annotations are predicted by first mapping the human gene to homologous genes in other species (using HomoloGene) and then finding matching functional annotations among existing phenotype annotations. EHCO currently incorporates phenotypic information from FlyBase, MGD, SGD, and WormBase. The deduced annotation is purely hypothetical and should be used only as a reference. </p><p>2.3 Storing and presenting the annotation</p><p>2.3.1 Annotation Engine Collected annotations were enrolled for storage via the HTTP protocol. 
Enrolled annotation strings were parsed by the annotation engine and were not published on the EHCO website until a content manager committed them. Once committed, they went into the storage service and were processed by the presentation engine.</p><p>2.3.2 Presentation Engine The presentation engine decided which annotations were shown, where they appeared, and how they were organized. A template named GeneInfo was created to customize webpage appearance. Each annotation entity was assigned category, class, and rank properties so that the manager could easily adjust the annotation content. The presentation engine also adopted the Wiki mechanism to allow a more advanced commenting system. With Wiki, website users can freely create and edit annotation pages through any web browser. </p><p>3. Results and Discussions</p><p>3.1 The architecture of EHCO The uniqueness of EHCO lies in the ability of registered users to share information. Aside from traditional browsing activities (reading webpages, downloading software, keyword searching), users are encouraged to contribute their own work to EHCO. They can send comments, submit papers, or edit webpages. In short, EHCO is an online community for HCC researchers around the world. EHCO is built on the PLONE platform, a free, open source content management system (CMS). PLONE comes with a workflow engine, pre-configured security and roles, a set of content types, and multilingual support. It supports Wiki, forum, database, and over 30 languages. The advantage of using PLONE is that EHCO can be extended beyond the context of liver cancer research: the architecture is flexible, and with slight modification it can be used to construct a knowledge base in other research fields. Liver Fibrosis, a sister site of EHCO, provides a knowledge base of genes related to liver necroinflammation and fibrosis. Both sites share the same server, the same PLONE installation, the same MySQL database, and most of the Python code. 
</p><p>3.2 HCC-related gene collection (Place the intersection diagram here and state how many genes were collected.)</p><p>Fig. 1 Gene-HCC relation knowledge base system</p><p>Fig. 2 Block diagram of the Annotation and Presentation Engine</p><p>Table 1: Information extraction results for gene alb and PubMed ID 10632334. Columns: PubMed ID, Gene, Protein, mRNA, Serum, HBV, HCV, Methylation, Liver Regeneration, HCC, Cirrhosis, Fibrosis, Necrosis. Row: 10632334  - . Legend:  : appears; +: increased expression; -: decreased expression.</p><p>Fig. 3 The architecture of EHCO</p><p>Fig. 4 GeneInfo page for the A2M gene</p><p>Table 2: Corpus. Number of genes: 1017; number of papers: 10072; number of sentences: 102968.</p><p>Table 3: Information extraction results (number of occurrences). gene: 14229; protein: 446; mRNA: 684; serum: 3836; HBV: 1365; HCV: 1031; methylation: 377; liver regeneration: 146; HCC: 14287; cirrhosis: 3439; fibrosis: 421; necrosis: 1112.</p>
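<p>The two extraction tasks of Section 2.2.3 can be illustrated with a minimal Python sketch. It is not the EHCO implementation, which combined a natural language parser with templates. The NE labels come from the NE list in Section 2.2.3; the regular-expression patterns, the single relation template, the word lists for increased and decreased expression, the "present" marker, and all function names are assumptions chosen for illustration.</p>

```python
import re

# NE labels follow the NE list in Section 2.2.3.
# The surface patterns are illustrative assumptions, not EHCO's rules.
NE_PATTERNS = {
    "gene": r"\bgenes?\b",
    "protein": r"\bproteins?\b",
    "mRNA": r"\bmRNAs?\b",
    "serum": r"\bserum\b",
    "HBV": r"\bHBV\b|hepatitis B virus",
    "HCV": r"\bHCV\b|hepatitis C virus",
    "methylation": r"methylat",
    "liver regeneration": r"liver regeneration",
    "HCC": r"\bHCC\b|hepatocellular carcinoma",
    "cirrhosis": r"cirrho",
    "fibrosis": r"fibro(?:sis|tic)",
    "necrosis": r"necro(?:sis|tic)",
}

# NEs for which the NERR step looks for expression changes.
EXPRESSION_NES = ("gene", "protein", "mRNA", "serum")

# Hypothetical direction vocabularies.
UP = r"(?:increased|elevated|up-?regulated|overexpressed)"
DOWN = r"(?:decreased|reduced|declined|down-?regulated)"


def recognize_entities(text):
    """NER step: return the set of NE labels whose pattern occurs in text."""
    return {
        label
        for label, pattern in NE_PATTERNS.items()
        if re.search(pattern, text, re.IGNORECASE)
    }


def recognize_relations(text):
    """NERR step with one template:
    <NE> [level(s)|expression] [was|were|is|are] <direction word>.
    Returns a dict mapping an NE label to '+' or '-'."""
    relations = {}
    for label in EXPRESSION_NES:
        template = (
            rf"(?:{NE_PATTERNS[label]})\s+"
            r"(?:levels?\s+|expression\s+)?"
            r"(?:(?:was|were|is|are)\s+)?"
            rf"(?:(?P<up>{UP})|(?P<down>{DOWN}))"
        )
        match = re.search(template, text, re.IGNORECASE)
        if match:
            relations[label] = "+" if match.group("up") else "-"
    return relations


def result_row(pubmed_id, abstract):
    """Build one result-table row: '+'/'-' for a detected expression change,
    'present' for an NE that merely appears, '' otherwise."""
    entities = recognize_entities(abstract)
    relations = recognize_relations(abstract)
    row = {"PubMed ID": pubmed_id}
    for label in NE_PATTERNS:
        if label in relations:
            row[label] = relations[label]
        elif label in entities:
            row[label] = "present"
        else:
            row[label] = ""
    return row


# Excerpt of the sample text quoted in Section 2.2.3.
SAMPLE = (
    "In 100% (23 of 23) of HCC and adenoma patients, alb mRNA levels "
    "increased 10 - 10(6) - fold intraoperatively and then markedly "
    "declined within 8 weeks after operation"
)

if __name__ == "__main__":
    for column, value in result_row("10632334", SAMPLE).items():
        print(f"{column}: {value}")
```

<p>On this excerpt the sketch marks mRNA with + (from “mRNA levels increased”) and HCC as present. It makes no attempt to reproduce Table 1, and a single template of this kind misses the passive, nominalized, and adjectival constructions that the parser-based method is meant to handle.</p>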
