Userscripts for the Life Sciences Egon L Willighagen*1, Noel M O'boyle2, Harini Gopalakrishnan3, Dazhi Jiao3, Rajarshi Guha3, Christoph Steinbeck4 and David J Wild3
Total Page:16
File Type:pdf, Size:1020Kb
BMC Bioinformatics BioMed Central Software Open Access Userscripts for the Life Sciences Egon L Willighagen*1, Noel M O'Boyle2, Harini Gopalakrishnan3, Dazhi Jiao3, Rajarshi Guha3, Christoph Steinbeck4 and David J Wild3 Address: 1Cologne University Bioinformatics Center, Cologne University, Cologne, Germany, 2Cambridge Crystallographic Data Centre, Cambridge, UK, 3School of Informatics, Indiana University, Bloomington, USA and 4Wilhelm-Schickard-Institut, Center for Bioinformatics, University of Tübingen, Tübingen, Germany Email: Egon L Willighagen* - [email protected]; Noel M O'Boyle - [email protected]; Harini Gopalakrishnan - [email protected]; Dazhi Jiao - [email protected]; Rajarshi Guha - [email protected]; Christoph Steinbeck - [email protected]; David J Wild - [email protected] * Corresponding author Published: 21 December 2007 Received: 31 August 2007 Accepted: 21 December 2007 BMC Bioinformatics 2007, 8:487 doi:10.1186/1471-2105-8-487 This article is available from: http://www.biomedcentral.com/1471-2105/8/487 © 2007 Willighagen et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: The web has seen an explosion of chemistry and biology related resources in the last 15 years: thousands of scientific journals, databases, wikis, blogs and resources are available with a wide variety of types of information. There is a huge need to aggregate and organise this information. However, the sheer number of resources makes it unrealistic to link them all in a centralised manner. Instead, search engines to find information in those resources flourish, and formal languages like Resource Description Framework and Web Ontology Language are increasingly used to allow linking of resources. A recent development is the use of userscripts to change the appearance of web pages, by on-the-fly modification of the web content. This opens possibilities to aggregate information and computational results from different web resources into the web page of one of those resources. Results: Several userscripts are presented that enrich biology and chemistry related web resources by incorporating or linking to other computational or data sources on the web. The scripts make use of Greasemonkey-like plugins for web browsers and are written in JavaScript. Information from third-party resources are extracted using open Application Programming Interfaces, while common Universal Resource Locator schemes are used to make deep links to related information in that external resource. The userscripts presented here use a variety of techniques and resources, and show the potential of such scripts. Conclusion: This paper discusses a number of userscripts that aggregate information from two or more web resources. Examples are shown that enrich web pages with information from other resources, and show how information from web pages can be used to link to, search, and process information in other resources. Due to the nature of userscripts, scientists are able to select those scripts they find useful on a daily basis, as the scripts run directly in their own web browser rather than on the web server. This flexibility allows the scientists to tune the features of web resources to optimise their productivity. Page 1 of 12 (page number not for citation purposes) BMC Bioinformatics 2007, 8:487 http://www.biomedcentral.com/1471-2105/8/487 Background as identifiers [13], indicating that a specific database entry The web has seen an explosion of chemistry and biology is related to the cited term in the ontology, and therefore related resources in the last 15 years: thousands of scien- related to entries from other databases annotated with tific journals, databases, wikis, blogs, and regular HTML that term. pages are available containing information relevant to chemists and biologists [1-4]. While each of those Identifiers that can be calculated algorithmically are even resources is valuable in itself, integrating information better, because they do not need to be looked up in a list from these resources increases the value even more: for of identifiers. Instead, anyone can calculate them from the example, PubChem provides a wealth of data but could be object itself. For example, for molecular structures the complemented with 3D models to create an even richer InChI [14] is the ideal replacement for database specific information source. identifiers such as the CAS registration number, the PubChem compound identifier and the ChEBI identifier. The original goal of the world wide web was to hyperlink These all require a look up or conversion table to convert individual web pages allowing humans to explore a web one identifier into another. Using the InChI, one can look of knowledge. For individual web pages these links can be up information in all databases without having to know created manually, as is still done in blogs, wikis, and static the database specific identifier. HTML pages; for large databases this is, however, not fea- sible. Userscripts are small programs that can alter the In addition to the unique identifier, one additional func- HTML content rendered by web browsers. For example, a tionality is needed to create a link to a particular database: userscript may add book prices from competitors to the the database must provide either an API (Application Pro- Amazon.com website, or may remove unwanted adver- gramming Interface) which can be queried using the iden- tisements from a site. Using the same approach, user- tifier or else provide a uniform scheme for deep linking to scripts can also solve the problem of interlinking web a web page containing information about the entry resources, by adding to web pages of one resource dynam- behind the identifier. For example, looking up structures ically generated hyperlinks into another. By selecting a in PubChem is done with a scheme in which the InChI is specific set of userscripts, the user can tune a website to embedded verbatim. To look up the structure of methane provide all kinds of facilities not anticipated by the origi- (InChI=1/CH4/h1H4), the URL nal author of the site. For example, userscripts have been http:www.ncbi.nlm.nih.gov/entrequery.fcgi?CMD=searc used in bioinformatics to enhance the iHOP web page [5]: h&DB=pccompound&term=%22InChI=1/CH4/h1H4%2 the script extracts user assigned tags from a third party 2[InChI] is used. resource, and shows them as a tag cloud on iHOP pages for particular genes. The plethora of resources is overwhelming, and both users and database developers may have preferred subsets, e.g. Automatic hyperlinking is only possible though the use of more trusted, resources. It is therefore worthwhile to have unique identifiers such as the PDB ID, the CAS registra- a system that allows users to choose which resources they tion number and, more recently, the IUPAC International want to have linked with which other resources. User- Chemical Identifier (InChI). While identifiers are easily scripts provide the necessary technology to allow this used to connect databases, such as done in the SRS system within web browsers. Here we describe several userscripts [6] or in meta database software like BioWarehouse [7], we have developed to create links between web resources the sheer number of web resources makes it impossible of interest to researchers in the life sciences. integrate all resources. Consequently, (bio)chemical search engines, such as ChemSpider [8] and tools to har- Implementation vest information from web resources, such as ChemX- We use the following techniques to link various web treme [9] and BioSpider [10,11], as well as systems that resources in this paper: userscripts, unique identifiers, standardize algorithmic access to resources and services, microformats, and web resource interfaces. The following such as BioMOBY [12], have emerged. sections describe how these are used in this work. Another reason why identifiers do not always allow link- Userscripts ing resources is that many of them are database specific, A userscript is a small program written in JavaScript that is such as the PDB ID and the Digital Object Identifier automatically run within a web browser (often by a plugin (DOI), and sometimes even restricted in being used, as or add-on) when the user accesses pages that match a par- with the CAS registration number. Open standard identi- ticular URL. Userscripts allow the user to modify the fiers address this problem. Such identifiers can be derived HTML content of a web page on-the-fly, by adding or from ontologies, dictionaries, encyclopedia, or computed removing elements or by moving them around. For exam- by an algorithm. The Gene Ontology terms are often used ple, userscripts exist that remove pop-up advertisements Page 2 of 12 (page number not for citation purposes) BMC Bioinformatics 2007, 8:487 http://www.biomedcentral.com/1471-2105/8/487 from web pages, and that alter the Amazon.com web page expressions to find certain strings in the text of the web to provide book prices from alternative suppliers. A repos- page. This works particularly well for identifiers with a itory of userscripts exists at userscripts.org/citeuserscripts- unique and well described syntax. For example, a regular dotorg. Chemists and biologists can find relevant expression for InChIs will have fewer false positives than userscripts by searching with the terms "chemistry" or one for PDB identifiers. "biology". As with any program that you run on your computer, it is Of the popular web browsers, only Opera provides built- important to consider security when installing userscripts. in support for userscripts (referred to as 'User JavaScript'). Although the security model used by Greasemonkey pre- To enable userscript support in other browsers, a third- vents attacks by malicious websites, it is unable to detect party extension needs to be installed: Greasemonkey [15] or prevent the user himself installing a malicious user- for Firefox, Creammonkey [16] for Safari, IE7pro [17] or script.