Improving Integration Effectiveness of ID Mapping based Biological Record Linkage Hasan M Jamil, Member, IEEE

Abstract—Traditionally, biological objects such as , , and pathways are represented by a convenient identifier, or ID, which is then used to cross reference, link and describe objects in biological databases. Relationships among the objects are often established using non-trivial and computationally complex ID mapping systems or converters, and are stored in authoritative databases such as UniGene, GeneCards, PIR and BioMart. Despite best efforts, such mappings are largely incomplete and riddled with false negatives. Consequently, data integration using record linkage that relies on these mappings produces poor quality of data, inadvertently leading to erroneous conclusions. In this paper, we discuss this largely ignored dimension of data integration, examine how the ubiquitous use of identifiers in biological databases is a significant barrier to knowledge fusion using distributed computational pipelines, and propose two algorithms for ad hoc and restriction free ID mapping of arbitrary types using online resources. We also propose two declarative statements for ID conversion and data integration based on ID mapping on-the-fly.

Index Terms—ID mapping, on-the-fly data integration, data fusion, declarative query language, computational pipeline, workflow. !

1INTRODUCTION using ID mapping tools such as DAVID [3], OntoTranslate [4], Ata integration in life sciences is frequently challenged Synergizer [5], GeneID Converter [6], IdBean [7], IDMapper D because of relying on identifiers or biological IDs tradi- [8], g:Profiler [9] and MatchMiner [10]. The materialized tionally used to represent objects such as organisms, cells and views must then be maintained by monitoring possible ID products, and concepts related to them such as functions, changes, migration, and relationship changes among the ob- pathways, and annotations. Such IDs are so prevalent in life jects, occurring in databases they do not control, or risk pro- sciences that almost all biological databases use them as ducing inconsistent and obsolete results, perhaps even facing shortcuts for representing objects they store and manage for loss of functionality [11]. These seemingly lower level ID public use. These IDs are also often used to assemble larger management issues contribute to a much larger and even more biological systems and represent more complex concepts. important information management problem in life sciences, The departure from value based object representation (as called ID mapping and object linkage. in relational databases) to ID based object representation To fully understand the complexity that the existing practice creates numerous complications, including the potential for creates for applications aimed to gather information, consider object duplication (as values are no longer unique, i.e., two the case of the GeneCards entry for the eukaryotic translation objects with identical values but different IDs are distinct initiation factor 2-alpha kinase 3 gene, commonly called though semantically equivalent), ID obsolescence (invalid IDs PERK. Under current practice, GeneCards assigned the ID due to discontinuation) and ID migration (assignment of new EIF2AK3 to PERK and described it as a coding gene IDs). All three types of problems persist largely because in human 2 that is about 70,836 base pairs long objects undergo routine formal curation processes and keep at genomic location 88,856,259 bp from the p-terminal on the getting perfected as new knowledge is gained and errors are minus strand. In the Ensembl database, however, the same removed. Furthermore, new IDs are assigned to objects as they gene has the ID ENSG00000172071, and has been assigned evolve. Slow or inadequate coordination between the different the IDs 3255 in HGNC, 94512 in Gene, 6040325 in databases can lead to chaotic cross referencing and erroneous OMIM and Q9NZJ53 in UniProtKB, even though they all knowledge space that users must then negotiate to access the represent the same biological object. It is therefore necessary needed information. for researchers to establish correspondence among these object One complex stubbornly persistent problem is inconsistency IDs if they wish to aggregate information about PERK that related to data warehousing type applications and services such is available only in each of these databases. When the scale as GeneCards, BioMart, RefSeq, UniGene, SWISS-PROT, and frequency of such tasks become overwhelming, automated PIR, GenBank, and EMBL as a by product of updates in tools which are able to effectively and efficiently aid in the materialized views [1], [2]. These warehouses often materi- aggregation process become handy. While existing data inte- alize relationships among objects represented using such IDs, gration and record linkage research has been effective in many which are either gathered from other databases or computed cases, research focusing on record linkage for aggregation based on biological IDs is either absent or scarce at best. • H. M. Jamil is with the Department of Computer Science, University of Our focus in this paper is on this neglected issue from the Idaho, Janssen Engineering, Room 236, Moscow, ID, 83844, USA. point of view of quality “best-effort” data integration [12], [13] E-mail: [email protected]. as opposed to generating ID mappings in ways systems such

Digital Object Identifier no. 10.1109/TCBB.2014.2355213 This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 2

as Synergizer, GeneID Converter, IDMapper, g:Profiler and MatchMiner do. Therefore, our goal is improving the quality of data integration and aggregation that relies on ID mappings generated by these mappers, as well as the warehouses, such as PIR, GeneCards, BioMart and UniGene, that maintain them. We identify two principal factors that impact the quality of integration – materialized view maintenance and access methods to mapping for information aggregation. We have addressed the issue of materialized view maintenance in an (a) Discontinued Entrez to HGNC mapping. earlier paper [11] as part of an ongoing research, and hence, though tangentially relevant, it will not be discussed here. The main focus is the access method to ID maps and developing a declarative approach to record linkage for data integration.

1.1 Warehouse Maintenance and Update Errors To highlight maintenance problems of ID mappings among objects in the warehouses, consider yet another case of the gene SMCR. Until recently, Entrez assigned it the ID 6600 and (b) Current Entrez to HGNC mapping. cross-referenced it to HGNC that had the ID 11113, as shown in figure 1(a). Assume the ID mapping warehouse, GeneCards, is trying to maintain this correspondence. As of January 6, 2013, not only have IDs 6600 and 11113 been discontinued, they have been assigned new Entrez ID 10743 and HGNC ID 9834 respectively, and the name has been changed to gene RAI1 (retinoic acid induced 1) as shown in figure 1(b). Several observations can be made in this regard. First, while the mapping from Entrez to HGNC IDs can be found using (c) Current HGNC entry for 9834. the Entrez database, the converse is not true, as shown in figure 1(c). In [4], Draghici et al pointed out that the accuracy Fig. 1. Illustration of ID discontinuation and warehouse of mapping depends largely on the choice of database. They maintenance hurdles. also argued that in theory, these databases are supposed to provide the most up-to-date ID mapping information and cross are involved in unfolded protein response, and to identify their reference each other, but they often fail to perform as expected chromosomal location as part of a neurodegenaration study as in the case of SMCR gene above. So, depending on the path involving the gene PERK. The process he follows involves two a user chooses, he might miss the mapping if he started with steps. First, he collects all entries from GeneCards database HGNC and thus, lose any chance of getting the annotations for 2 and 22, and distills into four columns, he needs from other databases that use Entrez ID to respond shown partially in the table in figure 2(a). He then submits to queries. Unfortunately, there are many such lapses. a keyword query in Affymterix database with terms “(mouse Second, from a maintenance standpoint, it is extremely or human) and (protein, unfolded, response, translation, tran- difficult to find out that IDs have changed and so should the scription, regulation)” to mean that he is interested in human mapping, unless we actively monitor these databases to dis- or mouse probsets that are related to molecular function or cover potential change in IDs. Unfortunately, these databases biological processes involving unfolded protein response or do not make ID migration at all computationally obvious to aid translation/transcription in general. Assuming that a semantic maintenance, thus making discovery difficult. The only practi- search was adopted by the query processor in the Affymetrix cal choice is then to recompute or download entire mappings database, we can expect it to return the partial table shown in periodically to stay current. Research in change management figure 2(b). Depending on how the search engine functions, if a has mainly focused on shallow web [14]–[17] and general view stricter key word search, or a high enough similarity threshold maintenance [2], [18], and is not directly applicable to ID was used, the tuple corresponding to the probe ID 211635 x at maintenance. In [11], we pursued a more active approach that probably will not be included in the response. blended continual querying and view maintenance to discover Regardless of the semantics used and which set we have such migrations with an aim to minimally and incrementally in table 2(b), it is impossible to establish the link be- maintain materialized views of ID maps as one of our ongoing tween the tuples without knowing the correspondence be- efforts. tween the IDs in these two tables, i.e., through the columns Gene ID and Probe Set ID. In this example, 1420011 s at 1.2 Information Aggregation using ID Conversion and 200670 at are two Affymetrix probes for the gene XBP1, Consider a task in which a researcher is trying to link genes while 11722147 at is a probe for PAIP2B. So, these three in either mouse chromosome 22 or human chromosome 2 that probes must join with these two genes. The probe 211635 x at

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 3

Gene ID Gene Name Start Location Chomosome XBP1 X-box binding protein 1 29,190,543 bp pterm 22 PAIP2B poly(A) binding protein interacting protein 2B 71,409,868 bp pterm 2 PERK eukaryotic translation initiation factor 2-alpha kinase 3 88,856,259 bp pterm 2 ...... (a) Gene Information from GeneCards database

Probe Set ID Organism Molecular Function Biological Process 1420011 s at Mouse protein dimerization activity unfolded protein response 11722147 at Human translation repressor activity neg regulation of translational initiation 200670 at Human transcription factor activity regulation of transcription 211635 x at Human receptor activity protein-chromophore linkage 210513 s at Human growth factor activity pos regulation of transcription from RNA 1453505 a at Mouse translation initiation factor response to unfolded protein ...... (b) Probe Set Information from Affymetrix database

Fig. 2. Gene product information.

is not related to any of the genes in table 2(a) (it is related [4], [28] show that depending on the context and type of ID to gene IGHA1); neither does it have any of the desired conversion sought, many of these well known converters and properties we are looking for, although it may have matched warehouses may perform poorly and eventually lead to the the selection threshold, while 210513 s at though functionally integration lapses discussed above. In an earlier research in relevant, is related to gene VEGFA (not shown in this table). 2006 that investigated the quality of mapping by ID conversion Most interestingly, the probe 1453505 a at is annotated to be warehouses, Draghici et al [4] point out that only 69.53% of related to gene EIF2AK3 in the Affymetrix database. Hence, the genes in UniGene can be mapped on Entrez Gene entries. even though we know that functionally, it is relevant, we are Furthermore, only 43.54% of the IDs in Entrez Gene could be not able to relate it to PERK without knowing that it is an mapped back to UniGene. An even more striking example alternate ID for EIF2AK3. Therefore, we are not only required was the mapping between GenBank dbEST and GenPept. to establish the map between 1453505 a at and EIF2AK3, we GenPept was supposed to contain the protein translations of also need to generate the transitive map between 1453505 a at the sequences in GenBank dbEST, and so mapping between and PERK via EIF2AK3. In other words, the traditional record these resources should have been relatively simple. Draghici linkage [19]–[25] and mapping techniques largely do not help, et al also discovered that 91.6% of the entries in dbEST could and a higher level mapping method is warranted. be mapped to GenPept entries while only 1.82% of the entries In particular, the current database approach to such cross- could be mapped from GenPept to dbEST. linking is application specific and requires user knowledge In our recent investigation [28], we compared eleven differ- that must be applied as part of the application semantics. One ent publicly available ID conversion tools – DAVID, BABE- of the two recent approaches to record linkage operation in LOMICS [29], g:Profiler, Clone/GeneID Converter, NetAffx autonomous data integration [26] relies on a semantic join [30], CRONOS [31], IDBean, The Synergizer, BioMart, based on key discovery [27]. This approach relies on a lower bioDBnet [32], MADGene [33], GeMDBJ [34], and Match- level discovery of key columns and join column identification, Miner – to determine the quality of their mapping over six and ultimately depends on a natural join operation on the different genomes, i.e., Mouse, Human, Drosophila, Plant, selected columns. As the preceding example suggests, record Yeast and Bacteria, for some of the most commonly performed linkage in the specific problem in figure 2 requires an auxiliary conversion types such as Affymetrix to Ensembl, Affymetrix ID mapping step using a database or ID mapping tool, and to HGNC symbol, Ensembl to Unigene, among others. These then taking a natural join between the two tables using the tools were selected after an extensive search (including ref- mapping just obtained, which is fundamentally different from erences of relevant papers) in PubMed and Google Scholar, existing record linking technologies advocated in the research which allowed the use of more than two genomes and more mentioned above. Therefore, users currently rely on a manual, than two common types of conversions. and off line ID mapping step, and translate the correspondence We extracted data sets of relevant ID lists of microarray into a relationship to cross reference, a method that is slow experiments from supplementary files of published original and error prone, prohibitive in terms of cost, and results in articles. We chose 100 randomized IDs from each data set, and significantly low quality data integration and aggregation. submitted those lists to the different converters using different conversion types. Results from each tool and conversion type were retrieved, stored, and compared. Performance was 1.3 Accuracy and Quality Implications defined as the percent of successful conversions (from the Conversion systems and tools take significant steps to fol- total submitted). The variability among genomes and tools was low well defined procedures to map IDs and make such determined by statistical two way ANOVA test, using by R ver- conversions accurate, while mapping warehouses maintain sion 2.1.3.1. In these tests, if p-value was less than or equal to them as accurately as possible. Despite the fact that many the significance level α (0.05), significant differences between of these systems are renowned and highly accessible, studies the variance of the treatment was concluded. The best and

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 4

worst tool for genomic ID conversion in each organism was determined according to the mean of successful conversions of each tool. Inconsistencies among genomes and tools were defined as the number of different IDs, for the same probe, between the different tools used. This was done for each data set for all genomes used in this study, and the analysis was presented as the percentage of inconsistencies in each genome and tool (from the total of successful conversions). The summary of our findings, shown in figure 3 in three (a) Affymetrix to HGNC mapping. charts, point to a systemic problem. Figure 3(a) shows that DAVID was successful in converting 97% of the IDs while GeneID Converter only mapped 47% for Affymetrix to HGNC type ID conversion, while the rest fall in between. As shown in figure 3(b), across the six chosen genomes, DAVID over- all performed well, while NetAffx performed flawlessly in bacteria. Figure 3(b) shows the best and worst performing conversion tools across six genomes for Affymetrix to gene symbol conversion and amply illustrates the fact that not (b) Six genome mapping. all mappers are good for all types of conversions on all genomes. For example, BioMart performed better on yeast than plant genome. Therefore, the overall performance of the tools shown in figure 3(c) for Affymetrix IDs to Ensembl and gene symbol conversions should not be surprising. It shows (in green) DAVID and NetAffx performing at 100% and 99.5% respectively1, while the worst performing tool for these conversions was Clone/GeneID2. Finally, IDBean, Synergizer and BioMart performed almost identically at 76% (c) Performance. in all genomes except plant and bacteria3. Clearly, mappings that are theoretically both meaningful Fig. 3. Mapping Statistics for 100 IDs. and useful cannot always be performed just by querying “authoritative” resources. These examples should adequately demonstrate that ID mappings cannot be done casually, by link operations because current technologies such as entity ad hoc, need-driven queries, or “quick-and-dirty” Perl scripts, resolution and key discovery fail, leaving users to deal with as most researchers currently do. They also attest to the fact this issue manually. Although a fairly common problem, that regardless of which site, who does the mapping, and how current integration efforts such as Galaxy [36], BioMoby [37], authoritative the sources are, there are still gaps that users must and BioMediator [38] do not address this issue. account for. That is exactly why users look for more accurate, While biologists are largely accustomed to using graphical complete, and reliable mappers, and often end up sifting interface based systems to access and manipulate data and through multiple converters to design their own ID converter implement computational pipelines, the flexibility and power which they believe is accurate for the needed analysis. This is of declarative languages in computational biology is well- a serious problem because there is no way of knowing which known [39], and many computational biology systems (e.g., database is the right choice for the most current information Galaxy [40], LifeDB [26], Taverna [41]) have been devel- without intimate knowledge of the database. As we discuss in oped using powerful declarative languages such as SQL and section 3, discovery and usage of these mappers is by no means XQuery. In the spirit of strengthening data management and easy, and users must progress through many steps before they integration using linguistic artifacts so that such constructs can accept a mapping for their research, mainly because there may be used as a primitive for higher level system design and is no guarantee of accuracy, quality or completeness. autonomous data integration, we pursue a declarative approach to ID mapping based on which graphical tools that biologists are more comfortable with, can be developed. 1.4 Our Contributions We thus identify ID mapping across biological databases The lack of quality in data integration due to ID mapping as a basic and primary component needed for quality data manifests itself in the form of a failed link (a form of join integration, and develop a declarative means to facilitate such [35]) operation or missing tuples in apparently successful mapping in an ID mapping warehouse called MapBase.We then exploit the autonomous nature of this declarative ID 1. We must note here that NetAffx shows poor performance in mouse (at mapping construct to propose another declarative statement 50%) and human (at 79%), individually. for record linkage to facilitate data integration and data fusion 2. Note that it is available only for mouse and human IDs respectively at 77% and 82%, and its performance in these genomes is not particularly good. directly. Our hope is that the abstractions supported in these 3. Again, we note that this tool is not applicable for these genomes. two statements will help biologists and application developers

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 5

build high level pipelines in traditional query languages such 3IDMAPPING FOR DATA INTEGRATION as SQL, data integration languages such as BioFlow [35], or In [26], we introduced a new ad hoc and on-the-fly data integration systems such as Galaxy, in an unprecedented way. integration system called LifeDB based on a new query language called BioFlow. LifeDB adopts “best-effort” [12], [13] information integration and extraction, and recognizes the 2RELATED RESEARCH fact that some queries may fail due to the method used for schema matching and table extraction. However, it liberates ID mapping has been a low key but ongoing effort for quite users from being forced to use specific databases, or in our some time. To overcome the inherent limitations, researchers case, specific conversion tools that are only applicable to have developed mechanisms to map IDs using software, specific ID types, thereby limiting their applications and ability although substantial efforts are ongoing to include authori- to explore. We subscribe to the idea that all databases are tative mapping information in major life sciences databases limited in their information content. We also believe that such as GeneCards, BioMart, RefSeq database, UniGene, in a constantly changing domain such as life sciences, new SWISS-PROT, GenBank, and EMBL. These databases have resources will appear, and researchers must be able to exploit been cross-referencing such mappings all along, even though those resources without delay. The lag time to catch up with completeness and quality are lacking. Unfortunately, these emerging knowledge for leading databases such as GeneCards, efforts are vulnerable to the ID management related problems PIR, etc. is too great and thereby impedes research. discussed above, often failing to include or maintain mappings, In view of the above discussion and in the interest of free and by design are not always comprehensive. In recognition form on-the-fly data integration involving arbitrary sites, we of this pressing problem, fortunately, providing explicit and advocate an on-the-fly ID mapping system that will not be improved support for ID conversion is gaining momentum. limited to specific ID types, as many systems currently do. While details of the actual mapping method for many We also do not advocate that all mapping be materialized conversion systems are not of much importance, their general ahead and maintained, as do GeneCards and others. The characteristics are. Most of these converters limit the type maintenance cost of such materialized mappings is significant. of IDs they can convert, and the databases they use to get In [11], we have presented the complexities associated with ID mapping information. Most rely on actual downloading and mapping changes and migration. It was observed that without maintenance of the source databases, making them slow to a very close collaboration with the source databases, it is very respond to change, error prone due to view management, difficult to maintain such mappings. One of the principles and incapable of including new knowledge. The most recent we have adopted in LifeDB is site autonomy, and we do not systems such as IdBean rely on complementing multiple sites require any coordination or cooperation from online databases, to account for missing mappings. In a way, their approach except unrestricted public access. is similar to ours; unlike IdBean, however, we do not need to limit ourselves to five sites, nor do we need to develop specific mapping algorithms to collect information. Similar 3.1 Active Mapping to IdBean, GeneID Converter uses six databases (Ensembl, An active mapping refers to a specific mapping request by NCBI, combined UniGene and PubMed, human chromosomal an application such as mapping HGNC IDs to UniProt IDs, location from UCSC, KEGG pathway, and Reactome) as or HGNC gene names to NetAffy IDs. In other words, in its source and thus, is not capable of new and emerging active mapping, the type of conversion is known. Since we capabilities. are interested in a real time online conversion, we adopt the From a functional standpoint, we are contrasting the wis- following method. dom of interacting with a limited number of “authoritative” Let U be the set of all possible ID conversion sites, and T databases through their graphical interfaces with the con- be the set of all possible ID types. We create an index Σ as a T comitant accuracy issues that plague this domain for specific function Σ : U × T → 2 such that every site maps a given types of mapping using exact formats they support, with ID type to a set of ID types. For example, GeneCards maps a source independent and restriction free universal database a gene ID to several different ID types including HGNC, En- using declarative language as queries. We are unaware of semble, OMIM, NetAffy, GO, UniProt and so on. It is usually any such effort that has addressed record linkage and data the case that a single database is the authority of a particular integration using linguistic constructs. In our unique approach, ID, and other databases or conversion services use their IDs we provide ID mapping as an abstract service that can be to map to other IDs. For example, UniProt is the authoritative considered as given, and used to retrieve mapping according database for UniProt IDs, and GeneCards is a service provider. T to the quality criteria needed. We ensure that the underly- To capture this, we assume that a mapping Θ : U → 2 exists ing implementation preserves that view and computes the such that ∀u,v(u ∈ U ∧v ∈ U ⇒Θ(u)∩Θ(v)=0/), i.e., no two mappings based on the semantics inherent in user queries. databases share authority over any ID type. We also assume The abstract view in MapBase frees the user from making that a priority relation among the conversion services exists educated guesses and then falling into the trap of undesirable such that whenever for types t1,t2 ∈ T , and sites u,v ∈ U , results, application specific program development, database t1 ∈ Σ(u,t2) and t1 ∈ Θ(v) hold, v u holds, i.e., v is more maintenance, or rudimentary format translations and sifting authoritative than u. To look up services, we also have an index and searching for target information. that lists sites based on the ID mappings they perform as the

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 6

U function Ω : T ×T → 2 as a dual of the function Σ. For the the accuracy and reliability of a specific converter. Algorithm sake of consistency, we require that whenever Ω(t1,t2)=u, 1 can be adapted to enforce this option simply by setting we have t2 ∈ Σ(u,t1) as well. S = {c}, and testing if Σ(c,t) ⊇{t1,...,tk}. An example of Given Σ and Θ, we are now able to define a generic ID convert statement is mapping function μ : DT → DT that users utilize as the convert r into UniGene, Entrez, EnsemblLoc, UCSCLoc, algorithm in figure 1 where DT is a domain of type t ∈ T . Affymetrix using GeneID, UniProt; The actual mapping function for each of the converters is For this statement, we expect our system to simulate the Ξ U ×  represented as the partial function : DT DT .In interaction shown in figure 4(a) and return the ID maps shown this function, we are assuming that a converter at site u is in figure 4(b) for the GeneID Converter when r contains the an authority on type t, then every mapping by other sites gene name OAZ1. Note that the output in figure 4(b) shows is overridden by the mapping of the authoritative site, i.e., missing mappings for attributes Ensembl, Affymetrix, Gen- Ξ(s,i) = ⊥ when ∃s,s ∈ S ∧ t ∈ Θ(s) holds. Algorithm 1 Bank, and PDBid (when included or selected). Additionally, returns a set of IDs of type t that are conflict free from the since UniProt is also included in the statement, we expect it mapping by the authoritative site. to return maps from its site too (not shown).

Algorithm 1: ID Mapper μ. 3.2 Passive Mapping Input:IDi of type t, and functions Ω,Σ and Θ. Output: All IDs j of type t . Passive mapping is fundamentally different, where given two

1 Let S = Ω(t,t ); IDs i and j, we are required to establish a mapping relationship 2 Let J = 0;/ 3 if ∃s,s ∈ S ∧t ∈ Θ(s) and Ξ(s,i) = ⊥ then if it exists. In other words, we are required to conclude for 4 Let J = Ξ(s,i); what sites s, j = Ξ(s,i), i.e., ∃s(Ξ(s,i)= j). We assume that 5 else ∈ the the type of IDs are known. In case we do not, we can 6 for every s S do τ  T 7 Let J = J ∪ Ξ(s,i); make use of a partial type assignment function : D that associates a type with an ID, if it exists. 8 return J; We emphasize that the type assignment function τ is critical for passive mapping as many databases assign IDs that are We leverage this algorithm to propose a declarative state- structurally identical. For example, Entrez, OMIM and HGNC ment of the form IDs are often indistinguishable without knowing the domain of

convert [refresh] r into t1 [any|one|all], ..., tk [any|one|all] the IDs (e.g., for the gene EIF2AK3, GeneCard lists HGNC: [using c] [for consensus]; 3255, Entrez Gene: 9451, and OMIM: 604032 as external IDs). Some databases follow distinctive protocols in naming IDs. where r is a unary relation of domain Dt , t1,...,tk are type names, and k ≥ 1. The result is a relation r ⊆ D ×D ×...× For example, GO (annotates IDs with “GO:”, GeneCards (uses t t1 χσ χ D . The options [any|one|all] allow mapping elements in r the protocol GC Kb where is the chromosome number, tk σ to either any available ID, exactly one, or all possible IDs is the strand Plus or Minus and Kb is the start position in respectively. In particular, option any will find one possible kilo base pairs) and Ensembl (annotates IDs with “ENSG”). map, whereas all will force MapBase to find a complete set of Therefore, it is important that we develop a robust mechanism τ mappings from all possible converters4. However, option one and algorithm for . An alternative is to create a hierarchical will be successful only if exactly one unique mapping is found type system similar to GO to organize ID types, assign those in all possible converters. These options are used for mapping types to all IDs explicitly, and discourage using Integer as type consistency and will have significant influence on the design for IDs, such as HGNC, Entrez and OMIM, for instance. of MapBase, as will be discussed in section 4. If no options Without a stricter convention and enforcement protocol, are specified, any is assumed. there is no guarantee that the pages returned by databases as The refresh option is available to improve efficiency. As query response will maintain such type/database annotations. we discuss in section 5.2, mappings are computed using a lazy Even if they do (such as in GeneCards), there are no guarantees differential computation model in which validated mappings that wrappers that extract the IDs will be able to recognize warehoused in MapBase are returned if the semantics de- them. As yet, we are unaware of any wrapper that can parse manded by the query warrants them. But, in reality mappings a GeneCards or NetAffy page to extract target information. may have been invalidated by changes or emergence of new These are extremely complex pages filled with information information since the mapping in MapBase has been ware- about gene products. Furthermore, they do not return simple housed. Using the option refresh will force a new computation tables as responses. Therefore, the sophistication demanded by to avoid such obsolete mappings. many of these sites is quite high. τ Technically, we compute relation r as the expression Once we have a robust , algorithm 2 can be used to see if  ( ) ( ) ( ) ×μ( ) t1 ×...×μ( ) tk μ( ) ti μ( ) two IDs are related as an implementation of the find function ∀u(u∈r) u u u , where u means u U φ : D × D → 2 . The algorithm may be adapted to suit with respect to type ti using algorithm 1. The using option may be used to compel conversion using a specific converter. specific application needs. For example, a relationship exists φ( , ) = This option becomes handy when the user is familiar with if i j 0./ It does not particularly look for authoritative converters as does algorithm 1. We can do so by requiring 4. If using option is used, from all included converters in using clause. relationship be maintained.

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 7

(a) Input interface showing submission of gene name OAZ1 and needed IDs selected.

(b) Output produced for OAZ1 and selected ID maps (also showing missing maps).

Fig. 4. Mapping gene name to various IDs using GeneID Converter.

Algorithm 2: Testing correspondence of IDs. can fully and universally serve all ID conversion needs. Input: IDs i and j, and functions Ω,Σ and Θ. Therefore, the overall functional architecture of MapBase as Output: Set of converters C. shown in figure 5 takes a dynamic, incremental approach to = 1 Let C 0;/ ID conversion and maintenance by building a database of map 2 Let S = Ω(τ(i),τ( j)); 3 for every s ∈ S do pairs on demand. The core of the system involves an ontology 4 if j = Ξ(s,i) then of biological IDs (such as HGNC, GC, ENSG, etc.), a database = ∪{ } 5 Let C C s ; and a query interface. The interface allows users to supply 6 return C; a homogenous set of IDs, and select a set of ID types to which the supplied IDs are to be converted, as well as an optional set of ID converters to be used for the conversion. Similar to the convert statement, we can now propose the Users are also allowed to choose a conversion cardinality such following extend statement for two relations in a way similar as1:1,1:M, and M : N. The system is also augmented with to θ-join operation. a knowledgebase of “all” online ID converters in which we maintain the location, capabilities, specialty, and accuracy rate , θ extend r s where [using c]; of each of the converters. The condition may include the usual boolean conditions simi- lar to relational model. If θ includes solely boolean conditions, 4.1 Ontology of IDs extend reduces to θ-join. But if it includes boolean mapping relationship conditions of the form map(r.A,s.B), then the join The ontology is a directed acyclic graph in the form of is additionally based on satisfaction of these conditions using an is-a hierarchy that classifies biological IDs into mutually algorithm 2. If using option is used, we enforce it simply by non-exclusive classes, such as sequences, proteins, pathways, setting S = {c}, and testing if Σ(c,τ(r.A)) ⊇{τ(s.B)}. genes, HGNC, GC, etc. Each class in the hierarchy serves as a grouping mechanism and points to a list of converters in the knowledgebase that is known to handle conversions of 4IMPLEMENTATION OF MAPBASE IDs of the class. These pointers are then used as indices to The design of MapBase is based on the premise that no single search for appropriate converters during query processing. To ID conversion database such as PIR, GeneCards, or BioMart construct possible candidates for conversion of a map pair, a

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 8

list intersection operation is carried out between sets of lists are polled from a set of databases identified as authorities, in the knowledgebase for each converter and the class list, in and organized in a priority order using a scheme similar to order to generate the actual target converters. IDChase [11]. This warehouse maintenance process ensures that no invalid IDs are ever mapped to ensure accuracy of

Ontology mappings returned as a query response.

Hash Index Updates of ID Converters ID Hierarchy

Query Processor 4.4 Query Processor

Differential Map Online Query Generator Processor A direct implementation of the MapBase query processor obviously requires at least the machineries to access online converter databases, submit IDs to be converted, and collect Validator Materialized View

Map Queries Response of ID Maps the returned mappings. Traditionally, two approaches may be adopted for the implementation of the two MapBase state- Integrator ments convert and extend– (i) dedicated and manual inte- Provenance Manager

Graphical User Interface Online gration of databases through cooperation, and (ii) employing ID Converters autonomous engines that require assembling data integration Online Queries technologies such as schema mapping, form filling, wrapper Fig. 5. MapBase architecture and components. generation and HTML page scrapping that will not demand site cooperation. While the former approach is expensive and straightforward, the latter approach is novel and cheaper 4.2 The ID Converter Knowledgebase though not entirely robust. For an autonomous approach, it We maintain a lazy knowledgebase for each online conversion is not difficult to imagine the following steps as a direct system in the form of a resource description. Each description implementation scheme for the convert statement. includes the address of the converter, the set of inputs it 1) identify the mapping capability of a converter (e.g., GeneID Converter [6]) from the resource description, and select converters. requires to convert IDs of a particular type, and the output 2) map query variables to converter entry form variables using a schema it produces, all in standardized terms in the ID Ontology. In matcher (e.g., PruSM [42]). this entry, we also maintain if this converter is an authoritative 3) using a form filling function (e.g., iForm [43]), submit the input IDs to the converter interface. site for a set of particular ID types so that all searches for 4) collect the mappings, and extract them using a wrapper (e.g., [44]). relevant ID types may begin at this site. Additional details 5) using a schema mapper, identify the required objects (columns in a also maintained in the form of meta-data include web service table) and project them as responses. standard, system response time, accuracy rate, and availability A similar algorithm can be devised for the implementation of of updates on ID migration. the extend statement, and both can then be implemented as an extension to a relational or an XML database. 4.3 Materialized Map Pair Database An alternative to direct implementation is to use a query In order to improve response time and reduce network latency, transformation based implementation strategy over an existing we materialize all query response in the form of a map pair system. Such a translation based implementation has been of IDs in the form of a quadruple σ,δ, χ, μ where σ is the exploited in many database query language implementations as source ID or the domain, δ is the destination ID or the range, an acceptable low cost alternative [45], [46]. In this paper, we χ is the cardinality of mapping (i.e., 1 : 1, 1 : M,orM : N)5, and follow this approach to implement MapBase to demonstrate μ is the converter used to map the ID. Request for ID mapping the feasibility of our system, using BioFlow [35] as our host is preferentially served from this map pair database, and only language. Recall that our main focus for MapBase is to support the mappings unavailable in this database are computed in rapid assimilation and integration of new online ID mapping real time. Since we materialize the mappings, we also adopt a systems with minimal user efforts. From this standpoint, the provenance procedure to guarantee that materialized maps are query translation approach is prudent and cost-effective in valid on an on-demand basis. ways similar to many high profile systems that have enjoyed Similar to most provenance models, we rely on ID change acceptance in the past [47], [48]. In fact, despite their relatively logs of objects that are materialized and maintained in the slower performance compared to a direct implementation, map pair data warehouse. We assume that systems that assign language translation has often been the de facto standard for IDs are also publishing the change logs users require for testing language properties [49], [50] due to the negligible the purpose of provenance. In section 4.4.1, we will discuss implementation cost, especially as a research prototype. how provenance computation depends on the change logs and the query being processed. Change logs, on the other hand, 4.4.1 Query Translation

5. While all other mapping cardinalities are well defined and examples While a complete exposition of the translation algorithms and abound, it is probably arguable if M:M mappings are meaningful or if even their design principles are not included in this paper, sufficient exist. In MapBase, however, we do not make that decision and let users decide. MapBase simply computes the results according to the semantics assigned by details are presented below for readers to have an overall the ID converters and demanded by the query. understanding of the processes involved.

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 9

4.4.1.1 BioFlow in Brief: For our purpose, we use the immediately and is not dependent on the completion of the recently released data integration system LifeDB [26] and its current perform statement. query language BioFlow specifically designed for life sciences 4.4.1.2 CONVERT Statement: Technically, the convert applications. In particular, BioFlow’s extract statement, shown statement requires collecting mapping results from a set of ID below, returns a single table for every tuple in relation r from converters, and allowing an arbitrary set of IDs to be mapped a deep web site in real time. to another arbitrary set of IDs, usually without any knowledge

extract A1,...,An of the conversion functions or converter. The choice of the using matcher ω wrapper ρ filler φ set of converters usually depends on their capabilities and from ϕ submit r where θ; their potential agreement with user query semantics. We can disassemble the statement into a series of convert statements In this statement, we are assuming that Ais are the mapped IDs extracted by wrapper ρ from a site ϕ for every tuple in first according to the mapping behavior of the converters with r. Any schema mismatch during extraction is handled by the matching semantics, and then rewriting them in BioFlow as a schema matcher ω, and all form submissions use form filler union query. For example, GeneID Converter allows multiple function φ. The where clause supported in extract statement mappings for a set of IDs of one kind (as shown in figure can be used to express filter conditions over relation r and the 4(a)), whereas the ID converter at UniProt will only map one form entries. ID type to one single ID type. Therefore, to map IDs to a BioFlow also supports two other statements called combine set of ID types and collecting them in one set will require and link to facilitate entity based [19] (as opposed to tuple a functionality similar to outer join. The convert statement based) union and join respectively, which can be used to in section 3.1 thus is implemented as the following BioFlow collect IDs from different sources, and for implementing the query using combine, link and extract by recognizing the extend statement despite representation heterogeneities. The fact that UniProt does not map IDs to ID types EnsemblLoc, syntax for the two statements are presented below. UCSCLoc, Affymetrix. combine (extract UniGene, Entrez, EnsemblLoc, UCSCLoc, combine r, s using matcher ω identifier κ; Affymetrix link r, s using matcher ω identifier κ; using matcher OntoMatch wrapper FastWrap filler iForm In the statements, as before, ω is a schema matcher, whereas from http://idconverter.bioinfo.cnio.es/ submit r), κ is a key discovery or entity discovery function such as (link (extract UniGene Gordian [27]. For the sake of conciseness, we do not elaborate using matcher OntoMatch wrapper FastWrap filler iForm further on these statements and refer the readers to [35] from http://www.uniprot.org/mapping/ submit r), for a complete exposition on BioFlow. Instead, we focus on (extract Entrez how these sentences can be used to capture the semantics of using matcher OntoMatch wrapper FastWrap filler iForm convert and extend in MapBase. from http://www.uniprot.org/mapping/ submit r) Finally, BioFlow supports arbitrary computational pipeline using identifier gordian) construction using desktop and online resources. It uses using matcher OntoMatch identifier gordian; named, possibly stored, processes as units of operation with its In this statement, OntoMatch [51] and FastWrap [52] respec- process statement perform to construct complex and powerful tively are schema matcher and wrapper generator available in pipelines or workflows. Since BioFlow allows the use of the BioFlow toolbox. multiple online resources in real time, there is a potential for 4.4.1.3 EXTEND Statement: An extend statement of the using parallelism and distributed computing to our advantage. form extend r,s where map(r.A,s.B) using UniProt can be So, unless there is a need to process services in each web site implemented using a simple mapping procedure in which we in sequence (such as when the input to one web site form rewrite it as a link statement shown below. depends on the output of a previous web site), we can submit link (link (r,(extract t ,t requests in parallel and collect in a desirable sequence. The A B using matcher OntoMatch wrapper FastWrap filler iForm syntax and semantics of the perform statement below allows from http://www.uniprot.org/mapping/ this flexibility. submit (select r.A from r)), perform [parallel] p ,...,p [after q ,...,q ] [leave]; 1 k 1 n using identifier gordian), s perform offers three optional controls – parallel, after and using identifier gordian); leave. If used, parallel means the named processes p ,...,p 1 k In the above query, tA and tB are ID types corresponding will be invoked in parallel. If the option parallel is not used, to the attributes r.A in r and s.B in s. The inner extract ,..., p1 pk are invoked in the sequence listed. after clause statement collects the map pairs from UniProt converter for ,..., results in invoking p1 pk as stated after the sequence of IDs in column r.A and pairs them up with ID type t in one ,..., B processes q1 qn are executed. The leave option transfers single table. This binary table is then used as a three way join control to the next statement in the script, and makes the between r and s via this mapping table, thereby implementing processes run in the background. Technically, without this the intended semantics of the extend statement. option, all invocations are in synchronous mode i.e., the next sentence begins execution when the current sentence com- 4.4.2 Query Processing Algorithm pletes execution. In contrast, all invocation is asynchronous The processing strategy is simple. First, the mapping need is when this option is used, i.e., the next sentence starts executing converted into quadruples of the form i f ,it ,c,m where i f is

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 10

the ID to be converted to ID it (which is currently empty), c 5.1 Semantics of convert Statement is the mapping cardinality, and m is the mapper to be used. To test various mapping conditions using convert,wehave First, if any ID in i f appears in the ID migration table, we used three subsets from the data sets used in [28]. The first remove the ID from the query list to avoid inconsistency. data set consists of 11 affymetrix IDs (about 10% of the Second, the query list is joined with the map pair table on random IDs used in [28]) that were used for a conversion from columns i f , c and m. If a target mapping exists in the map affymetrix to Ensembl IDs using all four conversion tools. The pair table (non-empty join) for an ID, we remove the ID from results are shown in the table in figure 6(a). Tables 6(b) and the query, and include the mapping found in the map pair 6(c), on the other hand, show UniGene to Ensembl mapping table directly into the query response. We, however, remove for 11 mouse and human UniGene IDs using BioMart and the mapping from the response and put it back in the query DAVID, and BioMart and MADGene, respectively. list for re-computation and materialization if it appears in the The tables in figure 6 were generated using the following ID migration table, again to avoid inconsistent mapping. After three convert statements where the three input tables affy- computing the query, we union the computed response with table, mouse-table and human-table contain a single column the subset collected from the map pair table. We avoid the of IDs of the type affymetrix, mouse UniGene and human second step and keep all the IDs in the query and compute UniGene respectively. the mappings if the refresh option is used. As a final step, t1 convert affy-table into ensembl all; we enforce the user specified cardinality constraints on the t2 convert mouse-table into ensembl any using response before they are materialized and returned to the user. BioMart, DAVID; t convert human-table into ensembl all; The refresh option is made available as a device to 3 avoid obsolete mappings warehoused locally in the MapBase In this example, we have engineered meta-data to inform database. Recall that re-computation of mappings is initiated MapBase that all four converters are able to accept affymetrix if update information of IDs is available, or mappings are not inputs to produce Ensembl mappings, whereas only BioMart materialized in MapBase. Furthermore, cardinality constraints and MADGene can accept human UniGene IDs for the same. are enforced after the differential computation has already In contrast, BioMart and DAVID are enabled for mouse taken place. That means, if for an ID i, two mappings are UniGene to Ensembl mapping. The option all in statement already available in MapBase, the up-to-date mapping for t1 and no mention of any converter were the reasons why all i will not be pursued. If the query included the constraint four converters and all mappings, including multiple mappings 1 : 1, the mapping will fail. The 1 : M mapping for i,in for 1436713 s at and mismatched mappings for 1416945 at, this case, must have been computed and materialized for 1417042 at, 1423163 at, and 1423945 a at (shown in red) which the cardinality constraint was omitted, or were 1 : M were included in the corresponding table in figure 6(a). Rows for these IDs were eliminated when we issued the statement or M : N probably reflecting that one or many converters or databases supplied that mapping, possibly in error. Since the t1 below because we sought consensus among all included materialization, this error may have been corrected and a re- converters. Note that rows for IDs 1428301 at, 1429752 x at, computation may result in a more accurate reflection of the and 1456492 at were retained because MapBase treats no null real mapping. Conversely, a materialized 1 : 1 mapping for i match as values, i.e., consensus constraints are enforced only on non null values. On the other hand, when the statement will prevent a new query for 1 : M mapping to take place, which could be facilitated using refresh. The exact usage of t1 was issued, only the column for g:Profiler was reported this option is thus a query optimization issue and offers good with no match corresponding to ID 1436713 s at because this opportunities to make intelligent choices. converter reported more than one map for this ID, and thereby violated the constraint one, i.e., 1 : 1 mapping. However,

issuing the statement t1 resulted in the elimination of the entire row for 1436713 s at simply because consensus of 1:1 5INITIAL EMPIRICAL EVALUATION mapping across converters was sought. t convert affy-table into ensembl all for consensus; This evaluation section has two basic goals. First, in order 1 t1 convert affy-table into ensembl one; to experimentally validate the hypothesis inherent in the Map- t convert affy-table into ensembl one for consensus; Base design that the convert statement serves as an abstraction 1 In statement t , had we included MADGene as a possible for ID mapping with cardinality constraint over a universal 2 converter, a column for MADGene IDs would have been database of all ID conversions, we have used five different included with null values as MADGene is not known to be able conversion tools – g:Profiler, DAVID, Clone/ID Converter, to convert mouse UniGene to Ensembl in MapBase according BioMart, and MADGene, and a subset of IDs used in our to our meta-data. A consensus mapping as shown in statement recent comparative study in [28] to compute mappings under t2 below had no effect on the generated mapping. different query semantics permitted in this statement. We have then compared the results obtained in MapBase with the results t2 convert mouse-table into ensembl any using in [28] and validated that convert computed the correct results BioMart, DAVID, MADGene for consensus; 100% of the time. Second, it highlights the fact that this evaluation is more of a proof-of-concept than an exhaustive 5.2 Performance Considerations and Parameters experimental study of the effectiveness and efficiency of the The parameters refresh, any, one, all and consensus play proposed approach. significant roles in deciding how the convert, and transitively

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 11

Affymetrix ID g:Profiler Clone/ID Converter DAVID MADGene Comments 1416247 at ENSMUSG00000028447 ENSMUSG00000028447 ENSMUSG00000028447 ENSMUSG00000028447 Consensus mapping 1416909 at ENSMUSG00000010607 ENSMUSG00000010607 ENSMUSG00000010607 ENSMUSG00000010607 Consensus mapping 1416945 at ENSMUSG00000038502 ENSMUSG00000002963∗ ENSMUSG00000038502 ENSMUSG00000038502 Clone/ID Converter dis- agrees 1417042 at ENSMUSG00000032114 ENSMUSG00000032112∗ ENSMUSG00000032114 ENSMUSG00000032114 Clone/ID Converter dis- garees 1419759 at ENSMUSG00000040584 ENSMUSG00000040584 ENSMUSG00000040584 ENSMUSG00000040584 Uniform mapping 1423163 at ENSMUSG00000092417∗ ENSMUSG00000024389 ENSMUSG00000024389 ENSMUSG00000024389 g:Profiler disgarees 1423945 a at ENSMUSG00000035268 ENSMUSG00000017697∗ ENSMUSG00000035268 ENSMUSG00000035268 Cloe/ID Converter dis- agrees 1428301 at ENSMUSG00000079386 no match no match no match Only g:Profiler mapping available 1429752 x at ENSMUSG00000024059 no match ENSMUSG00000024059 ENSMUSG00000024059 Missing Clone/ID Con- verter mapping ENSMUSG00000084535 1436713 s at no match ENSMUSG00000021268 ENSMUSG00000021268 Partial match for DAVID ENSMUSG00000021268 and MADGene, and missing map for Clone/ID Converter 1456492 at ENSMUSG00000031858 no match ENSMUSG00000031858 ENSMUSG00000031858 Missing Clone/ID Con- verter mapping (a) Affymetrix to Ensembl ID mapping using g:Profiler, Clone/ID Converter, DAVID and MADGene

UniGene ID BioMart DAVID Comments Mm.100399 ENSMUSG00000024515 ENSMUSG00000024515 Consensus mapping Mm.100615 ENSMUSG00000056608 ENSMUSG00000056608 Consensus mapping Mm.10109 no match ENSMUSG00000021611 Missing BioMart mapping Mm.101369 ENSMUSG00000020573 ENSMUSG00000020573 Consensus mapping Mm.1019 no match ENSMUSG00000025746 Missing BioMart mapping Mm.158700 ENSMUSG00000021614 ENSMUSG00000021614 Consensus mapping ENSMUSG00000020941 Mm.158981 ENSMUSG00000020941 Partial BioMart Mapping ENSMUSG00000078635 Mm.184163 ENSMUSG00000000441 ENSMUSG00000000441 Consensus mapping ENSMUSG00000041144 ENSMUSG00000067440 Mm.184692 ENSMUSG00000096141 Partial BioMart Mapping ENSMUSG00000067536 ENSMUSG00000073694 Mm.1892 ENSMUSG00000018634 ENSMUSG00000018634 Consensus mapping ENSMUSG00000028076 Mm.1894 ENSMUSG00000028076 Partial BioMart Mapping ENSMUSG00000041750 (b) Mouse UniGene-Ensembl mapping using BioMart and DAVID.

UniGene ID MADGene BioMart Comments ENSG00000186868 Hs.101174 ENSG00000186868 Partial BioMart mapping ENSG00000235320 Hs.103183 ENSG00000102081 no match No BioMart mapping Hs.118262 ENSG00000151067 no match No BioMart mapping ENSG00000075624 Hs.1288 ENSG00000143632 Partial BioMart mapping ENSG00000143632 ENSG00000183214 ENSG00000204516 ENSG00000204520 ENSG00000183214 ENSG00000206449 ENSG00000204520 Hs.130838 ENSG00000227772 ENSG00000231225 Partial BioMart mapping ENSG00000231225 ENSG00000233051 ENSG00000233051 ENSG00000235233∗ ENSG00000234218 ENSG00000238289 ENSG00000160323 Hs.131433 ENSG00000160323 Partial MADGene mapping ENSG00000260099 Hs.147433 ENSG00000132646 no match No BioMart mapping Hs.148078 ENSG00000127481 ENSG00000127481 Consensus mapping Hs.149037 ENSG00000147246 no match No BioMart mapping Hs.149261 ENSG00000159216 ENSG00000159216 Consensus mapping Hs.154299 ENSG00000164251 no match No BioMart mapping (c) Human UniGene-Ensembl mapping using BioMart and MADGene.

Fig. 6. Affymetrix and UniGene to Ensembl ID mapping under various conditions and options in MapBase using convert statement implemented based on a translation to BioFlow.

the extend, statements execute and the costs they incur. of Ensembl IDs corresponding to BioMart and MADGene. In this context, we must point out that the semantics for Note that the BioMart map Hs.130838 to ENSG00000235233 consensus are still evolving in MapBase as we accept the (shown in starred blue) is not included in the MADGene set current interpretation as exact match across all mappers that of mappings while the rest are, i.e., these two sets intersect return non null values. To understand this from a technical but one is not contained in another. So, choosing to accept

standpoint, consider the case of human UniGene ID Hs.130838 inclusion as the semantics for consensus as shown in t3 as shown in the table in figure 6(c) where it shows two sets below will eliminate the row for Hs.130838 while we will

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 12

retain the rows for 1436713 s at in figure 6(a), Mm.158981 5.3 MapBase Prototype and Mm.184692 in figure 6(b), and Hs.101174, Hs.1288 and We have implemented a partially manual version of MapBase Hs.131433 in figure 6(c) respectively. Therefore, we chose the that has been rapid-prototyped by our application building stricter semantics in favor of efficiency as testing for subset toolkit VizBuilder [53]. This version is able to generate containment is relatively more expensive. the BioFlow statements needed to map IDs from sites that

t3 convert human-table into ensembl all for do not use JavaScripts fully automatically for execution in consensus; LifeDB. However, many mapping sites, such as BioMart, In our current implementation, we have also accepted a extensively use JavaScript and other advanced design tech- weaker enforcement choice for the refresh option available nologies that render traditional wrappers largely ineffective. for convert statement in favor of efficiency. Real time online To counter this weakness of contemporary wrappers, we query processing in a distributed architecture such as in Map- have implemented an online version of MapBase (available Base can become considerably expensive with poor response at http://mapbase.nkn.uidaho.edu/MapBase/) in Java with Se- time, especially when an open ended ID conversion query is lenium API for a limited number of sites (e.g., DAVID, submitted. For example, a query of the form in statement s1 g:Profiler, MADGene, BioMart and Clone/GeneID Converter). below does not restrict the number of converters and requires This choice increases the time MapBase takes to process that we compute an M : N mapping for consensus. each site as virtual terminals, even though not all sites use JavaScript, but it ensures that MapBase will not break down s convert input-table into ensembl any for 1 due to JavaScripts. Devising methods to dynamically detect consensus; sites with JavaScripts and efficiently generate site specific Technically, all converters indexed by MapBase capable of wrappers is an ongoing research. Once developed, dynamic executing this conversion is a potential candidate. Since option querying of arbitrary converters will become possible, an en- any is used, we are required to gather all possible maps and hancement we plan to include in our next release of MapBase. check for consensus. In the absence of any cost estimates for online resources such as BioMart and DAVID, a cost- based optimization of a convert query is difficult to plan. But, 5.4 Query Performance and Complexity BioFlow’s perform parallel construct for workflow construc- It is worth noting that developing a space-time theoretical tion may be used to expedite the processing by submitting con- bound for a system such as MapBase is exceedingly difficult version tasks to all sites simultaneously to increase throughput. and deeply rooted in a large body of past research. It involves The refresh, and the consensus and cardinality constraint components that are hard to characterize such as networks, options any, one and all offer an additional opportunity to warehouses, a large number of data interoperability artifacts drastically reduce the computation time of convert. such as schema matchers and wrappers, translation algorithms, Given that the update frequency is quite low in this do- and so on. Since our goal is to empirically demonstrate the main, we chose to materialize mappings in the Map Pair idea and its merit, we choose to do so experimentally, and table in MapBase to avoid repeated computation by removing leave a more theoretical investigation as future research. We, query IDs for which requested mapping information is locally however, note that the query cost is largely the sum total of the available. Since materialized mapping may become obsolete costs for query translation t, external processing e, network due to updates, a maintenance procedure is needed for the transmission n, and local remedial processing l. Since the Map Pair table. We chose to use the refresh option to force relationship t  l  (e + n) holds in general as we discuss a re-computation instead of active incremental maintenance next, it is safe to assume that the computational complexity commonly found in warehouses with high update frequencies. of MapBase queries is a complex function of largely external Opportunities still exist to design heuristics to intelligently costs, i.e., O(e+n). Since MapBase processes external queries ( apply refresh based on potential update information made in parallel, the cost e is max e1,e2,...,ek) for each external available by converters and query cardinality constraints. The query qi,1≤ i ≤ k. decision to refresh may also be based on materialization time Since MapBase takes additional time to extract and process and may be ignored if it is recent, with the assumption that data from each site in parallel, usually the time taken is re-computation will not yield disagreement. slightly higher than direct user queries at individual sites. In addition to refresh, the options one and consensus However, MapBase always performs better if mappings are have significant impact on how convert is executed. Whether sought from multiple sites in bulk and user constraints are we decide to refresh or not, we can always decide to stop part of the requirement, which users usually perform off line mapping IDs for which we already know these constraints no at a significant cost. To emphasize this point, we show the longer hold. For example, if we know that two converters have execution times for each of the queries in sections 5.1 and 5.2 already produced multiple mappings, or disagreements, and in figure 7 and contrast with direct user querying manually. In there is no hope for other converters to reverse that decision, this figure, the entry [(14,2), 20] under the converter g:Profiler time should not be spent computing them. To effect this in row 1 means that MapBase takes 14 seconds to gather the approach, a more reasonable plan is to execute the conversions mappings from the converter, and 2 seconds to process them in serial and not use the parallel option of the perform to test user constraints (e.g., 1 : 1, 1 : M, consensus) using statement in BioFlow so that we have the opportunity to Apache Derby locally and displaying the results. The entry remove IDs for the next conversion in the pipeline. 20 is the time required for querying directly using the web

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 13

Row Query Constraint g:Profiler ID Converter DAVID MADGene BioMart Cumulative Time Real Time

1 t1 Any [(14,2), 20] [(26,2), 31] [(7,2), 16] [(17,2), 21] - [(64,8), 88] [28, 88] 2 t1 1-1 [(14,3), >20] [(26,3), >31] [(8,3), >16] [(16,3), >21] - [(64,12), >88] [29, >88] 3 t1 1-M [(20,2), >20] [(19,2), >31] [(12,2), >16] [(14,2), >21] - [(65,8), >88] [22, >88] 4 t1 consensus [(11,2), 20] [(19,2), 31] [(14,2), 16] [(16,2), 21] - [(60,8), 88] [21, 88] 5 t2 Any - - [(11,2), 23] - [(7,5), 18] [(18,7), 41] [13, 41] 6 t2 1-1 - - [(11,3), >23] - [(6,5), >18] [(17,8), >41] [14, >41] 7 t2 1-M - - [(12,3), >23] - [(7,5), >18] [(19,8), >41] [15, >41] 8 t2 consensus - - [(9,3), 23] - [(10,7), 18] [(19,10), 41] [17, 41] 9 t3 Any - - - [(9,2), 8] [(6,4), 13] [(15,6), 21] [11, 21] 10 t3 1-1 - - - [(11,2), >8] [(6,3), >13] [(17,5), >21] [13, >21] 11 t3 1-M - - - [(10,3), >8] [(6,5), >13] [(16,8), >21] [13, >21] 12 t3 consensus - - - [(6,2), 8] [(10,4), 13] [(16,6), 21] [14, 21] Fig. 7. Execution time for queries in section 5. All times shown are rounded to nearest seconds.

interface instead. The entries of the form [(20,2), >20] and used to gather mappings based on their application semantics. [(11,2), 20] in rows 3 and 4 under the converter g:Profiler Such an approach offers greater control and mapping choice respectively, mean the manual time taken is higher or much at a high level of abstraction. higher than 20 seconds, which is expected in the case of The query translation based implementation in BioFlow manual constraint enforcement by visual inspection. Finally, and the overall framework proposed offer many optimization the last two columns – “Cumulative Time” (e.g., [(64,8), 88] is opportunities, some of which have been highlighted. While a the element wise sum total of the entries [(14,2), 20], [(26,2), complete implementation of MapBase remains as an ongoing 31], [(7,2), 16] and [(17,2), 21] in row 1) and “Real Time” research, the current prototype though fragmented and partly (e.g., [28, 88] computed as [max((14+2),(26+2),(7+2),(17+2)), manually assembled, amply demonstrates its efficacy, utility sum(20,31,16,21)] corresponding to the entries in row 1 under and superiority in terms of declarative ID mapping. MapBase each converter), show the total individual time for all con- stands apart from warehouses such as GeneCards, and tools verters, and the actual real time the user spends in mapping. such as IDBean in that users have greater control and choices, Since MapBase queries are processed in parallel, the real and better chances to have access to novel and emerging time performance is equal to the total time taken for the mapping systems. A theoretical upper bound on MapBase’s slowest converter. We must note that these times are usually query complexity also remains as one of our main future dependent upon network traffic and server load and hence is research goal. not a constant. So are the computed results as the underlying database is likely to change. ACKNOWLEDGMENTS

6CONCLUSION AND FUTURE PLANS Research supported in part by National Science Foundation grant IIS 0612203. The experimental setup in section 5 was Our main goal in this paper was to introduce a novel mech- designed with the help of Aminul Islam who implemented the anism for ID mapping based record linkage in biological BioFlow scripts for execution in LifeDB. The overall mapping databases using declarative means. To this end, we have statistics and mapping quality summarized in section 1.3 was introduced a new declarative statement called extend that carried out as a joint research discussed in [28]. The current can be used in database query languages such as SQL or MapBase prototype was developed by Maria J. Cepeda. BioFlow. We have also proposed an implementation strategy for this statement including all necessary machineries. To our knowledge, record linkage based on ID mapping is unprece- REFERENCES dented and advances state of the art of data fusion in general. The key to this development is the recognition that usual [1] R. Chirkova and J. Yang, “Materialized views,” Foundations and Trends in Databases, vol. 4, no. 4, pp. 295–405, 2012. key discovery in record linkage technology is ineffective and [2] A. Gupta and I. S. Mumick, “Maintenance of materialized views: needs to be replaced with the notion of dynamic ID mapping. Problems, techniques, and applications,” IEEE Data Eng. Bull., vol. 18, We have demonstrated that ID mapping based on-the-fly no. 2, pp. 3–18, 1995. [3] D. W. Huang, B. T. Sherman, R. Stephens, M. W. Baseler, H. C. Lane, data fusion and data integration is feasible by assembling and R. A. Lempicki, “DAVID gene ID conversion tool,” Bioinformation, appropriate machineries, which was not previously possible vol. 2, no. 10, pp. 428–430, 2008. using traditional means. [4] S. Draghici, S. Sellamuthu, and P. Khatri, “Babel’s tower revisited: a universal resource for cross-referencing across annotation databases,” In the interest of user transparency and maintaining higher Bioinformatics, vol. 22, no. 23, pp. 2934–2939, 2006. levels of abstractions, mapping of IDs need to be modeled [5] G. F. Berriz and F. P. Roth, “The synergizer service for translating gene, as an online real time service as a background task, which protein and other biological identifiers,” Bioinformatics, vol. 24, no. 19, pp. 2272–2273, 2008. we have implemented as another declarative statement called [6] A. Alibes,´ P. Yankilevich, A. Canada,˜ and R. D´ıaz-Uriarte, “Idconverter convert. We have then leveraged this statement to propose a and idclight: Conversion and annotation of gene and protein ids,” BMC ID mapping service called MapBase as a graphical interface Bioinformatics, vol. 8, 2007. [7] S. Lee, B. Kim, H. Kim, H. Lee, and U. Yu, “IdBean: a java GUI ap- for general use in life sciences. In MapBase, we support real plication for conversion of biological identifiers,” BMB reports, vol. 44, time mapping with various quality control attributes that can be no. 2, pp. 107–112, Feb. 2011.

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCBB.2014.2355213

IEEE/ACM TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. X, NO. Y, MONTH 2013 14

[8] H. Lee, H. Kim, and U. Yu, “Idmapper: A java application for id [35] H. M. Jamil, A. Islam, and S. Hossain, “A declarative language and mapping across multiple cross-referencing providers,” Genomics and toolkit for scientific workflow implementation and execution,” IJBPIM, Informatics, vol. 7, no. 4, pp. 208–211, December 2009. vol. 5, no. 1, pp. 3–17, 2010. [9] “g:Profiler,” http://biit.cs.ut.ee/gprofiler/gconvert.cgi. [36] J. Goecks, A. Nekrutenko, J. Taylor, and Galaxy Team, “Galaxy: a [10] “MatchMiner,” http://discover.nci.nih.gov/matchminer/MatchMinerLookup.jsp. comprehensive approach for supporting accessible, reproducible, and [11] A. Bhattacharjee, A. Islam, H. M. Jamil, and D. Wildman, “IDChase: transparent computational research in the life sciences.” Genome biology, Mitigating identifier migration trap in biological databases,” in IC3. vol. 11, no. 8, pp. R86+, Aug. 2010. Noida, India: Springer, August 2009. [37] E. A. Kawas, M. Senger, and M. D. Wilkinson, “Biomoby extensions [12] W. Shen, P. DeRose, R. McCann, A. Doan, and R. Ramakrishnan, to the taverna workflow management and enactment software,” BMC “Toward best-effort information extraction,” in SIGMOD, 2008, pp. Bioinformatics, vol. 7, p. 523, 2006. 1031–1042. [38] E. Cadag, B. Louie, P. J. Myler, and P. Tarczy-Hornoch, “Biomediator [13] M. J. Cafarella, A. Y. Halevy, and N. Khoussainova, “Data integration data integration and inference for functional annotation of anonymous for the relational web,” PVLDB, vol. 2, no. 1, pp. 1090–1101, 2009. sequences,” in Pacific Symposium on Biocomputing, 2007, pp. 343–354. [14] L. Liu, C. Pu, and W. Tang, “Continual queries for internet scale event- [39] J. M. Patel, “The role of declarative querying in bioinformatics,” OMICS, driven information delivery,” IEEE Trans. Knowl. Data Eng., vol. 11, vol. 7, no. 1, pp. 89–92, 2003. no. 4, pp. 610–628, 1999. [40] D. Blankenberg, N. Coraor, G. Von Kuster, J. Taylor, and A. Nekrutenko, [15] L. Liu, C. Pu, W. Tang, and W. Han, “CONQUER: A continual query “Integrating diverse databases into an unified analysis framework: a system for update monitoring,” in in the WWW. International Journal Galaxy approach,” Database, vol. 2011, Jan. 2011. of Computer Systems, Science and Engineering, 1999. [41] D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M. R. Pocock, P. Li, [16] S. Khan and P. L. Mott, “LeedsCQ: A scalable continual queries system,” and T. Oinn, “Taverna: a tool for building and running workflows of in DEXA, 2002, pp. 607–617. services.” Nucleic Acids Res, vol. 34, July 2006, web Server issue. [17] W. Tang, R. K. Jones, L. Liu, and C. Pu, “BizCQ: using continual [42] T. H. Nguyen, H. Nguyen, and J. Freire, “PruSM: a prudent schema queries to cope with changes in business information exchange,” in matching approach for web forms,” in CIKM, 2010, pp. 1385–1388. WWW (Alternate Track Papers & Posters), 2004, pp. 470–471. [43] G. A. Toda, E. Cortez, A. S. da Silva, and E. S. de Moura, “A prob- [18] H. K. Jain and A. Gosain, “A comprehensive study of view maintenance abilistic approach for automatically filling form-based web interfaces,” approaches in data warehousing evolution,” ACM SIGSOFT Software PVLDB, vol. 4, no. 3, pp. 151–160, 2010. Engineering Notes, vol. 37, no. 5, pp. 1–8, 2012. [44] N. N. Dalvi, R. Kumar, and M. A. Soliman, “Automatic wrappers for [19] G. Papadakis, E. Ioannou, C. Niederee,´ and P. Fankhauser, “Efficient large scale web extraction,” PVLDB, vol. 4, no. 4, pp. 219–230, 2011. entity resolution for large heterogeneous information spaces,” in WSDM, [45] W. Fan, J. X. Yu, J. Li, B. Ding, and L. Qin, “Query translation from 2011, pp. 535–544. xpath to sql in the presence of recursive dtds,” VLDB J., vol. 18, no. 4, [20] J. Finney, A. Walker, T. Peto, and D. Wyllie, “An efficient record linkage pp. 857–883, 2009. scheme using graphical analysis for identifier error detection,” BMC [46] C. T. Yu, Y. Zhang, W. Meng, W. Kim, G. Wang, T. Pham, and S. Dao, Medical Informatics and Decision Making, vol. 11, no. 1, p. 7, 2011. “Translation of object-oriented queries to relational queries,” in ICDE, [21] L. Jin, C. Li, and S. Mehrotra, “Efficient record linkage in large data 1995, pp. 90–97. sets,” in DASFAA, 2003, pp. 137–. [47] H. Liang, W. Zuo, F. Ren, and C. Sun, “Accessing deep web using [22] M. Michelson and C. A. Knoblock, “Learning blocking schemes for automatic query translation technique,” in FSKD (2), 2008, pp. 267– record linkage,” in AAAI, 2006. 271. [23] H. Kopcke,¨ A. Thor, and E. Rahm, “Evaluation of entity resolution [48] A. Afonso, L. da C. Brito, and O. Vale, “An evolutionary method for approaches on real-world match problems,” PVLDB, vol. 3, no. 1, pp. natural language to sql translation,” in SEAL, 2008, pp. 432–441. 484–493, 2010. [49] S. D. Urban and T. B. Abdellatif, “Object-oriented query language access [24] S. Guo, X. Dong, D. Srivastava, and R. Zajac, “Record linkage with to relational databases: A semantic framework for query translation,” uniqueness constraints and erroneous values,” PVLDB, vol. 3, no. 1, pp. Journal of Systems Integration, vol. 5, no. 2, pp. 123–156, 1995. 417–428, 2010. [50] D. DeHaan, D. Toman, M. P. Consens, and M. T. Ozsu,¨ “A compre- [25] M. Yakout, A. K. Elmagarmid, H. Elmeleegy, M. Ouzzani, and A. Qi, hensive xquery to sql translation using dynamic interval encoding,” in “Behavior based record linkage,” PVLDB, vol. 3, no. 1, pp. 439–448, SIGMOD Conference, 2003, pp. 623–634. 2010. [51] A. Bhattacharjee and H. M. Jamil, “OntoMatch: A monotonically [26] A. Bhattacharjee, A. Islam, M. S. Amin, S. Hossain, S. Hosain, H. M. improving schema matching system for autonomous data integration,” Jamil, and L. Lipovich, “On-the-fly integration and ad hoc querying of in IEEE IRI, August 2009, pp. 318–323. life sciences databases using LifeDB,” in DEXA, 2009, pp. 561–575. [52] M. S. Amin and H. M. Jamil, “An efficient web-based wrapper and [27] Y. Sismanis, P. Brown, P. J. Haas, and B. Reinwald, “GORDIAN: annotator for tabular data,” IJSEKE, vol. 20, no. 2, pp. 215–231, 2010. efficient and scalable discovery of composite keys,” in VLDB, 2006, [53] H. M. Jamil, “Designing integrated computational biology pipelines pp. 691–702. visually,” TCBB, vol. 10, no. 3, pp. 605–618, May/June 2013. [28] D. Cruz, H. Jamil, and D. Forero, “Performance comparison of compu- tational tools for the conversion of genomic identifiers,” University of Antonio Narino, Bogota, Colombia, Tech. Rep., March 2013. [29] F. Al-Shahrour, P. Minguez, J. Tarraga, D. Montaner, E. Alloza, J. M. Vaquerizas, L. Conde, C. Blaschke, J. Vera, and J. Dopazo, “BABE- LOMICS: a systems biology perspective in the functional annotation of Hasan M Jamil (M’00) received the BS and MS genome-scale experiments,” Nucl. Acids Res., vol. 34, no. suppl 2, pp. degrees in applied physics and electronics from W472–476, Jul. 2006. the University of Dhaka, Bangladesh, in 1982 [30] G. Liu, A. E. Loraine, R. Shigeta, M. Cline, J. Cheng, V. Valmeekam, and 1984, respectively, and the PhD degree S. Sun, D. Kulp, and M. A. Siani-Rose, “NetAffx: Affymetrix probesets in computer science from Concordia University, and annotations.” NAR, vol. 31, no. 1, pp. 82–86, Jan. 2003. Canada, in 1996. His current research interests [31] B. Waegele, I. Dunger-Kaltenbach, G. Fobo, C. Montrone, Mewes, and are in the areas of databases, bioinformatics, A. Ruepp, “CRONOS: the cross-reference navigation server,” Bioinfor- and knowledge representation. In particular, he matics, vol. 25, no. 1, pp. 141–143, Jan. 2009. is interested in the management and querying of [32] U. Mudunuri, A. Che, M. Yi, and R. M. Stephens, “bioDBnet: the interaction and gene expression data, and their biological database network.” Bioinformatics, vol. 25, no. 4, pp. 555– applications in disease gene prioritization and 556, Feb. 2009. unfolded protein response. He is an associate professor in the De- [33] D. Baron, A. Bihouee,´ R. Teusan, E. Dubois, F. Savagner, M. Steenman, partment of Computer Science, University of Idaho. He was previously R. Houlgatte, and G. Ramstein, “MADGene: retrieval and processing on the faculty of Macquarie University, Sydney, Australia, Mississippi of gene identifier lists for the analysis of heterogeneous microarray State University, and Wayne State University. He is also a member of datasets.” Bioinformatics, vol. 27, no. 5, pp. 725–726, Mar. 2011. the Association for Computing Machinery, ACM Special Interest Group [34] T. Yoshida, H. Ono, A. Kuchiba, N. Saeki, and H. Sakamoto, “Genome- on Management of Data, the Association for Logic Programming, the wide germline analyses on cancer susceptibility and GeMDBJ database: International Society for Computational Biology, and the IEEE. Gastric cancer as an example,” Cancer Science, vol. 101, no. 7, pp. 1582–1589, 2010.

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].