
Developing an Expert Finding System using Medical Subject Headings (MeSH)

Authors

Harpreet Singh*+, Reema Singh+, Arjun Malhotra+ and Manjit Kaur+ *Corresponding Author

*Address for Correspondence +Bioinformatics Centre, Indian Council of Medical Research, Ansari Nagar, New Delhi, 110029 Telephone: +91-11-26589556 E-Mail: [email protected]

Abstract

Motivation: Efficient identification of experts or expert communities is vital for the growth of any organization. Most available expert finding systems are based on self-nomination, which can be biased, and they are unable to rank experts.

Results: Using the MeSH terms associated with peer-reviewed publications from India, we have developed an expert finding system. Our system is unique in that it is based on peer-reviewed information and quantifies expertise, which allows experts to be ranked. Using MeSH terms as subjects standardizes the subject terminology. The system has been tested using known subject experts from India and has been found useful.

Conclusions: The expert finding system successfully identified subject experts from India. The system meets the requirements of an ideal expert finding system.

Availability: http://bmi.icmr.org.in/expert. The scripts used in this work can be obtained from [email protected].

Introduction

The ability to rapidly identify subject experts is essential for the successful functioning of an organization. Some of the advantages of an expert finding system include (a) rapid formation of operational or proposal teams to accelerate research [1], (b) identification of potential collaborators, (c) matching of reviewers to submitted research proposals, manuscripts and other peer-reviewed documents [2], (d) identification of the expertise available within an organization [3], (e) monitoring of the research priorities of an organization and (f) prediction of the effects of skill loss (attrition or retirement) or gain (merger or acquisition).

Though useful, locating and evaluating subject experts is a difficult task: experts are rare and unevenly distributed, the requirements of expertise seekers are often poorly articulated, information about the past performance of experts is lacking, and expertise is difficult to classify and quantify. Relocation of experts further complicates the task. Finally, complex proposals, problems and issues require the combined wisdom of experts from multiple subjects. Thus there is a need to develop a computer-based system for finding experts. Developments in this field can reduce the time required for solving time-sensitive problems, thus improving the overall efficiency of an organization.

An Expert Finding System (EFS), also known as an Expertise Location System (ELS), is a protocol, method or program that enables seekers to discover subject experts in order to hire them or acquire their knowledge. An EFS can make an organization more efficient and effective by rapidly locating individuals or communities of experts to accelerate research and development. An EFS enables rapid formation of operational teams, proposal teams, evaluation teams and taskforce groups, and supports the formation of cross-disciplinary groups to guide the activities of an organization or respond to new threats, challenges or opportunities. An efficient EFS can also be used to assess organizational skill sets, enabling the identification of skill atrophy, the discovery of new and emerging skills, or the prediction of the effects of skill loss (e.g. as a result of attrition or retirement) or gain (e.g. as a result of merger or acquisition).

An ideal Expert Finding System should have the following qualities:

• Should be robust, that is, it should be easily and cost-effectively updated without much user intervention

• Should be able to classify expertise into a standard subject classification schema

• Should provide information about the past performance of experts

• Should quantify expertise, enabling ranking of experts

• Should be based on an authentic source of information

• Should be able to form expert communities

• Should be able to identify locally available experts

Available Expert Finding Systems

Efforts have been made over the last 20 years to develop expert finding systems. A number of such systems have been developed at both the national and international level [3-6]. Broadly, the methods used for developing expert finding systems can be grouped into two categories.

Methods based on mining unstructured information

Unstructured information includes emails, corporate or personal web pages, wikis, reports etc. Text mining tools are used to index technical terms from unstructured documents, and the index can then be queried to identify subject experts. Experts are further ranked by the number of occurrences of technical terms. Some of the important tools using this information include the email expertise extraction (e3) system [8], ContactFinder [7] and MIT Expert Finder [9].

Although this approach is robust and useful, its major limitation is authentication of the information. Another issue is that, because of privacy concerns, complete information may not be available for mining. Finally, there is a lack of a standard subject terminology.

Methods based on social networking sites or contact management systems

A social network is a collection of like-minded people in close proximity to each other. This proximity can be physical or virtual. The main uses of social networks include helping people within the network and sharing knowledge in a particular area of specialization. A social network may consist of experts in a specific domain and other people who join the network to receive help from these experts. There are a number of social networking sites, e.g. Researchgate (http://www.researchgate.net/), Nature Network (http://network.nature.com/), vivoweb and a number of discussion forums. Experts have to enter information about their subject expertise, domains etc. However useful, a major limitation of these methods is adding and updating the information. Many of the developed systems, particularly discussion forums and knowledge directories, have become obsolete due to declining interest among experts. Experts may not be interested in subscribing to social sites or responding to queries. Further, ranking experts is difficult.

A common drawback of both approaches is that they are based on non-peer-reviewed information provided by the user and hence can be biased.

It is clear from the above discussion that correctly identifying subject experts is vital for the success of an organization. There is a strong need for an automated, unbiased expert finding system that is based on authentic information and can be easily updated.

Peer-reviewed articles can be an authentic source for expertise assessment. Identifying and evaluating expertise based on peer-reviewed articles can be unbiased and robust. However, the difficulty with using peer-reviewed articles is assigning standardized subjects to text. We addressed this issue by using the Medical Subject Headings (MeSH) terms associated with each article [10]. Using MeSH headings provides standardized subjects for querying. The latest release of the system can be used to search for experts from India in a particular subject and for the subjects associated with a particular expert.

Methods

Data Retrieval

PubMed is one of the largest repositories of peer-reviewed articles published worldwide. Publications originating from India were downloaded in XML format using a script developed in-house. The script used the Bio::Biblio module of BioPerl to interact with the PubMed database over the internet.
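The download script itself is available from the authors and is not reproduced here. As a rough illustration of the kind of query involved, the sketch below builds a search URL for the NCBI E-utilities `esearch` endpoint in Python, which is a different access route from the Bio::Biblio module used in this work. The search term, the open-ended date range and the helper name `build_india_query_url` are assumptions made for illustration and are not the query actually used.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (swapped in for Bio::Biblio in this sketch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_india_query_url(start_year=1980, end_year=3000, retmax=100):
    """Build an esearch URL for PubMed records with an Indian affiliation.

    The term is an assumed query: [AD] restricts the match to the
    affiliation field and [DP] to the date of publication. An end year
    of 3000 is a common way to leave the range open-ended.
    """
    term = f"India[AD] AND {start_year}:{end_year}[DP]"
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_india_query_url()
print(url)
```

The sketch only constructs the URL; fetching the result and then retrieving the full XML records would be separate steps.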

The MeSH subjects were downloaded from the MeSH browser in tree format.

Data pre-processing

Each XML record was parsed using the XML::Twig module, and the relevant fields were extracted into an intermediate text file in which each record begins with a ‘START’ tag and ends with an ‘END’ tag. A sample of this intermediate file is given in Box I.
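The flattening step can be sketched as follows. This is a minimal Python illustration using the standard `xml.etree.ElementTree` module rather than XML::Twig. The element names follow the PubMed XML format, but the sample record, the author names and the field labels (`PMID:`, `TITLE:`, `AUTHOR:`, `MESH:`) are hypothetical and are not taken from the actual intermediate file.

```python
import xml.etree.ElementTree as ET

# A made-up record that mimics the PubMed XML layout (illustrative only).
SAMPLE_XML = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>0000000</PMID>
    <Article>
      <ArticleTitle>An illustrative title</ArticleTitle>
      <AuthorList>
        <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
        <Author><LastName>Roe</LastName><Initials>RK</Initials></Author>
      </AuthorList>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName>Crystallography</DescriptorName></MeshHeading>
      <MeshHeading><DescriptorName>Proteins</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>
""".strip()


def record_to_text(xml_string):
    """Flatten one PubMed XML record into a START ... END text block."""
    citation = ET.fromstring(xml_string).find("MedlineCitation")
    lines = ["START"]
    lines.append("PMID: " + citation.findtext("PMID", default=""))
    lines.append("TITLE: " + citation.findtext("Article/ArticleTitle", default=""))
    for author in citation.iterfind("Article/AuthorList/Author"):
        last = author.findtext("LastName", default="")
        initials = author.findtext("Initials", default="")
        lines.append(f"AUTHOR: {last} {initials}".rstrip())
    for descriptor in citation.iterfind("MeshHeadingList/MeshHeading/DescriptorName"):
        lines.append("MESH: " + (descriptor.text or ""))
    lines.append("END")
    return "\n".join(lines)


flat = record_to_text(SAMPLE_XML)
print(flat)
```

A line-oriented format of this kind is simple to load into database tables with a second script, since each record is delimited explicitly.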

Database design

A database consisting of two tables was designed to store the information downloaded from PubMed and the MeSH tree. The structure of the table storing PubMed data is shown in Table 1.

To store the hierarchical MeSH data we followed the nested set model. The structure of the table is shown in Table 2.

The nested set is an efficient model for storing and searching through hierarchical data. The technique uses a method of storing metadata (left and right numbers) about the nodes (MeSH terms) contained in the tree in order to provide the SQL parser with information about how to “walk” the hierarchy of nodes. A critical aspect of the nested set model is that it alleviates the need for a recursive technique to find all children.

The following rules were used to calculate left and right numbers:

• For the root node in the hierarchy, the left side value is 1 and the right side value is 2*n, where n is the number of nodes in the tree.

• For all other nodes, the right side value is left side + (2*n) + 1, where n is the total number of descendant nodes (children at every level below the node). For leaf nodes (nodes without children), the right side value is therefore always equal to the left side value + 1.

• The left side value for any node is the next free number when the tree is walked counter-clockwise.
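These rules follow from a single depth-first walk of the tree in which a counter is incremented each time a node is entered and each time it is left. The Python sketch below illustrates this on a toy hierarchy. The function name `assign_nested_set` and the node labels are placeholders chosen for illustration, and the import scripts used in this work may be organized differently.

```python
def assign_nested_set(tree, root):
    """Return {node: (left, right)} for a tree given as {parent: [children]}."""
    bounds = {}
    counter = 1

    def visit(node):
        nonlocal counter
        left = counter            # next free number on the way down
        counter += 1
        for child in tree.get(node, []):
            visit(child)
        bounds[node] = (left, counter)  # next free number on the way back up
        counter += 1

    visit(root)
    return bounds


# Toy hierarchy; the labels are placeholders, not real MeSH tree positions.
toy_tree = {
    "Root": ["A", "B"],
    "A": ["A1", "A2"],
}

bounds = assign_nested_set(toy_tree, "Root")
for node, (left, right) in sorted(bounds.items(), key=lambda kv: kv[1][0]):
    print(node, left, right)
# Root 1 10
# A 2 7
# A1 3 4
# A2 5 6
# B 8 9
```

With five nodes the root receives right = 2*5 = 10, node A with two descendants receives right = 2 + (2*2) + 1 = 7, and every leaf has right = left + 1, in agreement with the rules above.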

The MeSH tree data was imported into MySQL using Perl scripts.
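Once left and right values are stored, all terms below a given term can be retrieved with a single range comparison and no recursion. The sketch below illustrates this in Python with an in-memory SQLite database standing in for MySQL. The column names follow Table 2, while the table name `mesh_tree`, the toy rows and the helper `descendants` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute(
    "CREATE TABLE mesh_tree ("
    " category_id INTEGER PRIMARY KEY,"
    " name TEXT,"
    " left_side INTEGER,"
    " right_side INTEGER)"
)
# Toy rows with nested set numbers; not real MeSH terms.
rows = [
    (1, "Root", 1, 10),
    (2, "A", 2, 7),
    (3, "A1", 3, 4),
    (4, "A2", 5, 6),
    (5, "B", 8, 9),
]
conn.executemany("INSERT INTO mesh_tree VALUES (?, ?, ?, ?)", rows)


def descendants(conn, name):
    """Return the names of all nodes below `name`, ordered by left_side."""
    query = (
        "SELECT child.name"
        " FROM mesh_tree AS parent"
        " JOIN mesh_tree AS child"
        "   ON child.left_side > parent.left_side"
        "  AND child.right_side < parent.right_side"
        " WHERE parent.name = ?"
        " ORDER BY child.left_side"
    )
    return [row[0] for row in conn.execute(query, (name,))]


print(descendants(conn, "A"))     # ['A1', 'A2']
print(descendants(conn, "Root"))  # ['A', 'A1', 'A2', 'B']
```

A node lies below another exactly when its (left, right) interval is strictly contained in the other's interval, which is why one self-join is enough.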

Web interface

A web interface was developed using Perl CGI for finding experts through a browser. The interface was linked with a content management system (Joomla: www.joomla.org) (Figure 1).

Results

Data

We downloaded 160,248 articles from PubMed that were published from India since 1980. A total of 54,091 MeSH terms were loaded into the MeSH database. There were 187,512 unique authors from 10,764 institutions in India.

Release 2.1 of our expert finding system allows users to find (a) experts in a particular subject and (b) subjects associated with a given expert.

Application: Finding experts in a given subject

To locate a subject expert, the user enters either a partial or the complete name of the subject in the ‘Subject’ text box and clicks submit (Figure 2). On submission, MeSH subjects similar to the query are displayed in a drop-down box (Figure 3). The user selects the most relevant subject from the MeSH drop-down and submits. A list of experts, ranked by the number of articles published in the selected subject, is displayed (Figure 4). The user can access a year-wise list of all affiliations of the author, all co-authors, MeSH subjects and all publications. The publications are linked to PubMed using the PMID.
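The ranking step amounts to counting, for each author, the distinct articles indexed with the selected MeSH term. A minimal Python sketch of this idea follows. The records, the author labels and the function name `rank_experts` are hypothetical, and the deployed system performs the equivalent counting in the database.

```python
from collections import Counter

# Hypothetical (pmid, author, mesh_term) rows, one per author-term pair.
records = [
    (1, "Author A", "Crystallography"),
    (2, "Author A", "Crystallography"),
    (3, "Author A", "Proteins"),
    (2, "Author B", "Crystallography"),
    (4, "Author C", "Proteins"),
]


def rank_experts(records, subject):
    """Rank authors by the number of distinct articles indexed with `subject`."""
    seen = {(pmid, author) for pmid, author, term in records if term == subject}
    counts = Counter(author for _, author in seen)
    # Sort by descending count, then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_experts(records, "Crystallography")
print(ranking)  # [('Author A', 2), ('Author B', 1)]
```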

Application: Finding subjects associated with an expert

To locate the subjects associated with an expert, the user enters either a partial or the complete name of the expert in the ‘Last name, First name or initials’ text box and clicks submit. On submission, names of experts similar to the query are displayed in a drop-down box. The user selects an expert from the drop-down and submits. A list of subjects, ranked by the number of articles published in each subject, is displayed. The user can access all publications of the expert and the publications in a given subject, and can also search for other experts in a given subject.

Limitations

The expert finding system developed using MeSH terms has been tested using known subject experts from India and was found to be satisfactory. For example, in the subject crystallography, Prof. M. Vijayan, Prof. T.P. Singh, Prof. A. Srinivasan and Prof. M.R. Murthy are ranked as the top experts. All of them are well-known experts in crystallography in India. The system was similarly tested successfully for other subjects.

However, the system has some limitations, some of which will be addressed in future releases. One major limitation of treating the MeSH terms associated with an article as the expertise of all its authors is that some authors may have a different expertise and may merely have been co-authors of the paper. For example, biological work often requires statistical assistance, so a statistical expert may become associated with biological subject headings. However, given the large number of articles from which the data is collected, the probability of an expert being associated with related subjects is higher than with unrelated subjects. In our experience, any subject in which an author has more than 20 publications can be considered as associated with that author.
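The 20-publication rule of thumb can be expressed as a simple filter over an author's subject counts. The Python sketch below is illustrative only. The author profile is invented, and the function name `associated_subjects` and the constant `MIN_PUBLICATIONS` are assumptions rather than part of the released system.

```python
from collections import Counter

MIN_PUBLICATIONS = 20  # rule of thumb from the text ("more than 20")


def associated_subjects(term_per_article, min_publications=MIN_PUBLICATIONS):
    """Return (subject, count) pairs whose count exceeds the threshold."""
    counts = Counter(term_per_article)
    kept = [(term, n) for term, n in counts.items() if n > min_publications]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical profile of one author: one entry per (article, MeSH term) pair.
profile = ["Crystallography"] * 35 + ["Proteins"] * 22 + ["Statistics"] * 3

print(associated_subjects(profile))
# [('Crystallography', 35), ('Proteins', 22)]
```

In this invented profile the three articles tagged with Statistics fall below the threshold, which is the intended effect for incidental co-authorship.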

Another limitation is that many experts publish in journals that are not indexed in PubMed; since we have used data extracted from PubMed, the ranking of subject experts may be incomplete. Some web services are available for assigning MeSH terms to articles [11]. Non-indexed journals will be integrated after verifying the credibility of the source.

The developed system is unique in that it uses a purely objective measure for identifying subject experts, uses peer-reviewed and reliable data, and allows ranking of experts. For developing countries, where resources for research are limited, developing such a system can improve the efficiency of research.

We are trying to improve the system by incorporating (i) graphical representation of the co-author network, (ii) analysis tools for the co-author network, (iii) clustering of affiliations to identify the most probable affiliation and (iv) geographical mapping of expertise.

Acknowledgement

The authors thank the Indian Council of Medical Research for providing financial assistance for the work.

Funding: The work was supported by funding under the project “Biomedical Informatics Centres of ICMR” funded by the Indian Council of Medical Research.

Bibliography

[1] M. T. Maybury, "Expert finding systems," MITRE Center for Integrated Intelligence Systems Bedford, Massachusetts, USA, 2006.

[2] D. Mimno and A. McCallum, "Expertise modeling for matching papers with reviewers," ACM, 2007, pp. 500-509.

[3] D. Yimam-Seid and A. Kobsa, "Expert-finding systems for organizations: Problem and domain analysis and the DEMOIR approach," Journal of Organizational Computing and Electronic Commerce, vol. 13, no. 1, pp. 1-24, 2003.

[4] X. Liu, W. B. Croft, and M. Koll, "Finding experts in community-based question-answering services,", 31 ed 2005, pp. 315-316.

[5] P. Serdyukov and D. Hiemstra, "Modeling documents as mixtures of persons for expert finding," Advances in Information Retrieval, pp. 309-320, 2008.

[6] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, "ArnetMiner: extraction and mining of academic social networks," ACM, 2008, pp. 990-998.

[7] B. Krulwich and C. Burkey, "The ContactFinder agent: Answering bulletin board questions with referrals," 1996, pp. 10-15.

[8] C. S. Campbell, P. P. Maglio, A. Cozzi, and B. Dom, "Expertise identification using email communications," ACM, 2003, pp. 528-531.

[9] A. S. Vivacqua, "Agents for expertise location," 1999, pp. 9-13.

[10] C. E. Lipscomb, "Medical subject headings (MeSH)," Bulletin of the Medical Library Association, vol. 88, no. 3, p. 265, 2000.

[11] D. Trieschnigg, P. Pezik, V. Lee, F. De Jong, W. Kraaij, and D. Rebholz-Schuhmann, "MeSH Up: effective MeSH text classification for improved document retrieval," Bioinformatics, vol. 25, no. 11, pp. 1412-1418, 2009.

+-------------+---------+------+-----+---------+-------+
| Field       | Type    | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| category_id | int(11) | NO   | PRI | 0       |       |
| name        | text    | YES  | MUL | NULL    |       |
| left_side   | int(11) | YES  |     | NULL    |       |
| right_side  | int(11) | YES  |     | NULL    |       |
+-------------+---------+------+-----+---------+-------+

Table 2: Structure of MeSH table for storing hierarchical data.

Figure 1: Joomla based interface of Expert Finding System

Figure 2: Entering complete or partial subject

Figure 3: Selecting list of MeSH subjects related to entered text

Figure 4: List of subject experts and their affiliations

Figure 5: Searching subjects associated with an Expert.