Improving search efficiency for economic evaluations in major databases using semantic technology

Julie Glanville(1), Bill Porter(2), Pamela Negosanti(2), Carol Lefebvre(3)

(1) York Health Economics Consortium, University of York, York, YO10 5NH, United Kingdom. Email: [email protected]
(2) Expert System SpA, Modena, Italy. www.expertsystem.net. Email: [email protected]
(3) UK Cochrane Centre, National Institute for Health Research, Oxford, United Kingdom. Email: [email protected]

Objective

Many technology appraisals require evidence from economic studies, and in particular from economic evaluations such as cost-benefit, cost-effectiveness or cost-utility studies. Identifying economic evaluations efficiently in major databases is problematic because it is difficult to find terms that distinguish economic evaluations from other studies, and in particular from other economic studies that are not economic evaluations. Available search filters are sensitive but have low precision, which means that many irrelevant records have to be sifted manually to identify the few relevant records.(1) Semantic technology software automatically interprets the meaning of text written in natural language. This research explores whether semantic technology post-processing software, COGITO® Studio Discover, can help to improve search efficiency for difficult-to-identify study designs such as economic evaluations.

Methods

We identified a gold standard set of known economic evaluation records from the NHS EED database published in 3 years (2000, 2003 and 2006) and retrieved their matching MEDLINE records (2). The records consisted of cost-benefit studies, cost-effectiveness studies and cost-utility studies as shown in Table 1.

We also identified from MEDLINE, for the same years, a comparison set of records which were not economic evaluations but which contained economic text words such as "cost" (Table 1).

The records were imported into the semantic environment tool COGITO® Studio Discover in XML format. The XML structure allowed COGITO to use the different fields of the records to inform the rules of the filter processing, so that different rules could be applied to different fields, for example the title and the abstract. The advantage of semantic rules lies in two main areas: managing synonyms and managing polysemy. Managing synonyms means that, by drawing on a semantic network, the system recognizes concepts according to their meaning rather than the way they are written. Managing polysemes and homonyms means that the algorithm uses context information to identify the correct meaning of a word that could have several meanings.
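The idea of applying different rules to different fields can be illustrated with a minimal sketch. It is not COGITO's rule language or API, which are not described here: the XML layout, the `SYNONYMS` groups and the `FIELD_RULES` mapping are all hypothetical, and regular expressions stand in very crudely for a semantic network. The sketch covers synonym handling only and makes no attempt at polysemy.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical record layout; the real XML schema is not given in the text.
RECORD_XML = """
<record>
  <title>Cost-effectiveness of statin therapy in primary prevention</title>
  <abstract>We compared costs and quality-adjusted life years for two strategies.</abstract>
</record>
"""

# Hypothetical synonym groups: several surface forms map to one concept.
SYNONYMS = {
    "economic_evaluation": [
        r"cost[- ]effectiveness",
        r"cost[- ]utility",
        r"cost[- ]benefit",
    ],
    "outcome_measure": [
        r"quality[- ]adjusted life years?",
        r"\bQALYs?\b",
    ],
}

# Hypothetical field-specific rules: which concepts to look for in which field.
FIELD_RULES = {
    "title": ["economic_evaluation"],
    "abstract": ["economic_evaluation", "outcome_measure"],
}


def concepts_in(text, concept_names):
    """Return the subset of concept_names with at least one synonym in text."""
    found = set()
    for name in concept_names:
        if any(re.search(p, text, flags=re.IGNORECASE) for p in SYNONYMS[name]):
            found.add(name)
    return found


def concepts_by_field(record_xml):
    """Apply each field's own rule list to that field of one XML record."""
    root = ET.fromstring(record_xml.strip())
    return {
        field: concepts_in(root.findtext(field, default=""), names)
        for field, names in FIELD_RULES.items()
    }


result = concepts_by_field(RECORD_XML)
print(result)
```

For the sample record, the title matches the economic evaluation concept and the abstract matches only the outcome measure concept, so a rule could weight a title match differently from an abstract match.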

Within the COGITO environment the two sets of records were each divided randomly into two subsets. One subset of each formed the training sets of records (referred to here as the test sets) and the second formed the validation sets, on which the performance of the semantic rules could be validated. Using the test subsets of the gold standard and comparator records, we trained the COGITO semantic technology software to distinguish economic evaluations from records which contained economic text words but which were not economic evaluations. We created semantic rules and tested the precision and sensitivity of those rules in identifying economic evaluations accurately in the test subset of the gold standard. We then tested how well the software distinguished economic evaluations from comparator records which were not economic evaluations in the validation sets of records.
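A random division of this kind can be sketched as follows. The randomization procedure actually used inside COGITO is not described, so the fixed seed, the helper name `split_in_half` and the placeholder record identifiers are assumptions made for illustration; only the set sizes (1,950 gold standard and 4,136 comparator records) come from Table 1.

```python
import random


def split_in_half(record_ids, seed=0):
    """Shuffle a copy of the ids and cut it at the midpoint.

    The seed is an assumption that makes the sketch reproducible.
    """
    shuffled = list(record_ids)
    random.Random(seed).shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


# Placeholder identifiers; real records would carry MEDLINE identifiers.
gold_ids = [f"GS{i}" for i in range(1950)]
comparator_ids = [f"C{i}" for i in range(4136)]

gold_test, gold_validation = split_in_half(gold_ids)
comp_test, comp_validation = split_in_half(comparator_ids)

print(len(gold_test), len(gold_validation), len(comp_test), len(comp_validation))
```

Each half of the gold standard holds 975 records and each half of the comparator set holds 2,068, which matches the set sizes reported in Table 2.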

Results

The training process yielded a sensitivity of 100% and a precision of 82.77% in the test set (Table 2). When the rules were applied to the validation set, sensitivity was maintained at 100% and precision fell to 71.69%. This represented a Number Needed to Read (the reciprocal of precision) of 1.21 in the test set and 1.39 in the validation set.
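These figures can be reproduced from the counts in Table 2. The sketch below uses the definitions of sensitivity and precision given in that table and takes Number Needed to Read as the reciprocal of precision; the function and variable names are illustrative only.

```python
def sensitivity(gs_retrieved, gs_total):
    """Gold standard records retrieved / gold standard records in the set."""
    return gs_retrieved / gs_total


def precision(gs_retrieved, comparator_retrieved):
    """Gold standard records retrieved / all records retrieved."""
    return gs_retrieved / (gs_retrieved + comparator_retrieved)


def number_needed_to_read(prec):
    """Records that must be read to find one relevant record."""
    return 1 / prec


# (gold standard total, gold standard retrieved, comparator retrieved),
# taken from Table 2.
TABLE_2 = {
    "test": (975, 975, 203),
    "validation": (975, 975, 385),
}

for name, (gs_total, gs_ret, comp_ret) in TABLE_2.items():
    sens = sensitivity(gs_ret, gs_total)
    prec = precision(gs_ret, comp_ret)
    nnr = number_needed_to_read(prec)
    print(f"{name}: sensitivity {sens:.2%}, precision {prec:.2%}, NNR {nnr:.2f}")
```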

Discussion

In a recent assessment of the performance of economic evaluation search filters in finding these same gold standard records the best performing MEDLINE filters in terms of sensitivity also achieved 100% sensitivity. However, the most sensitive filters had very low precision at 2%.(1,3) This represented a Number Needed to Read of 50. The highest precision achieved in the assessment was 26% but that strategy (designed as part of the project) had only 72% sensitivity. The current research indicates that with this technology it is possible to achieve much higher precision with no sacrifice in sensitivity.

The value of the approach used in this study is that it identifies relevance from the meaningful co-presence of key terms within records, as opposed to their simple co-presence. This is difficult to achieve in the current interfaces to bibliographic databases such as MEDLINE because of the strictures of the Boolean approach and the stepwise nature of searching using set combination. The issue of how best to leverage the benefits of COGITO in conjunction with databases such as MEDLINE needs to be explored. At present, using COGITO is likely to be a two-step process: running a sensitive search of the database, then loading the results into COGITO and running the semantic rules against the result set. The benefits of a one-step process, in which database records are loaded into COGITO and queried directly, also need to be explored.
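The difference between simple and meaningful co-presence can be illustrated with a deliberately crude sketch. Requiring the terms to occur in the same sentence is used here only as a stand-in for semantic analysis and is not how COGITO works; the function names, the term pair and the two example texts are all invented for illustration.

```python
import re


def boolean_copresence(text, terms):
    """Boolean AND: every term occurs somewhere in the record."""
    lowered = text.lower()
    return all(term in lowered for term in terms)


def sentence_copresence(text, terms):
    """Stricter check: every term occurs within a single sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.lower())
    return any(all(term in s for term in terms) for s in sentences)


TERMS = ("cost", "effectiveness")

# Invented example texts.
evaluation = "We assessed the cost and effectiveness of screening versus no screening."
not_evaluation = "Effectiveness was measured at 12 months. The cost of drugs is rising."

for label, text in [("evaluation", evaluation), ("not_evaluation", not_evaluation)]:
    print(label, boolean_copresence(text, TERMS), sentence_copresence(text, TERMS))
```

Both texts satisfy the Boolean AND, but only the first has the two terms in the same sentence. A post-processing step of this general kind could therefore discard some records that a sensitive Boolean search retrieves.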

Conclusions

Initial exploration has shown that it is feasible to develop semantic rules to identify economic evaluation records efficiently from among a mix of economic evaluations and records which were not economic evaluations. Semantic technology, used for post-processing the results of sensitive searches, may provide a helpful solution to the current challenges of identifying difficult-to-distinguish study designs such as economic evaluations, observational studies, quality of life studies, patient preference studies and diagnostic test accuracy studies. It also seems to be a promising technology to explore for improving the precision of searches for economic evaluations among records obtained from EMBASE, where extensive indexing can at present impede efficient retrieval. If COGITO rules can be developed to retrieve hard-to-distinguish studies efficiently and accurately, there may be real benefits for health technology assessment in terms of reducing the resources required to scan records for relevance.

Table 1. Numbers of economic evaluation records and comparator records identified from MEDLINE.

Year published    NHS EED records with matching MEDLINE      Records in MEDLINE
                  records (MEDLINE gold standard)            comparator set
2000              577                                        1,226
2003              618                                        1,335
2006              755                                        1,575
Total             1,950                                      4,136

Table 2. Performance of semantic rules in identifying gold standard records in test and validation sets of records.

                                                  Test set                  Validation set
                                                  (GS=975)                  (GS=975)
                                                  (Comparator set 2068)     (Comparator set 2068)
Gold standard records retrieved                   975                       975
Comparator records retrieved                      203                       385
Sensitivity (number GS retrieved/number of GS)    100%                      100%
Precision (number of GS retrieved/number of       82.77%                    71.69%
records retrieved)

References

(1) Glanville J, Fleetwood K, Yellowlees A, Kaunelis D, Mensinkai S. Development and Testing of Search Filters to Identify Economic Evaluations in MEDLINE and EMBASE. Ottawa: Canadian Agency for Drugs and Technologies in Health; 2009.
(2) Centre for Reviews and Dissemination. NHS Economic Evaluation Database [database online]. York: Centre for Reviews and Dissemination; 2010. Available from: http://www.crd.york.ac.uk/crdweb/
(3) Glanville J, Kaunelis D, Mensinkai S. How well do search filters perform in identifying economic evaluations in MEDLINE and EMBASE. International Journal of Technology Assessment in Health Care 2009;25:522-529.