Article Reference
Total Page:16
File Type:pdf, Size:1020Kb
Article Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines BABIC, Zeljana, et al. Abstract The use of misidentified and contaminated cell lines continues to be a problem in biomedical research. Research Resource Identifiers (RRIDs) should reduce the prevalence of misidentified and contaminated cell lines in the literature by alerting researchers to cell lines that are on the list of problematic cell lines, which is maintained by the International Cell Line Authentication Committee (ICLAC) and the Cellosaurus database. To test this assertion, we text-mined the methods sections of about two million papers in PubMed Central, identifying 305,161 unique cell-line names in 150,459 articles. We estimate that 8.6% of these cell lines were on the list of problematic cell lines, whereas only 3.3% of the cell lines in the 634 papers that included RRIDs were on the problematic list. This suggests that the use of RRIDs is associated with a lower reported use of problematic cell lines. Reference BABIC, Zeljana, et al. Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines. eLife, 2019, vol. 8, p. e41676 DOI : 10.7554/eLife.41676 PMID : 30693867 Available at: http://archive-ouverte.unige.ch/unige:119832 Disclaimer: layout of this document may differ from the published version. 1 / 1 FEATURE ARTICLE META-RESEARCH Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines Abstract The use of misidentified and contaminated cell lines continues to be a problem in biomedical research. Research Resource Identifiers (RRIDs) should reduce the prevalence of misidentified and contaminated cell lines in the literature by alerting researchers to cell lines that are on the list of problematic cell lines, which is maintained by the International Cell Line Authentication Committee (ICLAC) and the Cellosaurus database. To test this assertion, we text-mined the methods sections of about two million papers in PubMed Central, identifying 305,161 unique cell-line names in 150,459 articles. We estimate that 8.6% of these cell lines were on the list of problematic cell lines, whereas only 3.3% of the cell lines in the 634 papers that included RRIDs were on the problematic list. This suggests that the use of RRIDs is associated with a lower reported use of problematic cell lines. DOI: https://doi.org/10.7554/eLife.41676.001 ZELJANA BABIC, AMANDA CAPES-DAVIS, MARYANN E MARTONE, AMOS BAIROCH, I BURAK OZYURT, THOMAS H GILLESPIE AND ANITA E BANDROWSKI* Introduction Authentication Committee (ICLAC) to create a Cell lines are widely used in the biological scien- register of contaminated and misidentified cell ces, partly because they are able to multiply lines, increase awareness of the problem, and indefinitely. This property means that scientists propose approaches to decrease the use of such can, in theory, exactly replicate previous studies cell lines (Masters, 2012). ICLAC and other and then build on these results (American Type expert groups (such as the American Type Cul- Culture Collection Standards Development ture Collections: ATCC) propose that one of the Organization Workgroup ASN-0002, most practical ways to detect contamination is 2010). However, mislabeling or mishandling can to make use of a DNA-based technique called *For correspondence: result in misidentification, contamination or dis- short-tandem-repeat profiling to authenticate [email protected] tribution of problematic cell lines which, in turn, human cell lines (American Type Culture Collec- Competing interest: See can affect the validity of research data. Despite tion Standards Development Organization page 13 this, testing for contamination and misidentifica- Workgroup ASN-0002, 2010). Additionally, the Funding: See page 13 tion is commonly not performed and, therefore, Cellosaurus database, housed at the Swiss Insti- a laboratory could obtain contaminated cell lines tute of Bioinformatics, contains extensive infor- Reviewing editor: Helga Groll, and spend significant resources pursuing mation on more than 100,000 cell lines eLife, United Kingdom research based on faulty premises (Bairoch, 2018), including information about cell Copyright Babic et al. This (Drexler et al., 2017). A particular problem is line misidentification and other problems (such article is distributed under the the contamination of a cell line by a cancer cell as incorrect tissue origin, which is not detected terms of the Creative Commons line such as HeLa (Capes-Davis et al., 2010) by short-tandem-repeat profiling). Attribution License, which permits unrestricted use and because cancer cells tend to grow more rapidly Despite the availability of the ICLAC register, redistribution provided that the than normal cells. the overall rate of misidentified cell line use has original author and source are In 2012, a group of scientists established an not fallen across the literature (Horbach and credited. organization called the International Cell Line Halffman, 2017). It has been shown that Babic et al. eLife 2019;8:e41676. DOI: https://doi.org/10.7554/eLife.41676 1 of 14 Feature article Meta-Research Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines publishers checking the manuscript during sub- also includes cell lines that have been labeled mission for misidentified cell lines can eliminate with the wrong type of cancer, but these may their use, although such a process requires sig- still be safely used if the researchers know the nificant time and cost (Fusenig et al., 2017). We true identity of the line. We must, therefore, believe that the continued use of misidentified exercise caution when interpreting these results. cell lines in published studies is due, in part, to More information about the problematic list researchers not checking the lists of misidenti- please is given in the discussion section and fied cell lines consistently. Supplementary file 2. Research Resource Identifiers (RRIDs) are unique identifiers that can be included in the Text mining corpus methods section of a research paper to define To derive a dataset of reported cell lines from the cell line, antibody, transgenic organism or the general literature, we used text mining of software used (Bandrowski et al., 2016; the PubMed Central open access subset. This Bandrowski and Martone, 2016). These identi- task requires the use of natural language proc- fiers were introduced largely because antibodies essing, as cell-line names are not unique strings. are a known source of variation in experiments, For example, looking for a set of characters, yet researchers did not include enough meta- such as ‘H2’, may reveal papers where H2 data to identify which antibody they used denotes a cell line, gene, protein, antibody or a (Vasilevsky et al., 2013). RRIDs are slowly figure. We used the SciScore tool, a ‘Named becoming more common in the literature, espe- Entity Recognition-based algorithm’ specialized cially in some disciplines such as neuroscience. for scientific resources, which can in principle They are issued by the naming authority for a recognize the H2 cell line and can ignore its particular type of resource (for example, stock mentions referring to something other than a centers and model-organism databases issue cell line. RRIDs for organisms; Cellosaurus issues RRIDs The algorithm was taught to consider fea- for cell lines; the antibody registry issues RRIDs tures around the recognized term that are asso- for antibodies; and the SciCrunch registry issues ciated with cell lines. SciScore was trained using RRIDs for software and other digital resources). 1,457 annotations made by a curator as well as a RRIDs for cell lines were first incorporated complete list of cell lines and synonyms from into the RRID portal in 2016, and journals often Cellosaurus as a set of seed data. We split the require researchers to look up each antibody, data into 90% training, and 10% testing of ran- animal or cell line on this website (which is syn- dom sets ten times, and obtained an average chronized with the Cellosaurus database, and precision of 87.3% (+/- sd 4.65%), recall 61.9% therefore contains information about cell line (+/-sd 7.01%) with an overall F1 mean of 72.2% contamination or misidentification). This require- (+/- sd 5.39%). ment creates a natural experiment to address This algorithm was then deployed on the ~2 the question: if researchers are alerted about million articles in the open access subset of the misidentification of a cell line before they PubMed Central. It recognized 305,161 unique publish, will they still report data from such a cell-line names from a total of 150,459 unique cell line? articles (Figure 1), which we treated as our pop- ulation of cell lines that was considered for fur- Results ther analysis. These data were then provided to In this study, we used text mining to identify Cellosaurus curators, allowing them to create papers that included RRIDs and papers that entries for 22 cell lines that had been so far over- listed cell lines, and then compared the preva- looked, and to add 18 synonyms and eight mis- lence of misidentified cell lines in these two sam- spellings for existing entries, making Cellosaurus ples. It should be stressed that the use of cell a more complete resource. lines on the problematic list does not automati- As cell-line names are not standardized in cally mean that a given cell line is being most papers, we used three different employed improperly. For example, the prob- approaches to estimate the percentage of prob- lematic list includes cases where a cell line is lematic cell lines in the general literature. First, ‘partially contaminated’, which does not affect to obtain a lower boundary, we determined cells purchased from, say, a stock center. The list whether the cell line name, as stated by the Babic et al.