Event extraction on PubMed scale
Filip Ginter1*, Jari Björne1,2, Sampo Pyysalo3
From Workshop on Advances in Bio Text Mining, Ghent, Belgium

Ginter et al. BMC Bioinformatics 2010, 11(Suppl 5):O2
http://www.biomedcentral.com/1471-2105/11/S5/O2

ORAL PRESENTATION, Open Access
Workshop dates: 10-11 May 2010

There has been a growing interest in typed, recursively nested events as the target for information extraction in the biomedical domain. The BioNLP’09 Shared Task on Event Extraction [1] provided a standard definition of events and established the current state-of-the-art in event extraction through competitive evaluation on a standard dataset derived from the GENIA event corpus. We have previously established the scalability of event extraction to large corpora [2] and here we present a follow-up study in which event extraction is performed from the titles and abstracts of all 17.8M citations in the 2009 release of PubMed. The extraction pipeline is composed of state-of-the-art methods: the BANNER named entity recognizer [3], the McClosky-Charniak domain-adapted parser [4], and the Turku Event Extraction System [5], the winning entry of the Shared Task.

The resulting dataset consists of over 19.2M instances of 4.5M unique events, of which 2.1M instances of 1.6M unique events recursively involve at least two different named entities. This dataset is several orders of magnitude larger than any previous event extraction effort and, having been obtained by a demonstrably state-of-the-art pipeline, represents the most accurate event extraction output achievable with presently available tools. Compiling the dataset was a technically challenging undertaking and required roughly 8,300 CPU-hours.

As the primary contribution of the study, we make the entire set of extracted events freely available at http://bionlp.utu.fi, together with the output of the individual stages of the pipeline, such as 36.5M named entity instances and syntactic analyses for all 20M sentences containing at least one named entity. This resource will facilitate future research related to biological event networks by providing a standard, publicly available, large-scale dataset, avoiding the unnecessary duplication of efforts in executing the complex event extraction pipeline.

Author details
1Department of Information Technology, University of Turku, Finland.
2Turku Centre for Computer Science (TUCS), Finland.
3Department of Computer Science, University of Tokyo, Japan.
* Correspondence: [email protected]

Published: 6 October 2010

References
1. Kim JD, Ohta T, Pyysalo S, Kano Y, Tsujii J: Overview of BioNLP’09 Shared Task on Event Extraction. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task ACL 2009, 1-9.
2. Björne J, Ginter F, Pyysalo S, Tsujii J, Salakoski T: Complex Event Extraction at PubMed Scale. Proceedings of ISMB’10 2010, 26(12):i382-i390.
3. Leaman R, Gonzalez G: BANNER: an executable survey of advances in biomedical named entity recognition. Proceedings of Pacific Symposium on Biocomputing 2008, 652-663.
4. McClosky D: Any Domain Parsing: Automatic Domain Adaptation for Natural Language Parsing. PhD thesis, Department of Computer Science, Brown University 2009.
5. Björne J, Heimonen J, Ginter F, Airola A, Pahikkala T, Salakoski T: Extracting Contextualized Complex Biological Events with Rich Graph-Based Feature Sets. Computational Intelligence 2010.

doi:10.1186/1471-2105-11-S5-O2
Cite this article as: Ginter et al.: Event extraction on PubMed scale. BMC Bioinformatics 2010 11(Suppl 5):O2.

© 2010 Ginter et al; licensee BioMed Central Ltd.
