Details from a distance? A Dutch pipeline for event detection
CLIN26 Amsterdam
Chantal van Son, Marieke van Erp, Antske Fokkens, Paul Huygen, Ruben Izquierdo Bevia & Piek Vossen

CLOSE READING DISTANT READING

NewsReader & BiographyNet
Apply the detailed analyses typically associated with close (or at least non-distant) reading to large amounts of textual data
✤ NewsReader: (financial) news data
  ✤ Day-by-day processing of news; reconstructing a coherent story
  ✤ Four languages: English, Spanish, Italian, Dutch
✤ BiographyNet: biographical data
  ✤ Answering historical questions with digitalized data from Biography Portal of the Netherlands (www.biografischportaal.nl) using computational methods
http://www.newsreader-project.eu/
http://www.biographynet.nl/

Event extraction and representation
1. Intra-document processing: Dutch NLP pipeline generating NAF files
  ✤ pipeline includes tokenization, parsing, SRL, NERC, etc.
  ✤ NLP Annotation Format (NAF) is a multi-layered annotation format for representing linguistic annotations in complex NLP architectures
2. Cross-document processing: event coreference to convert NAF to RDF representation
3. Store the NAF and RDF in a KnowledgeStore

Pipeline diagram (built up across several slides), from Document to NLP Annotation Format (NAF):
✤ Tokenization (IXA-pipe)
✤ PoS tagging, lemmatization and parsing (Alpino)
✤ NERC (IXA-pipe)
✤ NED (DBpedia Spotlight)
✤ Time expressions (HeidelTime)
✤ Opinions
✤ WSD (Cornetto/ODWN)
✤ Predicate Matrix tagging
✤ SRL (SoNaR SRL)
✤ FrameNet labeling
✤ Event coreference
Legend: Developed by CLTL; Wrappers/reimplementations
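To make the "multi-layered" idea behind NAF concrete, the sketch below reads a tiny, hand-written NAF-style fragment with Python's standard XML parser. The fragment is an assumption for illustration only: it is modelled loosely on NAF's token and term layers, it is not output of the pipeline, and real NAF files carry more attributes and many more layers. The helper name `read_layers` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written NAF-style fragment (illustrative assumption,
# not real pipeline output). The term layer points back at the token
# layer through <target id="..."/> references.
NAF_SAMPLE = """<NAF xml:lang="nl" version="v3">
  <text>
    <wf id="w1" sent="1">Philips</wf>
    <wf id="w2" sent="1">kocht</wf>
    <wf id="w3" sent="1">het</wf>
    <wf id="w4" sent="1">bedrijf</wf>
  </text>
  <terms>
    <term id="t1" lemma="Philips" pos="name"><span><target id="w1"/></span></term>
    <term id="t2" lemma="kopen" pos="verb"><span><target id="w2"/></span></term>
    <term id="t3" lemma="het" pos="det"><span><target id="w3"/></span></term>
    <term id="t4" lemma="bedrijf" pos="noun"><span><target id="w4"/></span></term>
  </terms>
</NAF>"""


def read_layers(naf_xml):
    """Return the token layer and the term layer resolved to surface text."""
    root = ET.fromstring(naf_xml)
    # Token layer: word-form id -> surface string.
    tokens = {wf.get("id"): wf.text for wf in root.find("text")}
    terms = []
    for term in root.find("terms"):
        # Each term spans one or more tokens, referenced by id.
        targets = [target.get("id") for target in term.find("span")]
        surface = " ".join(tokens[t] for t in targets)
        terms.append((term.get("id"), term.get("lemma"), term.get("pos"), surface))
    return tokens, terms


tokens, terms = read_layers(NAF_SAMPLE)
for term_id, lemma, pos, surface in terms:
    print(term_id, lemma, pos, surface)
```

Because each layer only refers to ids in a lower layer, a later module (for example NERC or SRL) can add its own layer without rewriting what earlier modules produced.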
Demo of the Dutch pipeline
http://kyoto.let.vu.nl/~huygen/test/test.php

Event extraction and representation
1. Intra-document processing: Dutch NLP pipeline generating NAF files
2. Cross-document processing: event coreference to convert NAF to RDF
  ✤ uses GAF to formally distinguish between mentions (represented in NAF) and instances (represented in RDF-based SEM)
  ✤ involves document clustering and linking textual expressions that refer to the same entity or event
3. Store the NAF and RDF in a KnowledgeStore

Conclusion
✤ First full Dutch NLP pipeline providing rich information about events: WHAT happened, WHERE and WHEN, and WHO was involved?
✤ With the recent addition of Open Dutch WordNet (ODWN), all software is fully open source

References (1)
✤ IXA-pipe: Agerri, R., Bermudez, J., and Rigau, G. (2014). IXA pipeline: Efficient and ready to use multilingual NLP tools. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014), pages 26–31. http://ixa2.si.ehu.es/ixa-pipes/
✤ Alpino: Van der Beek, L., Bouma, G., Malouf, R., and Van Noord, G. (2002). The Alpino dependency treebank. Language and Computers, 45(1):8–22. http://www.let.rug.nl/vannoord/alp/Alpino/
✤ HeidelTime: Strötgen, J. and Gertz, M. (2010). HeidelTime: High quality rule-based extraction and normalization of temporal expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 321–324. https://github.com/HeidelTime/heideltime
✤ SoNaR Semantic Role Labeler: De Clercq, O., Hoste, V., and Monachesi, P. (2012). Evaluating automatic cross-domain semantic role annotation. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012), pages 88–93.

References (2)
✤ Predicate Matrix: Lopez de Lacalle, M., Laparra, E., and Rigau, G. (2014b). Predicate Matrix: extending SemLink through WordNet mappings. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014).
http://adimen.si.ehu.es/web/PredicateMatrix
✤ Open Source Dutch WordNet: Postma, M. and Vossen, P. (2014). Open Source Dutch WordNet. Technical report, VU University Amsterdam. http://wordpress.let.vupr.nl/odwn/
✤ FrameNet labeling: Van Son, C.M. (2015). Towards a Dutch Frame-Semantic Parser. Master's thesis, VU University Amsterdam.
✤ DBpedia Spotlight: Mendes, P. N., Jakob, M., García-Silva, A., and Bizer, C. (2011). DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8. https://github.com/dbpedia-spotlight

References (3)
✤ Software presented today: https://github.com/cltl/ http://www.newsreader-project.eu/results/software/ www.cltl.nl
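
The distinction between mentions and instances made in the cross-document processing step can be illustrated with a toy sketch. Everything here is an assumption for illustration: the `Mention` fields, the document and term ids, and above all the grouping rule (same predicate lemma on the same date), which is a deliberately naive stand-in and not the event coreference method used in the pipeline.

```python
from collections import defaultdict
from typing import NamedTuple


class Mention(NamedTuple):
    """One textual mention of an event, as it might be located in a NAF file."""
    doc_id: str   # hypothetical document name
    term_id: str  # hypothetical term id inside that document
    lemma: str    # predicate lemma
    date: str     # normalized date attached to the mention


def group_mentions(mentions):
    """Group mentions into instances with a naive key: (lemma, date)."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[(mention.lemma, mention.date)].append(mention)
    # Give every group a stable, made-up instance identifier.
    return {f"ev{n}": groups[key] for n, key in enumerate(sorted(groups), start=1)}


mentions = [
    Mention("doc1.naf", "t2", "kopen", "2015-12-18"),
    Mention("doc2.naf", "t7", "kopen", "2015-12-18"),
    Mention("doc2.naf", "t15", "verkopen", "2015-12-18"),
]
instances = group_mentions(mentions)
for instance_id, group in instances.items():
    print(instance_id, [(m.doc_id, m.term_id) for m in group])
```

Each instance keeps pointers back to its mentions, so a statement about the event can always be traced to the places in the text that support it.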