
Details from a distance? A Dutch pipeline for event detection

CLIN26, Amsterdam

Chantal van Son, Marieke van Erp, Antske Fokkens, Paul Huygen, Ruben Izquierdo Bevia & Piek Vossen

Close reading / distant reading

NewsReader & BiographyNet

Apply the detailed analyses typically associated with close (or at least non-distant) reading to large amounts of textual data

✤ NewsReader: (financial) news data
  ✤ Day-by-day processing of news; reconstructing a coherent story
  ✤ Four languages: English, Spanish, Italian, Dutch

✤ BiographyNet: biographical data
  ✤ Answering historical questions with digitized data from the Biography Portal of the Netherlands (www.biografischportaal.nl) using computational methods

http://www.newsreader-project.eu/
http://www.biographynet.nl/

Event extraction and representation


1. Intra-document processing: Dutch NLP pipeline generating NAF files
  ✤ The pipeline includes tokenization, parsing, SRL, NERC, etc.
  ✤ The NLP Annotation Format (NAF) is a multi-layered annotation format for representing linguistic annotations in complex NLP architectures

2. Cross-document processing: event coreference to convert NAF to RDF

3. Store the NAF and RDF in a KnowledgeStore
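The idea of a multi-layered format filled in by successive modules can be sketched as follows. This is a toy illustration only: the element names (`text`, `wf`, `terms`, `term`, `span`, `target`) follow NAF as commonly described, but the functions `tokenize` and `add_terms`, the whitespace tokenizer and the lowercased "lemma" are hypothetical stand-ins, not the actual pipeline modules.

```python
import xml.etree.ElementTree as ET


def tokenize(naf, text):
    """Add a <text> layer with one <wf> (word form) per whitespace token."""
    layer = ET.SubElement(naf, "text")
    offset = 0
    for i, token in enumerate(text.split(), start=1):
        offset = text.index(token, offset)  # character offset in the raw text
        wf = ET.SubElement(layer, "wf", id=f"w{i}",
                           offset=str(offset), length=str(len(token)))
        wf.text = token
        offset += len(token)
    return naf


def add_terms(naf):
    """Add a <terms> layer; each term points back to a word form by id."""
    layer = ET.SubElement(naf, "terms")
    for i, wf in enumerate(naf.find("text"), start=1):
        # Assumption: lowercasing stands in for real lemmatization.
        term = ET.SubElement(layer, "term", id=f"t{i}", lemma=wf.text.lower())
        span = ET.SubElement(term, "span")
        ET.SubElement(span, "target", id=wf.get("id"))
    return naf


# Each step reads the document and adds its own layer on top of earlier ones.
doc = ET.Element("NAF")
tokenize(doc, "Philips koopt bedrijf")
add_terms(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Because later layers refer to earlier ones only by id, a module can be swapped or rerun without touching the layers it does not own.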

Pipeline diagram, from Document to NLP Annotation Format (NAF), built up module by module:

✤ Tokenization (IXA-pipe)
✤ PoS tagging, lemmatization and parsing (Alpino)
✤ NERC (IXA-pipe)
✤ NED (DBpedia Spotlight)
✤ Time expressions (HeidelTime)
✤ WSD (Cornetto/ODWN)
✤ Predicate Matrix tagging
✤ SRL (SoNaR SRL)
✤ FrameNet labeling
✤ Opinions
✤ Event coreference

Legend: developed by CLTL; wrappers/reimplementations

Demo of the Dutch pipeline

http://kyoto.let.vu.nl/~huygen/test/test.php

Event extraction and representation

1. Intra-document processing: Dutch NLP pipeline generating NAF files

2. Cross-document processing: event coreference to convert NAF to RDF
  ✤ Uses GAF to formally distinguish between mentions (represented in NAF) and instances (represented in RDF-based SEM)
  ✤ Involves linking textual expressions that refer to the same entity or event

3. Store the NAF and RDF in a KnowledgeStore

Conclusion

✤ First full Dutch NLP pipeline providing rich information about events:

WHAT happened WHERE and WHEN, and WHO was involved?

✤ With the recent addition of Open Dutch WordNet (ODWN), all software is fully open source
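The step from mentions to instances, and the WHAT/WHERE/WHEN/WHO view of an event, can be sketched roughly as below. Everything here is an assumption made for illustration: the mention records are invented, the coreference rule (same lemma and same date) is a deliberately naive stand-in for the real event coreference module, and the property names `gaf:denotedBy`, `sem:hasActor`, `sem:hasPlace` and `sem:hasTime` are used as they are commonly given for GAF and SEM, without checking them against the pipeline's actual output.

```python
from collections import defaultdict

# Hypothetical event mentions, as they might be read from two NAF files.
mentions = [
    {"doc": "doc1", "span": "t3", "lemma": "overnemen",
     "who": "Philips", "where": "Amsterdam", "when": "2015-01-01"},
    {"doc": "doc2", "span": "t7", "lemma": "overnemen",
     "who": "Philips", "where": "Amsterdam", "when": "2015-01-01"},
    {"doc": "doc2", "span": "t12", "lemma": "benoemen",
     "who": "Philips", "where": "Eindhoven", "when": "2015-02-01"},
]


def corefer(mentions):
    """Naive stand-in for event coreference: same lemma and date, same event."""
    clusters = defaultdict(list)
    for m in mentions:
        clusters[(m["lemma"], m["when"])].append(m)
    return clusters


def to_triples(clusters):
    """Emit one event instance per cluster, linked to all of its mentions."""
    triples = []
    for n, (_, group) in enumerate(sorted(clusters.items()), start=1):
        event = f"ev{n}"
        triples.append((event, "rdf:type", "sem:Event"))
        for m in group:
            # The instance keeps a pointer to every mention that denotes it.
            triples.append((event, "gaf:denotedBy", f'{m["doc"]}#{m["span"]}'))
            for prop, field in (("sem:hasActor", "who"),
                                ("sem:hasPlace", "where"),
                                ("sem:hasTime", "when")):
                triple = (event, prop, m[field])
                if triple not in triples:  # state each fact once per instance
                    triples.append(triple)
    return triples


triples = to_triples(corefer(mentions))
for t in triples:
    print(t)
```

The point of the split is that participants, place and time are stated once per instance, while each instance can still be traced back to every mention it came from.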

References (1)

✤ IXA-pipe: Agerri, R., Bermudez, J., and Rigau, G. (2014). IXA pipeline: Efficient and ready to use multilingual NLP tools. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014), pages 26–31. http://ixa2.si.ehu.es/ixa-pipes/

✤ Alpino: Van der Beek, L., Bouma, G., Malouf, R., and Van Noord, G. (2002). The Alpino dependency treebank. Language and Computers, 45(1):8–22. http://www.let.rug.nl/~vannoord/alp/Alpino/

✤ HeidelTime: Strötgen, J., and Gertz, M. (2010). HeidelTime: High quality rule-based extraction and normalization of temporal expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 321–324. https://github.com/HeidelTime/heideltime

✤ SoNaR Semantic Role Labeler: De Clercq, O., Hoste, V., and Monachesi, P. (2012). Evaluating automatic cross-domain semantic role annotation. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012), pages 88–93.

References (2)

✤ Predicate Matrix: Lopez de Lacalle, M., Laparra, E., and Rigau, G. (2014). Predicate Matrix: extending SemLink through WordNet mappings. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014). http://adimen.si.ehu.es/web/PredicateMatrix

✤ Open Source Dutch WordNet: Postma, M. and Vossen, P. (2014). Open Source Dutch WordNet. Technical report, VU University Amsterdam. http://wordpress.let.vupr.nl/odwn/

✤ FrameNet labeling: Van Son, C. M. (2015). Towards a Dutch Frame-Semantic Parser. Master’s thesis, VU University Amsterdam.

✤ DBpedia Spotlight: Mendes, P. N., Jakob, M., García-Silva, A., and Bizer, C. (2011). DBpedia Spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8. https://github.com/dbpedia-spotlight

References (3)

✤ Software presented today: https://github.com/cltl/ http://www.newsreader-project.eu/results/software/

www.cltl.nl